Date of Award

2020

Degree Type

Thesis

Degree Name

Master of Science (MS)

Department

Biology

Abstract

Popular transcriptome imputation methods such as PrediXcan and FUSION rely on parametric linear assumptions and therefore cannot flexibly model the complex genetic architecture of the transcriptome. Although non-linear modeling has been shown to improve imputation performance, its replicability and potential cross-population differences have not been adequately studied. Therefore, to optimize imputation performance across global populations, we used the non-linear machine learning (ML) models random forest (RF), support vector regression (SVR), and K-nearest neighbors (KNN) to build transcriptome imputation models, and evaluated their performance against elastic net (EN). We trained gene expression prediction models using genotype and blood monocyte transcriptome data from the Multi-Ethnic Study of Atherosclerosis (MESA), comprising individuals of African, Hispanic, and European ancestries, and tested them using genotype and whole blood transcriptome data from the Modeling the Epidemiology Transition Study (METS), comprising individuals of African ancestries. We show that prediction performance is highest when the training and testing populations share similar ancestries, regardless of the prediction algorithm used. While EN generally outperformed RF, SVR, and KNN, we found that RF outperforms EN for some genes, particularly between disparate ancestries, suggesting potential robustness and reduced variability of RF imputation performance across global populations. When applied to a high-density lipoprotein (HDL) phenotype, we show that including RF prediction models in PrediXcan reveals potential gene associations missed by EN models. Therefore, by integrating non-linear modeling into PrediXcan and diversifying our training populations to include more global ancestries, we may uncover new genes associated with complex traits. We did not find any significant associations when the prediction models were applied to obesity status and microbiome diversity.
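
The contrast the abstract draws between linear and non-linear expression prediction can be illustrated with a small sketch. This is not the thesis pipeline: it uses simulated data, a single hypothetical SNP with an assumed overdominant (non-additive) effect, ordinary least squares standing in for elastic net, and a hand-written K-nearest-neighbors regressor as the only non-linear model. All function names and parameter values are illustrative assumptions.

```python
import random
import statistics


def simulate(n, rng):
    """Simulate (dosage, expression) pairs for one hypothetical SNP."""
    data = []
    for _ in range(n):
        dosage = rng.choice([0, 1, 2])
        # Assumed overdominant effect: only heterozygotes have raised expression.
        expression = (1.0 if dosage == 1 else 0.0) + rng.gauss(0.0, 0.3)
        data.append((dosage, expression))
    return data


def fit_linear(train):
    """Ordinary least squares on dosage (a stand-in for a linear model like EN)."""
    xs = [d for d, _ in train]
    ys = [e for _, e in train]
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x


def fit_knn(train, k=15):
    """K-nearest-neighbors regression: average expression of the k closest dosages."""
    def predict(x):
        nearest = sorted(train, key=lambda pair: abs(pair[0] - x))[:k]
        return statistics.fmean([e for _, e in nearest])
    return predict


def r_squared(model, test):
    """Coefficient of determination on held-out samples."""
    ys = [e for _, e in test]
    mean_y = statistics.fmean(ys)
    ss_res = sum((e - model(d)) ** 2 for d, e in test)
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot


rng = random.Random(0)
train, test = simulate(300, rng), simulate(100, rng)
linear_r2 = r_squared(fit_linear(train), test)
knn_r2 = r_squared(fit_knn(train), test)
print(f"linear R^2 = {linear_r2:.2f}, KNN R^2 = {knn_r2:.2f}")
```

Under this assumed effect the additive linear fit explains little held-out variance while the non-linear model recovers most of it. For genes with purely additive architecture the ordering can reverse, which is consistent with the abstract's finding that EN generally performs best and RF wins only for some genes.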

Recommended Citation

Okoro, Paul Chukwuebuka, "Optimizing Gene Expression Prediction and Omics Integration in Populations of African Ancestry" (2020). Master's Theses. 4345.
https://ecommons.luc.edu/luc_theses/4345

Creative Commons License

This work is licensed under a Creative Commons Attribution-Noncommercial-No Derivative Works 3.0 License.

Included in

Bioinformatics Commons
