Date of Award

Fall 9-5-2025

Degree Type

Thesis

Degree Name

Master of Science (MS)

Department

Bioinformatics & Computational Biology

First Advisor

Heather Wheeler

Abstract

Large biobanks that link whole genome sequences to the deidentified electronic health records of hundreds of thousands of individuals are increasingly available to researchers. This has facilitated the development of polygenic risk scores (PRS), which use machine learning to predict risk of disease and other phenotypes from millions of genetic variants. Because available genetic data are predominantly from individuals of European descent, PRS have primarily been trained in European populations. Differences in allele frequencies, effect sizes, and linkage disequilibrium patterns between populations can limit the transferability of PRS across populations. We therefore assessed how PRS trained in predominantly European populations with a small proportion of other ancestral populations, like those in the Pan-UK Biobank, perform in more diverse ancestral populations, like those in All of Us. We also examined PRS performance when scores were trained and tested in subsets of the same dataset (All of Us). PRS-CSx is a Bayesian PRS training method that combines genetic effects across populations via a shared continuous shrinkage prior to improve polygenic prediction in diverse genetic ancestries. We applied PRS-CSx to standing height GWAS summary statistics from five populations from the Pan-UK Biobank and All of Us. Additionally, we performed a grid search over three values of the PRS-CSx phi parameter (1e-2, 1e-4, 1e-6) and three random seeds to assess training model variance. To determine how to weight the resulting PRS for each population, we validated the PRS-CSx models using ensemble learning validation and tested them in an independent held-out test set. We evaluated the results by comparing the correlation between PRS-CSx-predicted height and observed height after adjusting for sex at birth, age, and 16 genotypic PCs.
In this study, we hypothesized that, for standing height, PRS-CSx models trained in multiple ancestries with a diverse data distribution would improve PRS prediction compared with models trained with predominantly European data or trained within a single ancestry. Contrary to this hypothesis, PRS-CSx models trained with predominantly European data, but with a larger sample size, resulted in better predictive performance than models trained with a diverse dataset. However, the models validated and tested in the African population performed similarly regardless of their training dataset. PRS-CSx models performed better than PRS-CS in nearly all populations, demonstrating the utility of training models in diverse populations.
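
The weighting and evaluation steps described in the abstract can be illustrated with a minimal sketch. It is not the thesis code: the data are synthetic, the function names (`residualize`, `fit_prs_weights`, `combine_prs`, `simulate`) are hypothetical, and it assumes that "adjusting" means regressing observed height on the covariates and that the population-specific scores are combined by ordinary least squares in the validation set.

```python
import numpy as np


def residualize(y, covariates):
    """Residuals of y after least-squares regression on covariates plus an intercept."""
    X = np.column_stack([np.ones(len(y)), covariates])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return y - X @ beta


def fit_prs_weights(prs, y_resid):
    """Least-squares weights (intercept first) for combining population-specific PRS."""
    X = np.column_stack([np.ones(len(y_resid)), prs])
    w, *_ = np.linalg.lstsq(X, y_resid, rcond=None)
    return w


def combine_prs(prs, w):
    """Apply the fitted weights to a matrix of population-specific PRS."""
    return w[0] + prs @ w[1:]


def simulate(n, rng, n_pop=5, n_pcs=16):
    """Synthetic data only: n_pop standardized PRS, sex, age, PCs, and height."""
    prs = rng.standard_normal((n, n_pop))
    sex = rng.integers(0, 2, size=n)
    age = rng.uniform(20, 80, size=n)
    pcs = rng.standard_normal((n, n_pcs))
    cov = np.column_stack([sex, age, pcs])
    true_w = np.array([0.5, 0.2, 0.1, 0.1, 0.1])  # arbitrary illustration values
    height = 165 + 12 * sex - 0.05 * age + prs @ true_w + rng.standard_normal(n)
    return prs, cov, height


rng = np.random.default_rng(0)
prs_val, cov_val, height_val = simulate(500, rng)
prs_test, cov_test, height_test = simulate(500, rng)

# Step 1: learn the per-population PRS weights in the validation set.
weights = fit_prs_weights(prs_val, residualize(height_val, cov_val))

# Step 2: apply the weights to the held-out test set and correlate the
# combined score with covariate-adjusted observed height.
height_test_resid = residualize(height_test, cov_test)
predicted = combine_prs(prs_test, weights)
r_test = np.corrcoef(predicted, height_test_resid)[0, 1]
print(f"test-set correlation: {r_test:.3f}")
```

Fitting the weights only in the validation set and reporting the correlation only in the test set keeps the performance estimate from being inflated by the weighting step.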
