Development and evaluation of machine learning approaches for genetic-based disease prediction in population biobanks
Genetic association studies aim to discover DNA variants associated with complex binary and continuous traits, in order to better understand the causal mechanisms underlying these traits and to enable the development of new treatments and interventions. One of the early aims of these studies was to use associated variants to enable personalised medicine based on an individual’s genetics. However, complex traits are typically influenced by many genetic variants, each contributing only a small effect, which limits the utility of individual variants in risk prediction. Genetic risk scores, which aggregate effects from many associated variants, have been demonstrated to have some utility in risk prediction, but they assume only additive effects, can include only common variants, and select those variants using a crude p-value threshold following association testing.
My project aims to address these limitations by investigating the use of machine learning methods to develop risk prediction algorithms for complex traits. These methods can incorporate multiple genetic models as well as interactions between genetic and non-genetic risk factors, and can make inferences from the data without being told specifically what to look for. For this project I am using data from the UK Biobank, which has collected a range of genetic and non-genetic data for approximately 500,000 participants and represents a world-leading resource for the study of complex disease.
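
To make the baseline concrete, the sketch below shows how a genetic risk score of the kind described above might be computed for one individual: variants passing a p-value threshold are kept, and their effect sizes are summed, weighted by allele dosage. It is a minimal illustration only and not the project's own method. The function name, the threshold default, and all effect sizes, p-values and dosages are hypothetical values chosen for the example.

```python
from typing import Sequence


def genetic_risk_score(
    dosages: Sequence[float],
    betas: Sequence[float],
    p_values: Sequence[float],
    p_threshold: float = 5e-8,  # assumed cut-off, for illustration only
) -> float:
    """Additive score: sum of beta * dosage over variants with p < p_threshold."""
    if not (len(dosages) == len(betas) == len(p_values)):
        raise ValueError("dosages, betas and p_values must have the same length")
    return sum(
        beta * dosage
        for dosage, beta, p in zip(dosages, betas, p_values)
        if p < p_threshold
    )


# Hypothetical summary statistics for three variants (not real data).
betas = [0.10, -0.05, 0.20]     # per-allele effect sizes
p_values = [1e-9, 0.03, 2e-10]  # association test p-values
dosages = [2, 1, 0]             # one individual's effect-allele counts

strict_score = genetic_risk_score(dosages, betas, p_values)
lenient_score = genetic_risk_score(dosages, betas, p_values, p_threshold=0.05)
```

Under the strict threshold the second variant is excluded, so the score changes when the threshold is relaxed. The score is also purely additive: it has no term for interactions between variants or with non-genetic risk factors.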