QUEST programme
(QUantitative Early-career Skills Training)

Current Fellow - David Hughes

Variational Approximation Approaches for Efficient Clinical Predictions

Motivation
High-dimensional data is now routinely collected in many medical settings. The data may be high-dimensional because of the number of different variables measured on a single patient (e.g. genetic or metabolomics studies) or the number of times a variable is repeatedly measured (e.g. longitudinal studies). In addition, the number of individuals with available data in a study is increasing (for example, through the availability of biobanks). One way in which this data can be used is in screening or monitoring patients to determine their risk of developing a disease or requiring an intervention, or to classify patients into risk groups.

Objective
In recent years, joint models for multiple longitudinal (mainly continuous) and time-to-event data have been developed, which make use of mixed models. The popularity of these methods is evidenced by the success of software packages for joint models. These methods provide personalised predictions of a patient's risk of experiencing an event. However, a key limitation of current models is the computational burden of fitting them. This burden arises when there are a large number of patients (many thousands rather than a few hundred), many repeated observations per patient, or multiple clinical biomarkers under consideration. These factors have generally limited the uptake of joint-modelling methodology to small sample sizes and only a few (2-4) longitudinal biomarkers. In the machine learning literature, variational approximation methods have been shown to give fast and accurate estimates of model parameters in a variety of settings. Recent statistical work has introduced these methods for univariate generalised linear mixed models and in some genetic settings.

My proposal is to derive a mean-field variational Bayes (MFVB) approach to estimating joint models for multiple longitudinal biomarkers of different types (continuous, counts, binary, etc.) and time-to-event data. The basic idea of the approach is to avoid estimating the model parameters from the full likelihood function directly and instead to approximate the joint posterior distribution by a product of more tractable densities, each of which can be estimated more easily. The rationale is that by sacrificing a little accuracy, we gain the ability to fit multivariate joint models within a reasonable period of time (seconds or minutes instead of hours or days). Making it practicable to fit such models in a reasonable time frame will allow personalised risk prediction and diagnostics to be performed in real time, and will make a substantial contribution towards stratified medical treatment and intelligent medical systems.

Clinical Applications
The models developed will be used to assess the influence of covariates, and the correlation over time between at least 10 markers that have known links to a patient developing sight-threatening diabetic retinopathy (STDR). The ISDR dataset contains over 20,000 patients, so even fitting a univariate mixed model to assess the evolution of a single marker over time is computationally intensive with current methods, and fitting a model to assess the correlation over time between all of these markers would be infeasible. My approach will lead to an improved understanding of the complex relationship between various clinical markers and their changes over time.
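As a toy illustration of the mean-field idea in the proposal above, the sketch below applies MFVB to a much simpler model than a joint model: independent Normal observations with unknown mean and precision. It is not the proposed method. The function name `mfvb_normal`, the conjugate Normal-Gamma prior, and the default hyperparameter values are all assumptions made for this example. The posterior is approximated by a product q(mu) q(tau), and each factor is updated in turn using expectations taken under the other.

```python
import numpy as np


def mfvb_normal(y, mu0=0.0, lam0=1e-3, a0=1e-3, b0=1e-3, n_iter=50):
    """Mean-field VB for y_i ~ N(mu, 1/tau) (hypothetical toy example).

    Assumed prior: mu | tau ~ N(mu0, 1/(lam0 * tau)), tau ~ Gamma(a0, rate=b0).
    The posterior p(mu, tau | y) is approximated by q(mu) q(tau), with
    q(mu) = N(mu_n, 1/lam_n) and q(tau) = Gamma(a_n, rate=b_n).
    """
    y = np.asarray(y, dtype=float)
    n = y.size
    # These two quantities do not change across iterations.
    mu_n = (lam0 * mu0 + n * y.mean()) / (lam0 + n)
    a_n = a0 + 0.5 * (n + 1)
    e_tau = 1.0  # initial guess for E_q[tau]
    for _ in range(n_iter):
        # Update q(mu): its precision depends on the current E_q[tau].
        lam_n = (lam0 + n) * e_tau
        # Update q(tau): expected squared deviations under q(mu).
        e_sq_data = np.sum((y - mu_n) ** 2) + n / lam_n
        e_sq_prior = lam0 * ((mu_n - mu0) ** 2 + 1.0 / lam_n)
        b_n = b0 + 0.5 * (e_sq_data + e_sq_prior)
        e_tau = a_n / b_n
    lam_n = (lam0 + n) * e_tau  # make q(mu) consistent with the final E_q[tau]
    return {"mu_mean": mu_n, "mu_precision": lam_n,
            "tau_shape": a_n, "tau_rate": b_n}


# Simulated data (not from any clinical dataset): true mean 2.0, true sd 0.5.
rng = np.random.default_rng(0)
y = rng.normal(loc=2.0, scale=0.5, size=2000)
fit = mfvb_normal(y)
print("approximate posterior mean of mu: ", fit["mu_mean"])
print("approximate posterior mean of tau:", fit["tau_shape"] / fit["tau_rate"])
```

Each update is available in closed form, so the loop costs only a few passes over the data. This is the kind of saving the proposal aims for, although joint models will generally need additional approximations where closed-form updates are not available.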

Training Plan

High-dimensional data is now routinely collected in many settings, due to the number of different variables being measured on a subject, the number of times a variable is measured and the number of individuals in a study. This data can be used for screening patients to determine their risk of disease, or to classify patients into risk groups.

In the machine learning literature variational Bayes approximation methods have been shown to give fast and accurate estimates of model parameters in a variety of settings.

My proposal is to derive a mean-field variational Bayes (MFVB) approach to estimating joint models for multiple longitudinal biomarkers of different types and time-to-event data. Making it practicable to fit such models in a reasonable time frame will allow personalised risk prediction and diagnostics to be performed in real time.

This project will be at the interface of statistics and computer science with significant statistical, methodological, and computational components.

Skills gaps to be addressed
Machine and statistical learning; Software for clinical decision-making; Modelling of multidimensional data structures.

Year 1
I will be based within the Department of Biostatistics. I will derive the MFVB approximation to multivariate generalised linear mixed models, and will apply these new models to a number of clinical datasets. I will gain an understanding of the clinical challenges, as well as more experience of the biomedical research environment, by meeting with my clinical collaborators and by attending additional departmental research meetings related to the areas of diabetic retinopathy and cancer metabolomics. I will develop skills in machine learning by attending modules run by the Department of Computer Science: Machine Learning and Bioinspired Optimisation (COMP532) and Data mining and visualization (COMP527).

I plan to spend two weeks visiting Professor Matt Wand, a leading expert in variational approximation techniques, in Sydney. This collaboration will help me to develop my statistical and machine learning skills, and will provide excellent guidance for my research proposal. (Cost to visit University of Technology, Sydney: £1400 flight; £2800 subsistence (14 days @ £200))

In addition, I will interact regularly with the Data Mining and Machine Learning research group and the cross-faculty Bayesian Statistics group. These networks will develop my awareness of current research in the fields related to my research project.

Year 2
I will develop MFVB approximations for joint models of multiple longitudinal markers and time-to-event data, and develop risk prediction models in clinical datasets. I will develop my software skills by completing courses run by Jumping Rivers: R for Big Data (£+VAT) and Introduction to Bayesian Inference using RStan (£+VAT). I will learn about optimisation techniques in the Optimisation module (COMP557). I will continue interactions with the clinical departments by presenting my work at departmental seminars and through ongoing discussion.

I plan to spend one week visiting Dr Dimitris Rizopoulos at Erasmus University in Rotterdam, to further develop my joint modelling techniques with his leading research team. (Anticipated Costs: £ flight, £1050 subsistence (7 days @ £150))

Year 3
Regular meetings with clinical colleagues will take place to discuss the implementation and clinical assessment of the methodology. I will also create an R package containing the code for methods I develop.

During my final year I will participate in the Research Leadership programme at the University of Liverpool. This programme will develop my leadership skills in preparation for leading my own research team and developing grant proposals.

Supervisory Team
Marta Garcia-Fiñana (Multivariate data modelling, Biostatistics, academic sponsor), Frans Oliehoek (Machine learning, Computer Science), Simon Harding (Diabetic Retinopathy risk, Ophthalmology), Chris Probert (Cancer Metabolomics, Cellular and Molecular Physiology)

Expression of Interest by email: 31st January 2019
Applications Close: 17th May 2019
Interviews: TBC
Start the Fellowship: June 2019 - March 2020