Data driven model selection and parameter estimation using semi-automatic approximate Bayesian computation to reconstruct population dynamics from ancient DNA.

Rohrlach, Adam Benjamin

Please use this identifier to cite or link to this item: https://hdl.handle.net/2440/85193

Type:	Thesis
Title:	Data driven model selection and parameter estimation using semi-automatic approximate Bayesian computation to reconstruct population dynamics from ancient DNA.
Author:	Rohrlach, Adam Benjamin
Issue Date:	2014
School/Discipline:	School of Mathematical Sciences
Abstract:	Population genetics is a discipline within the biological sciences that is concerned with the change in frequency of types of individuals in a population due to natural selection, mutation, genetic drift and gene flow. Genetic drift is the part of this process explained by random sampling. Population structure is important to the process of genetic drift, so we focus on the recovery of population sizes over time, given a set of DNA sequences. With recent advances in computational power and a growth in the amount of data available, increasingly powerful techniques are being developed for the study of sequence data. Key advances in the early 1980s centred around 'the coalescent', a continuous-time approximation to the Wright-Fisher model of reproduction, and these advances resulted in Skyline Plot methods for recovering population size estimates over time. Skyline Plots suffer from large variances in the coalescent event times, and from sources of error common to DNA sequence sampling schemes. Approximate Bayesian Computation (ABC) is a class of likelihood-free methods for statistical inference. ABC techniques can trace their genesis back to the biological sciences due to the complexity of the models for reproduction (and hence the intractability of likelihood calculations). Unfortunately, like Skyline Plots, ABC also suffers from many sources of error, not least of which arises when we cannot use sufficient summary statistics. To considerably reduce the effect of the error associated with the use of insufficient summary statistics, we explore a process of semi-automatic summary statistic calculation through the use of 'training data' (simulated under the coalescent model). We obtain a training set of data, and fit a linear model (under a Box-Cox transformation) for each parameter of interest, using common summary statistics for DNA sequences as predictor variables.
We call these linear combinations of (insufficient) summary statistics the semi-automatic summary statistics, and using a new set of simulations, we perform ABC where a simulation is retained if the predicted parameter values are 'close enough' to the predicted parameters for the observed data. We analyse three sets of coalescent simulated data from three population models: the Constant, Exponential and Migration Models, and compare our findings with the corresponding Skyline Plot analyses performed in BEAST. When we simulate data for training our linear model, we must specify a model of population size dynamics, and we explore methods to select a population model, given our data. A common means of model comparison used with ABC analyses is the use of Bayes Factors. We show that Bayes Factors perform poorly for our data, and highlight a fundamental bias inherent in any model comparison where the probability of a model, given an observed summary statistic, is employed. As an alternative to Bayes Factors, we apply multiple logistic regression (MLR) to classify our observed data into one of a candidate set of possible models. In conjunction with the MLR analysis, we use principal component analysis for visualisation, and introduce a method for attempting to identify when the correct model is not in the candidate model set, or when a classification seems reasonable. We show that this method of classification performs well for the three observed data sets using sensitivity analysis. Due to the early stage of development of our work, we cannot use real-world data. Because our method uses coalescent simulations to train the model, we instead test it on data produced by a different type of simulation. We obtain sequence data simulated under a 'forward simulation' framework, a type of sequence simulation that looks forward in time.
We define a two-step process for analysis that begins with MLR classification, and then, under the model chosen by the MLR classification, uses semi-automatic summary statistic calculation for parameter estimation via ABC. We correctly identify the model of population dynamics underlying the forward-simulated data, and perform parameter estimation on the data, comparing our results with the corresponding BEAST Skyline Plot analysis.
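The semi-automatic ABC step described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the thesis's implementation: the coalescent simulator is replaced by a hypothetical toy model (exponential draws with mean `theta`), the Box-Cox parameter is fixed at 0 (so the transformation is a logarithm), the three summary statistics are arbitrary choices, and "close enough" is taken to mean the 1% of simulations nearest to the observed statistic. All function and variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate(theta, n=200):
    """Toy stand-in for a coalescent simulator (assumption): n exponential draws with mean theta."""
    return rng.exponential(scale=theta, size=n)

def summaries(x):
    """Intercept plus three (insufficient) summary statistics on the log scale."""
    return np.array([1.0, np.log(x.mean()), np.log(np.median(x)), np.log(x.std())])

def draw_prior(size):
    """Uniform prior on theta over [0.5, 5] (assumption)."""
    return rng.uniform(0.5, 5.0, size=size)

# Step 1: simulate training data and regress the transformed parameter on the summaries.
# A Box-Cox transformation with lambda = 0 is the natural logarithm.
theta_train = draw_prior(2000)
S_train = np.array([summaries(simulate(t)) for t in theta_train])
beta, *_ = np.linalg.lstsq(S_train, np.log(theta_train), rcond=None)

def semi_auto_stat(x):
    """Semi-automatic summary statistic: the fitted linear predictor of log(theta)."""
    return summaries(x) @ beta

# Step 2: rejection ABC on a fresh set of simulations, comparing predicted parameters.
observed = simulate(2.0)                      # pseudo-observed data with known theta = 2
s_obs = semi_auto_stat(observed)
theta_abc = draw_prior(5000)
s_abc = np.array([semi_auto_stat(simulate(t)) for t in theta_abc])

# Retain the 1% of simulations whose predicted parameter is closest to the observed one.
keep = np.argsort(np.abs(s_abc - s_obs))[:50]
accepted = theta_abc[keep]
posterior_mean = accepted.mean()
print(f"accepted {accepted.size} draws, posterior mean of theta: {posterior_mean:.3f}")
```

The retained `accepted` values approximate a posterior sample for `theta`. With several parameters, one linear model would be fitted per parameter and the distance taken over the vector of predictions.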
Advisor:	Bean, Nigel Geoffrey; Tuke, Simon Jonathan
Dissertation Note:	Thesis (M.Phil.) -- University of Adelaide, School of Mathematical Sciences, 2014
Keywords:	model selection; ABC; phylogenetic reconstruction; semi-automatic summary statistics
Provenance:	This electronic version is made publicly available by the University of Adelaide in accordance with its open access policy for student theses. Copyright in this thesis remains with the author. This thesis may incorporate third party material which has been used by the author pursuant to Fair Dealing exceptions. If you are the owner of any included third party copyright material you wish to be removed from this electronic version, please complete the take down form located at: http://www.adelaide.edu.au/legals
Appears in Collections:	Research Theses

Files in This Item:

File	Description	Size	Format
01front.pdf		157.82 kB	Adobe PDF
02whole.pdf		13.96 MB	Adobe PDF
Permissions	Restricted access (library staff access only)	213.41 kB	Adobe PDF
Restricted	Restricted access (library staff access only)	14.19 MB	Adobe PDF

Adelaide Research & Scholarship