271T Poster - Population Genetics
Thursday June 09, 9:15 PM - 10:00 PM

Tensor decomposition-based feature extraction and classification to detect natural selection from genomic data.


Authors:
Md Ruhul Amin; Mahamudul Hasan; Michael DeGiorgio

Affiliation: Florida Atlantic University

Keywords:
Theory & Method Development

Inferences of adaptive events are important for learning about traits such as human digestion of lactose after infancy, the ability of organisms to survive in extreme environments such as high altitudes, and the rapid spread of viral variants. Early efforts toward identifying footprints of natural selection from genomic data involved the development of summary-statistic and likelihood-based methods. However, such techniques are typically grounded in simple theoretical models, which may limit the complexity of the settings they can explore, and because the summary statistics are hand-engineered, they run the risk of inaccurate predictions. With the renaissance in artificial intelligence, machine and deep learning methods have taken center stage in recent efforts to detect natural selection. Strategies such as convolutional neural networks are applied to images of haplotypes across sampled individuals to simultaneously extract important genomic features and achieve high classification accuracy and power for distinguishing selection from neutrality. Yet limitations of such techniques include difficulty in estimating the number of model parameters and the identification of features without regard to their location within an image. As a complementary approach, we consider an alternative feature extraction method, termed tensor decomposition, which belongs to a class of dimensionality reduction techniques that extract features from multidimensional data while preserving the latent structure of the data. We apply tensor decomposition to images of haplotypes across sampled individuals, and then use the extracted features as input to classical linear and non-linear machine learning methods.
As a proof of concept, we explore the performance of this pipeline on simulated neutral and selective sweep scenarios, and find that it discriminates sweeps from neutrality with high power and accuracy, is robust to missing data, and permits easy visualization of the underlying low-dimensional features uncovered by tensor decomposition. Therefore, our approach is a powerful addition to the toolkit for detecting adaptive processes from genomic data.
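
The pipeline described above (decompose a stack of haplotype images, then pass the low-dimensional features to a classical classifier) might be sketched as follows. This is an illustrative sketch only, not the authors' implementation. The abstract does not say which tensor decomposition is used, so a truncated higher-order SVD (a Tucker-type decomposition) stands in here, and a nearest-centroid rule stands in for the classical linear classifier. The toy data generator, the image size, the rank of 3, and all function names are hypothetical assumptions.

```python
import numpy as np


def make_toy_images(n, rows, cols, sweep, rng):
    """Hypothetical toy data: binary haplotype images (individuals x SNPs)."""
    imgs = (rng.random((n, rows, cols)) < 0.5).astype(float)
    if sweep:
        # Crude stand-in for a sweep: central columns become nearly monomorphic.
        lo, hi = cols // 3, 2 * cols // 3
        imgs[:, :, lo:hi] = (rng.random((n, rows, hi - lo)) < 0.05).astype(float)
    return imgs


def mode_basis(tensor, mode, rank):
    """Leading left singular vectors of the unfolding of `tensor` along `mode`."""
    unfolded = np.moveaxis(tensor, mode, 0).reshape(tensor.shape[mode], -1)
    u, _, _ = np.linalg.svd(unfolded, full_matrices=False)
    return u[:, :rank]


def extract_features(images, row_basis, col_basis):
    """Project each image X onto the bases (U_r^T X U_c) and flatten the result."""
    core = np.einsum("nij,ir,jc->nrc", images, row_basis, col_basis)
    return core.reshape(len(images), -1)


rng = np.random.default_rng(0)
rows, cols, rank = 20, 30, 3  # assumed image size and decomposition rank

# Training and test tensors of shape (samples, individuals, SNPs).
train = np.concatenate([make_toy_images(100, rows, cols, False, rng),
                        make_toy_images(100, rows, cols, True, rng)])
test = np.concatenate([make_toy_images(50, rows, cols, False, rng),
                       make_toy_images(50, rows, cols, True, rng)])
y_train = np.repeat([0, 1], 100)  # 0 = neutral, 1 = sweep
y_test = np.repeat([0, 1], 50)

# Learn the row and column bases from the training tensor only.
row_basis = mode_basis(train, 1, rank)
col_basis = mode_basis(train, 2, rank)

f_train = extract_features(train, row_basis, col_basis)
f_test = extract_features(test, row_basis, col_basis)

# Nearest-centroid classification in the reduced feature space.
centroids = np.stack([f_train[y_train == k].mean(axis=0) for k in (0, 1)])
dists = ((f_test[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
y_pred = dists.argmin(axis=1)
accuracy = (y_pred == y_test).mean()
print(f"toy test accuracy: {accuracy:.2f}")
```

Each image is reduced from 600 entries to 9 features, which are few enough to plot directly. Any classifier could replace the nearest-centroid step, and the toy accuracy says nothing about performance on realistic simulations.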