2022 Population, Evolutionary, and Quantitative Genetics Conference

Characterizing the genetic substructure of large cohorts has become increasingly important as genetic association studies are extended to massive, increasingly diverse biobanks. ADMIXTURE and STRUCTURE are widely used unsupervised clustering algorithms for characterizing such ancestral genetic structure. These methods decompose individual genomes into fractional cluster assignments, with each cluster represented by a vector of DNA marker frequencies. The assignments and clusters provide an interpretable representation that geneticists can use to describe population substructure at the sample level. However, with the rapidly increasing size of population biobanks and the growing number of variants genotyped (or sequenced) per sample, such traditional methods become computationally intractable. Moreover, multiple runs with different hyperparameters are required to properly depict population clustering with these methods, which further increases the computational burden and can lead to days of compute. In this work we present Neural ADMIXTURE, a neural network autoencoder that follows the same modeling assumptions as the current standard algorithm, ADMIXTURE, and provides similar (or better) clustering while reducing compute time by orders of magnitude. Indeed, the equivalent of one month of continuous compute with ADMIXTURE can be reduced to just hours with Neural ADMIXTURE. In addition, by using a multi-head approach, Neural ADMIXTURE can produce multiple clustering outputs at once, providing results equivalent to running standard algorithms many times with different numbers of clusters. Our models can also be stored, allowing later cluster assignment on new data to be performed in linear computational time and without needing to share the training data. The software implementation of Neural ADMIXTURE can be found at https://github.com/ai-sandbox/neural-admixture.
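To make the modeling idea concrete, the following is a minimal NumPy sketch of the forward pass of a multi-head admixture-style autoencoder. It is not the authors' implementation and does not use the neural-admixture package; the class name `MultiHeadAdmixtureSketch`, the single linear encoder per head, the sigmoid parameterization of the frequencies, and the random (untrained) weights are all assumptions made for illustration. Each head with K clusters maps a genotype vector (assumed scaled to 0, 0.5, 1) to fractional assignments Q that sum to one, and reconstructs the genotypes as Q times a K-by-markers matrix of cluster frequencies.

```python
import numpy as np


def softmax(z, axis=-1):
    """Numerically stable softmax: rows become non-negative and sum to 1."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def sigmoid(z):
    """Map unconstrained values to (0, 1), used here for allele frequencies."""
    return 1.0 / (1.0 + np.exp(-z))


class MultiHeadAdmixtureSketch:
    """Hypothetical, untrained illustration of one encoder/decoder pair per K."""

    def __init__(self, n_markers, ks=(2, 3, 4), seed=0):
        rng = np.random.default_rng(seed)
        self.ks = tuple(ks)
        # Encoder weights (assumed linear): markers -> K cluster logits.
        self.enc = {k: rng.normal(scale=0.01, size=(n_markers, k)) for k in self.ks}
        # Decoder logits: sigmoid(dec[k]) is a K x markers frequency matrix F.
        self.dec = {k: rng.normal(size=(k, n_markers)) for k in self.ks}

    def assign(self, x):
        """Fractional cluster assignments Q for every head (one matmul each)."""
        return {k: softmax(x @ self.enc[k]) for k in self.ks}

    def reconstruct(self, x):
        """Reconstruction Q @ F: a convex combination of cluster frequencies."""
        q = self.assign(x)
        return {k: q[k] @ sigmoid(self.dec[k]) for k in self.ks}


# Toy genotypes: 5 samples x 20 markers, coded 0, 0.5 or 1 (assumed scaling).
toy_rng = np.random.default_rng(1)
toy_x = toy_rng.integers(0, 3, size=(5, 20)) / 2.0
toy_model = MultiHeadAdmixtureSketch(n_markers=20, ks=(2, 3))
toy_q = toy_model.assign(toy_x)  # dict: K -> (5, K) assignment matrix
```

In this sketch, assigning new samples with a stored model costs one matrix product per head, which is linear in the number of samples, and it requires only the saved weights rather than the training genotypes. A single forward pass yields assignments for every K in `ks`, mirroring the multi-head idea described above.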

Neural ADMIXTURE: rapid population clustering with autoencoders