2022 Population, Evolutionary, and Quantitative Genetics Conference

Increasing amounts of genotype information put pressure on computational resources. Researchers and other consumers of genotype information who lack access to powerful computer hardware can thus be at a disadvantage. Devising analysis algorithms that process large data sets efficiently is an important component in the drive to democratize access to information. It can also reduce the run time and energy consumption of compute clusters. Estimating similarities among loci (linkage disequilibrium, LD) and among individuals (relationship matrices) is a ubiquitous step in numerous analysis pipelines. The time to compute LD among loci using exact algorithms grows linearly with the number of individuals in a data set and quadratically with the number of loci. While optimizing individual operations can yield significant improvements, we need approximate procedures to improve on these undesirable scaling properties. I describe an approach that uses similarity-preserving hashes to summarize genotype data. This allows for sparse LD matrix computation whose run time is almost insensitive to the number of individuals and grows less than quadratically with the number of loci in the data set. Conversely, the time to estimate sparse genetic similarity matrices is close to insensitive to the number of loci and grows slowly with the number of individuals. In addition, these algorithms require much less memory and allow for explicit precision-time trade-offs. Software implementing these approaches is freely available on GitHub (https://github.com/tonymugen/vash).
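To make the idea of a similarity-preserving hash concrete, here is a minimal sketch using MinHash, one well-known scheme of this kind. It is an illustration under stated assumptions and not necessarily the scheme implemented in vash: each locus is reduced to the set of individuals carrying the minor allele, similarity is measured as Jaccard similarity rather than as an LD statistic, and all function names and parameters (`make_hash_params`, `minhash_signature`, `num_hashes`, and so on) are hypothetical.

```python
import random

# Assumption: a Mersenne prime larger than any individual index.
PRIME = (1 << 61) - 1


def make_hash_params(num_hashes, seed=0):
    """Draw coefficients (a, b) for hash functions h(i) = (a * i + b) mod PRIME."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(0, PRIME))
            for _ in range(num_hashes)]


def minhash_signature(carriers, params):
    """Summarize a non-empty set of carrier indices by its minimum under each hash."""
    if not carriers:
        raise ValueError("locus has no minor-allele carriers")
    return [min((a * i + b) % PRIME for i in carriers) for a, b in params]


def estimate_jaccard(sig_a, sig_b):
    """Estimate Jaccard similarity as the fraction of matching signature slots."""
    matches = sum(x == y for x, y in zip(sig_a, sig_b))
    return matches / len(sig_a)


def exact_jaccard(set_a, set_b):
    """Exact Jaccard similarity, computed from the full sets for comparison."""
    return len(set_a & set_b) / len(set_a | set_b)


# Simulated data (not from the source): 10,000 individuals, two loci whose
# carrier sets overlap in 2,000 individuals out of 4,000 carriers in total.
rng = random.Random(1)
individuals = list(range(10_000))
rng.shuffle(individuals)
locus_a = set(individuals[:3000])
locus_b = set(individuals[1000:4000])

params = make_hash_params(num_hashes=256)
sig_a = minhash_signature(locus_a, params)
sig_b = minhash_signature(locus_b, params)

print("exact:   ", exact_jaccard(locus_a, locus_b))
print("estimate:", estimate_jaccard(sig_a, sig_b))
```

Each signature has a fixed length (`num_hashes`), so comparing two loci costs the same regardless of how many individuals were genotyped. Using more hash functions lowers the variance of the estimate at the cost of extra time and memory, which is one simple form of precision-time trade-off.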

Unreasonably fast estimates of similarity among loci and individuals