327V Poster Online - Virtual Posters
Tuesday June 07, 11:00 AM - 3:00 PM

Pedigree reconstruction in the era of many thousands of samples


Authors:
Daniel Seidman; Ryan O'Hern; Amy L. Williams

Affiliation: Cornell University Graduate School

Keywords:
Population history

As modern genetic datasets grow in size, the fraction of samples with one or more close relatives in a given dataset increases. These relationships could allow for the construction of massive numbers of pedigrees, but scalable and accurate pedigree reconstruction methods are rare. Pedigrees have widespread utility: they can improve the quality of phasing and imputation, help trace the origin of alleles, and enhance heritability and linkage studies.
We propose PELICAN, PEdigree reconstruction from LIkelihoods and ConstrAiNts, an algorithm that can rapidly and accurately reconstruct pedigrees using the latent relatives in large datasets. Using likelihoods for specific relationship types of both first- and second-degree relatives, the algorithm creates a sorted list of potential edges for a pedigree graph. First-degree relationship likelihoods, either full-sibling or parent-child, are calculated from identity-by-descent (IBD) regions shared by pairs of individuals. Our algorithm receives likelihoods for second-degree relationship types, such as grandparent/grandchild or half-sibling, from a separate algorithm called CREST (Qiao, Sannerud, et al. 2021). It combines those likelihoods with additional likelihoods generated from a kernel density estimator (KDE) trained on simulated relatives to form composite likelihoods for the relationships. PELICAN then adds relationships to the pedigree graph in order of likelihood. We impose restrictions on which relationships constitute valid additions to the graph, and the algorithm backtracks in three situations: when a new relationship implies inbreeding in the last two generations, when a set of individuals forms a biologically impossible combination of relationships, or when the partial pedigree cannot be extended to a pedigree of equal or higher likelihood than those already found. With these restrictions in place, the algorithm accounts for all possible pedigrees without the time-consuming process of generating and comparing each one explicitly. In this way, PELICAN is guaranteed to find the maximum composite likelihood pedigree.
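The search strategy described above resembles a branch-and-bound over candidate relationships. The following is a minimal Python sketch of that general idea, not PELICAN's actual implementation. It makes several simplifying assumptions: only parent-child hypotheses are modeled, the validity check (at most two parents, nobody is their own ancestor) is a toy stand-in for the constraints named in the abstract, and all names (`violates`, `best_pedigree`, the labels "a>b", "b>a", "none") are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

Pair = Tuple[str, str]                 # (a, b): two sample IDs
Options = List[Tuple[str, float]]      # (label, log-likelihood) hypotheses
# Hypothetical labels: "a>b" = a is a parent of b, "b>a" = b is a parent
# of a, "none" = no edge between the two samples.


def violates(parents: Dict[str, List[str]]) -> bool:
    """Toy constraint check (an assumption, far simpler than real rules):
    reject more than two parents, or anyone who is their own ancestor."""
    if any(len(p) > 2 for p in parents.values()):
        return True

    def reaches(start: str, target: str, seen: frozenset) -> bool:
        # Walk upward through parents looking for `target`.
        for p in parents.get(start, []):
            if p == target:
                return True
            if p not in seen and reaches(p, target, seen | {p}):
                return True
        return False

    return any(reaches(child, child, frozenset()) for child in parents)


def best_pedigree(candidates: Dict[Pair, Options]):
    """Pick one hypothesis per pair to maximize the summed log-likelihood,
    backtracking on invalid graphs and on branches that cannot win."""
    # Consider the most confidently called pairs first.
    pairs = sorted(candidates, key=lambda pr: -max(ll for _, ll in candidates[pr]))
    # suffix_best[i]: best score obtainable from pairs[i:], ignoring constraints.
    suffix_best = [0.0] * (len(pairs) + 1)
    for i in range(len(pairs) - 1, -1, -1):
        suffix_best[i] = suffix_best[i + 1] + max(ll for _, ll in candidates[pairs[i]])

    best = {"score": float("-inf"), "labels": None}

    def recurse(i: int, score: float, labels: dict, parents: dict) -> None:
        # Bound: even the optimistic completion is worse than the best found.
        if score + suffix_best[i] < best["score"]:
            return
        if i == len(pairs):
            if score > best["score"]:
                best["score"], best["labels"] = score, dict(labels)
            return
        a, b = pairs[i]
        for label, ll in sorted(candidates[pairs[i]], key=lambda o: -o[1]):
            child: Optional[str] = b if label == "a>b" else a if label == "b>a" else None
            parent = a if label == "a>b" else b
            if child is not None:
                parents.setdefault(child, []).append(parent)
            if not violates(parents):
                labels[pairs[i]] = label
                recurse(i + 1, score + ll, labels, parents)
                del labels[pairs[i]]
            if child is not None:
                parents[child].pop()   # undo the edge before trying the next option

    recurse(0, 0.0, {}, {})
    return best["score"], best["labels"]


# Made-up log-likelihoods: the top choice for each pair would form a cycle
# (A parent of B, B parent of C, C parent of A), so the search must backtrack.
example = {
    ("A", "B"): [("a>b", -1.0), ("b>a", -3.0), ("none", -10.0)],
    ("B", "C"): [("a>b", -1.0), ("b>a", -2.0), ("none", -8.0)],
    ("C", "A"): [("a>b", -0.5), ("b>a", -4.0), ("none", -6.0)],
}
print(best_pedigree(example))
```

Because the bound is computed by ignoring the constraints, it never underestimates what a partial assignment can still achieve, so pruning on it cannot discard the optimum. That is the sense in which such a search can cover every valid configuration without enumerating each one.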
As a proof of concept, we applied PELICAN to the ~500,000 samples of the UK Biobank dataset to demonstrate the algorithm's scalability and to provide the resulting pedigrees as a resource to the community. PELICAN completed its analysis in ~3 hours, inferring 11,529 separate pedigrees that each contain more than two samples.