284W Poster - Population Genetics
Wednesday June 08, 8:30 PM - 9:15 PM

Modeling alignment cost in mixed-membership unsupervised genetic clustering


Authors:
Xiran Liu 1; Naama Kopelman 2; Noah Rosenberg 1

Affiliations:
1) Stanford University, Stanford, CA; 2) Holon Institute of Technology, Holon, Israel

Keywords:
Theory & Method Development

Mixed-membership unsupervised clustering is widely used to extract informative patterns from data in many application areas. In population genetics, unsupervised clustering methods such as ADMIXTURE and STRUCTURE are commonly used to infer population structure and ancestry proportions from genetic data. For a shared data set, clustering results produced by different algorithms, or even multiple runs of the same algorithm, can be difficult to compare, as outcomes can differ owing to permutation of the cluster labels, meaningful differences in clustering results, or both. Here, we study the cost of misalignment of mixed-membership unsupervised clustering replicates under a theoretical model of cluster memberships. Using Dirichlet distributions to model membership coefficient vectors, we provide theoretical results quantifying the alignment cost as a function of the Dirichlet parameters and the Hamming permutation difference between replicates. For fixed Dirichlet parameters, the alignment cost is seen to increase with the Hamming distance between permutations. Data sets with low variance across individuals of membership coefficients for specific clusters generally produce high misalignment costs, so that a single optimal permutation has far lower cost than suboptimal permutations. Higher variability in data, as represented by greater variance of membership coefficients, generally results in alignment costs that are similar between the optimal permutation and suboptimal permutations. We demonstrate the application of the theoretical results to data simulated under the Dirichlet model, as well as to membership estimates from inference of human genetic ancestry. The results can contribute to improving cluster alignment algorithms that seek to find optimal permutations of replicates.
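
The qualitative behavior described above can be explored with a small simulation. The following is a minimal sketch, not the authors' implementation. It assumes that two replicates are independent draws from the same Dirichlet distribution, and that the alignment cost is the mean squared Euclidean distance between membership vectors after the cluster labels of one replicate are permuted; the abstract does not state the exact cost function, so this choice is an assumption. The function names (alignment_cost, hamming_distance, mean_cost_by_hamming) and the parameter values are hypothetical and chosen only for illustration.

```python
import itertools

import numpy as np


def alignment_cost(q1, q2, perm):
    """Assumed cost: mean over individuals of the squared Euclidean distance
    between rows of q1 and rows of q2 with columns reordered by perm."""
    return float(np.mean(np.sum((q1 - q2[:, list(perm)]) ** 2, axis=1)))


def hamming_distance(perm_a, perm_b):
    """Number of positions at which two permutations differ."""
    return sum(a != b for a, b in zip(perm_a, perm_b))


def mean_cost_by_hamming(alpha, n_individuals=2000, seed=0):
    """Simulate two replicates under Dirichlet(alpha) and average the assumed
    cost over all permutations at each Hamming distance from the identity."""
    rng = np.random.default_rng(seed)
    # Each row is one individual's membership vector (non-negative, sums to 1).
    q1 = rng.dirichlet(alpha, size=n_individuals)
    q2 = rng.dirichlet(alpha, size=n_individuals)
    identity = tuple(range(len(alpha)))
    costs = {}
    for perm in itertools.permutations(identity):
        d = hamming_distance(perm, identity)
        costs.setdefault(d, []).append(alignment_cost(q1, q2, perm))
    return {d: float(np.mean(c)) for d, c in sorted(costs.items())}


if __name__ == "__main__":
    # Same mean memberships, different concentration (illustrative values):
    # large parameters give low variance, small parameters give high variance.
    low_variance = mean_cost_by_hamming((20.0, 5.0, 1.0))
    high_variance = mean_cost_by_hamming((2.0, 0.5, 0.1))
    print("low variance: ", low_variance)
    print("high variance:", high_variance)
```

Under these assumptions, the low-variance setting should show a large gap between the identity permutation and all others, while the high-variance setting should show a smaller relative gap. Enumerating all permutations is feasible only for a small number of clusters, since the number of permutations grows factorially.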