
Neural ADMIXTURE for rapid genomic clustering


Nature Computational Science volume 3, pages 621–629 (2023)


A preprint version of the article is available at bioRxiv.

Characterizing the genetic structure of large cohorts has become increasingly important as genetic studies extend to massive, increasingly diverse biobanks. Popular methods decompose individual genomes into fractional cluster assignments, with each cluster representing a vector of DNA variant frequencies. However, with rapidly increasing biobank sizes, these methods have become computationally intractable. Here we present Neural ADMIXTURE, a neural network autoencoder that follows the same modeling assumptions as the current standard algorithm, ADMIXTURE, while reducing the compute time by orders of magnitude, surpassing even the fastest alternatives. One month of continuous compute using ADMIXTURE can be reduced to just hours with Neural ADMIXTURE. A multi-head approach allows Neural ADMIXTURE to offer even further acceleration by computing multiple cluster numbers in a single run. Furthermore, the models can be stored, allowing cluster assignment to be performed on new data in linear time without needing to share the training samples.

The rapid growth in sequenced human genomes and the proliferation of population-scale biobanks have enabled the creation of increasingly accurate models to predict traits and disease risk using an individual’s genome. However, different predictive models can be required depending on an individual’s genetic ancestry, and this necessitates accurately characterizing genetic cluster composition at the individual level1. Such characterization is also an essential part of most modern population genetics studies and national biobanking efforts2,3. However, many existing algorithms for this task struggle with next-generation sequencing datasets, where both the number of samples and the number of sequenced positions along the genome are much greater than in earlier case–control genotyping studies. Scalable algorithms to characterize the population structure of genetic sequences are especially important for more diverse biobanks, themselves needed to correct the extreme imbalance towards European-descent samples in existing studies, in order to avoid a new divide in healthcare arising from the omission of most of the world’s population from precision health research4.

A common approach for characterizing the population structure within a genetic dataset is to describe each sample as a set of fractional assignments to each cluster. These clusters are centroids found via an unsupervised algorithm in a space spanning the frequencies of each variant. By avoiding the culture-specific labels and subjective constructs (for example, ethnicity) of supervised classification methods5, these unsupervised approaches can better reflect the spectrum of genetic structure across samples. Generally, the input variants are the individual’s sequence of single nucleotide polymorphisms (SNPs), that is, single positions along the genome known to vary between individuals. Smaller datasets with fewer variants, such as microsatellites, have also been used. There are millions of SNPs in the human genome, and most are biallelic (two variants), permitting a binary encoding. For instance, zero could be used to encode the most common (or reference) variant at an SNP position on the genome and one to encode the minority (or alternate) variant. The frequency distribution of these variants will vary between populations due to differing histories: founder events, migration, isolation, and drift.

We present an autoencoder that expands on the standard clustering method for genomes: ADMIXTURE6,7. ADMIXTURE was developed as a computationally efficient alternative to STRUCTURE8, and we now take this pursuit of efficiency to the next generation of datasets. Our proposed method, Neural ADMIXTURE, follows the same modeling assumptions as ADMIXTURE but reframes the task as a neural-network-based autoencoder, providing faster computational times on both graphics processing units (GPUs) and central processing units (CPUs) while maintaining high-quality assignments.

Neural ADMIXTURE (Fig. 1a) is an interpretable autoencoder with two main components: (1) an encoder, composed of two linear layers with a Gaussian error linear unit (GELU) activation9 in-between, then a softmax activation, which projects a genotype sequence onto a vector representing fractional ancestry assignments for each individual (Q); and (2) a decoder, which is a single linear layer whose weights are restricted to lie between 0 and 1, leading to an interpretable projection matrix that learns the cluster centroids, or equivalently, the average variant frequency at each site for each population (F). Additionally, we introduce Multi-head Neural ADMIXTURE (Fig. 1b), which includes multiple decoders in a single network to obtain results analogous to training ADMIXTURE repeatedly for different numbers of clusters, but needing only a single training for all numbers of clusters desired.

a, Single-head architecture. The input sequence (x) is projected into 64 dimensions using a linear layer (θ1) and processed by a GELU non-linearity (σ1). The cluster assignment estimates Q are computed by feeding the 64-dimensional sequence to a K-neuron layer (parametrized by θ2) activated with a softmax (σ2). Finally, the decoder outputs a reconstruction of the input (\(\tilde{x}\)) using a linear layer with weights F. Note that the decoder is restricted to this linear architecture to ensure interpretability. b, Simple multi-head example with H = 3. The 64-dimensional hidden vector is copied and processed independently by different sets of weights (\(\theta_{2_h}\)), which yield vectors of different dimensions, corresponding to the different K values. Each different \(Q_{K_h}\) matrix is processed independently by a different decoder matrix \(F_{K_h}\), yielding H different reconstructions. All parameters are optimized jointly in an end-to-end fashion.
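The architecture can be summarized in a few lines of PyTorch. The sketch below is illustrative only, assuming the layer sizes described above (a 64-dimensional hidden layer, K clusters); the class and variable names are ours, not those of the released package.

```python
import torch
import torch.nn as nn

class NeuralAdmixtureSketch(nn.Module):
    """Minimal single-head Neural ADMIXTURE-style autoencoder (illustrative)."""
    def __init__(self, num_snps: int, k: int):
        super().__init__()
        self.batch_norm = nn.BatchNorm1d(num_snps)  # standardizes each SNP
        self.encoder = nn.Sequential(
            nn.Linear(num_snps, 64),  # theta_1
            nn.GELU(),                # sigma_1
            nn.Linear(64, k),         # theta_2
            nn.Softmax(dim=1),        # sigma_2: rows of Q sum to 1
        )
        # Single linear decoder without bias; its (num_snps x k) weight matrix
        # is F^T, so the weights are interpretable as per-cluster frequencies.
        self.decoder = nn.Linear(k, num_snps, bias=False)

    def forward(self, x: torch.Tensor):
        q = self.encoder(self.batch_norm(x))  # fractional assignments Q
        return self.decoder(q), q             # reconstruction ~ QF, and Q
```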

Neural ADMIXTURE was trained with a standard binary cross-entropy, leading to an equivalence with the traditional ADMIXTURE model’s objective function (Methods). Two initialization techniques, one based on principal component analysis10,11,12 and the other on archetypal analysis13, were used as an alternative to common network initializations to speed up the training process and improve results (Supplementary section ‘Decoder initialization’). Furthermore, two mechanisms are available to incorporate prior knowledge about the amount of admixture in a dataset by controlling the softness of the cluster assignments: applying L2 regularization during training (Methods) and softmax tempering (Supplementary section ‘Softmax tempering’). Both the single-head and multi-head approaches can be adapted to a supervised version that performs regular classification given known training labels (Supplementary section ‘Supervised training’). The proposed method is fully compatible with the original ADMIXTURE framework, allowing the use of ADMIXTURE results as an initialization for the Neural ADMIXTURE parameters (Supplementary section ‘Pretrained mode’), and vice versa. We performed an in-depth evaluation of the proposed method and compared it with competing approaches across multiple datasets, including simulations from a variety of systems14,15,16,17 and samples from large-scale, real-world biobanks (Methods, Supplementary Tables 1 and 2, and Supplementary section ‘Dataset description’).

Neural ADMIXTURE is systematically faster than alternative algorithms, both on CPU and GPU (Table 1, Supplementary Fig. 1). This speedup is further enhanced when using the Multi-head Neural ADMIXTURE architecture, which can perform clusterings for different K values simultaneously. For example, on the All-Chms dataset, Neural ADMIXTURE trained in less than 2 min, whereas ADMIXTURE required more than a day. Neural ADMIXTURE performs at least as well as existing algorithms at predicting both the ancestry assignments (Q) and the allele frequencies (F). On average, Neural ADMIXTURE’s Q estimates appear to be more similar to the matrix of known labels than the Q estimates from previous methods (Extended Data Fig. 1).

Table 2 shows the accuracy and time performance of ADMIXTURE and Neural ADMIXTURE on the test data for three different datasets. Both ADMIXTURE and Neural ADMIXTURE are able to generalize and produce consistent assignments on unseen data. However, Neural ADMIXTURE is much faster than ADMIXTURE on both CPU and GPU, because ADMIXTURE must optimize the objective with a fixed F to find Q for unseen data, whereas Neural ADMIXTURE directly learns a function that estimates Q. We note that inference on GPU is extremely fast (generally less than a second for a forward pass); the computational bottleneck comes simply from reading and processing of the data, which could be further addressed.
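Concretely, obtaining Q for unseen samples reduces to a single forward pass. A usage sketch with the hypothetical module above, where x_new holds genotypes encoded in {0, 0.5, 1}:

```python
import torch

model.eval()                  # model: a trained NeuralAdmixtureSketch
with torch.no_grad():
    _, q_new = model(x_new)   # (num_samples, K) assignments, no re-optimization
```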

We visualized the Q estimates of ADMIXTURE and Neural ADMIXTURE on the Chm-22-Sim dataset using pong18 (Fig. 2a–d). The SNP frequencies (the entries in the F matrix) from both models can be observed as projections onto the first two principal components of the training data (Fig. 2e). Neural ADMIXTURE provides harder cluster predictions, with many samples being assigned only to a single population, whereas ADMIXTURE provides softer cluster predictions with partial assignments to multiple clusters. On this dataset, ADMIXTURE does not assign different clusters to Native Americans (AMR) and East Asians (EAS); instead, it partitions Africans (AFR) into two different ancestry clusters (Fig. 2a,b). Neural ADMIXTURE, however, does split AMR and EAS populations (Fig. 2c–e). Depictions of the cluster assignments (Q) of all algorithms on several datasets can be found in Supplementary Figs. 2–5.

a, Q estimates of ADMIXTURE on training data. b, Q estimates of ADMIXTURE on test data. c, Q estimates of Neural ADMIXTURE on training data. d, Q estimates of Neural ADMIXTURE on test data. e, Two-dimensional principal component analysis (PCA) projection of the training data and the matrix F learnt by both ADMIXTURE and Neural ADMIXTURE, which correspond to the cluster centroids. The color of each individual in the PCA represents its ground truth regional label. f, Q estimates of Neural ADMIXTURE on admixed populations not present in the training data. Among the MXL samples, we observe mainly an orange AMR component with a red and yellow component (West Asians (WAS) and Europeans (EUR), respectively). These latter components likely originate from the immigration of Spanish, Morisco, and Sephardic Jewish individuals into Mexico during the colonial period. The PUR samples exhibit EUR, WAS, AMR, and AFR ancestry clusters. The additional AFR component is likely linked to the introduction of enslaved West Africans during the colonial period. In the barplots (used to visualize Q), each vertical bar represents an individual sample and bar color lengths represent the proportion of the sample’s ancestry assigned to that colored cluster. OCE, Oceanians; SAS, South Asians.

We applied Neural ADMIXTURE, trained on Chm-22-Sim, to admixed populations that were not present in the training data: Mexican Ancestry in Los Angeles, California (MXL, 118), and Puerto Ricans in Puerto Rico (PUR, 104) (Fig. 2f).

We evaluated Multi-head Neural ADMIXTURE with Chm-22-Sim (Extended Data Fig. 2) and showed that as the number of clusters increases, each population group gets assigned its own cluster. Furthermore, we showed that Multi-head Neural ADMIXTURE can be successfully applied to closely related populations (Extended Data Fig. 3). Finally, we showed that the proposed method can be applied on real, admixed datasets (Extended Data Fig. 4).

To assess the clustering speed on a very large dataset, we ran Neural ADMIXTURE in its multi-head mode on the entire UK Biobank—a total of 488,377 samples—and using 147,604 SNPs subsetted to remove linkage disequilibrium (LD) by pruning the full set19. Neural ADMIXTURE was able to process the complete dataset within 11 h, providing results from K = 2 to K = 6, whereas ADMIXTURE would take about a month to do the same, given that it took 5.5 days to provide results for K = 2. Traditional techniques such as ADMIXTURE are thus too slow for such large biobanks, particularly because multiple additional runs with different parameters and subsets of data are generally needed in a study. Neural ADMIXTURE was trained without regularization (λ = 0, Methods) and using the PCK-means initialization (Supplementary Algorithm 1). During inference, the temperature was set to \(\tau =\frac{3}{2}\) (Supplementary section ‘Softmax tempering’). Figure 3 displays these cluster assignments for the UK Biobank genomes. We group the individuals by their reported country of birth; those with missing or non-existent country-of-birth labels were excluded from the plots.
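Softmax tempering here refers to dividing the encoder logits by a temperature before applying the softmax; with τ > 1 the resulting assignments are softened. A minimal sketch of this mechanism (our reading of the Supplementary description, not the released code):

```python
import torch

def tempered_softmax(logits: torch.Tensor, tau: float = 1.5) -> torch.Tensor:
    # tau > 1 flattens the distribution (softer assignments);
    # tau = 3/2 was used for UK Biobank inference.
    return torch.softmax(logits / tau, dim=1)
```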

Although results are only displayed for K = 6, the multi-head architecture was trained for K = 2 to K = 6 simultaneously in approximately 11 h. In the barplots (used to visualize Q), each vertical bar represents an individual sample and stacked bar color heights represent the proportion of the sample’s ancestry assigned to that colored genetic cluster. Since they result from unsupervised clustering, interpretation of the cluster colors is left open. a, Q estimates of all the samples. Although many samples are clustered together (blue cluster, representing a northern European/British ancestry component), other clusters emerge reflecting the diverse modern populations now living within the United Kingdom. b, Q estimates of individuals born in the British and Irish Isles and territories. Samples from Gibraltar and the Channel Islands are excluded as they contain a very small number of individuals. c, Q estimates for individuals born outside of the British and Irish Isles are labeled by their country or region of birth, showcasing clusters representing Africans, East Asians, South Asians, Northern Europeans, and West Asians (sharing a cluster in part with Southern Europeans). Despite the large ancestry imbalance, Neural ADMIXTURE characterizes the globally diverse genetic variation found in the UK Biobank. Many UK residents born in other countries appear to have northern European (British) ancestry. These likely represent children born abroad to British parents, who later repatriated. We also note a sizeable South-Asian-like genetic ancestry cluster seen in many individuals born in East Africa. This likely stems from the decolonization era exodus out of East Africa of South Asians, who had settled there during the British Empire. The predicted cluster assignments for K = 2 to K = 6 for individuals born outside of the British and Irish Isles can be found in Extended Data Fig. 5.

To assess the scalability of different methods, we simulated multiple datasets with various numbers of variants and samples using the software reported previously17. The datasets consist of combinations of N ∈  {1,000, 5,000, 10,000, 20,000, 50,000} and M ∈  {1,000, 10,000, 50,000, 100,000}, where N and M are the number of samples and SNPs, respectively.

We compared the training times of ADMIXTURE, AlStructure, TeraStructure, and Neural ADMIXTURE, on both CPU and GPU, across different dataset sizes (Fig. 4). Neural ADMIXTURE is consistently faster than the alternatives. Moreover, unlike the other methods, Neural ADMIXTURE accelerates substantially on GPUs. The hyperparameters used are described in Supplementary Table 3.

Neural ADMIXTURE has clearly faster execution times than the other benchmarked methods on both CPU and GPU. AlStructure results are not reported for the 50,000-sample datasets because its execution times were prohibitively slow.

Many unsupervised clustering methods for genotype sequences have been introduced8,20,21,22,23,24,25, including the most commonly used, ADMIXTURE6,7. These methods, which resemble a non-negative matrix factorization, decompose each input sequence into a set of cluster assignments and compute a centroid for each cluster. The cluster assignments give the proportion of each genetic ancestry cluster for an individual, whereas the cluster centroids give the SNP variant frequencies at each genetic position corresponding to each cluster. As diploid organisms, humans have a paternal and a maternal copy of each non-sex chromosome. Therefore, for a given individual at each genomic position, there are four possible combinations of biallelic SNPs (0/0, 0/1, 1/0, 1/1). It is common practice to sum both maternal and paternal variants, obtaining a count sequence nij. In this scenario, an individual i has nij ∈  {0, 1, 2} copies of the minority variant at SNP j. ADMIXTURE models each individual’s count sequence, given a fixed number of population groups K, as nij ~ Bin(2, pij), where pij = ∑kqikfkj, with qik denoting the fraction of population k assigned to i, and fkj denoting the frequency of the variant ‘1’ at SNP j in population k. ADMIXTURE applies block relaxation to find the parameters Q and F that minimize the negative log-likelihood function shown in equation (1). The value of K (number of clusters) is typically chosen by using an ad hoc cross-validation procedure7, necessitating runs across a range of values.
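As a sketch, the negative log-likelihood of this binomial model (equation (1), reproduced in the Methods) can be written directly from the definitions above; NumPy is used here for clarity, and the small eps guards the logarithms:

```python
import numpy as np

def admixture_nll(n, Q, F, eps=1e-9):
    """n: (N, M) minor-allele counts in {0, 1, 2}; Q: (N, K); F: (K, M)."""
    P = Q @ F  # p_ij = sum_k q_ik * f_kj
    return -np.sum(n * np.log(P + eps) + (2 - n) * np.log(1 - P + eps))
```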

The block relaxation optimization in ADMIXTURE runs much faster than the approaches used by its main competitors, namely FRAPPE21 and STRUCTURE8. Although it can be run in multi-threaded mode, greatly reducing execution time, it remains insufficient when dealing with either a large number of samples or a large number of SNPs. Here we instead use neural networks, whose architectures have begun to be explored for several other genetic structure tasks, including haplotype segmentation, dimensionality reduction, and classification26,27,28,29,30,31,32,33,34,35 (Supplementary section ‘Related work’).

An important caveat when using soft-clustering techniques such as Neural ADMIXTURE or ADMIXTURE is that these techniques follow a modeling assumption that there are some ‘prototype’ populations and that each individual can be placed within the convex hull of such prototypes. Note that this model might not reflect the underlying structure of real-world populations, particularly when independent genetic drift has occurred in each population following admixture events. This limitation is particularly acute in the case of ancient admixture events, and in such cases other complementary techniques should also be used. Future experiments to quantify these effects using simulations would be valuable. Combining unsupervised clustering with tree-based methods to account for this drift would also be a useful direction, and could complement the progress being made on ancestral recombination graphs.

Although the computational times of Neural ADMIXTURE enable practitioners to obtain rapid results with multiple hyperparameters and different values of K, properly selecting the best results still involves a subjective element, and additional experiments and new quantitative measures are needed. Further, unsupervised clustering methods, and more generally dimensionality-reduction techniques, are affected by sampling imbalances between population groups, which can alter population structure detection and prioritization36,37. Additionally, even if structure is not present within the data, these techniques can indicate otherwise38,39.

As described in the Discussion, the existing ADMIXTURE algorithm minimizes the negative log-likelihood:

$$\mathcal{L}_C(Q,F)=-\sum_{i=1}^{N}\sum_{j=1}^{M}\left[n_{ij}\log\left(\sum_{k=1}^{K}q_{ik}f_{kj}\right)+(2-n_{ij})\log\left(1-\sum_{k=1}^{K}q_{ik}f_{kj}\right)\right]\quad(1)$$

subject to \(\sum_{k}q_{ik}=1\), \(q_{ik}\geq 0\) and \(0\leq f_{kj}\leq 1\), with \(Q=(q_{ik})\) and \(F=(f_{kj})\).

This can be formulated as a non-negative matrix factorization problem. Let X denote the training samples, where the features are the alternate allele normalized counts per position and the jth SNP of the ith individual is represented as \({x}_{ij}=\frac{{n}_{ij}}{2}\in \{0,0.5,1\}\). Then X ≈ QF, where Q contains the assignments, F contains the alternate allele frequencies per SNP and population, and the negative log-likelihood in equation (1) is a distance between X and QF. This can be translated into a neural network as an autoencoder, with Q = Ψ(X) being the bottleneck computed by the encoder function Ψ and F being the decoder weights themselves (Fig. 1a). Because Q is computed at every forward pass rather than learnt as a fixed matrix for the training data, Q assignments for previously unseen data can be retrieved with a simple forward pass, whereas ADMIXTURE must re-run the optimization process with F fixed.

Note that the restrictions in the optimization problem (equation (1)) impose restrictions on the architecture. Those relating to Q (∑kqik = 1 and qik ≥ 0) can be enforced by applying a softmax activation at the encoder output, making the bottleneck equivalent to the cluster assignments. Although the decoder restriction (0 ≤ fkj ≤ 1) could be enforced by applying the sigmoid function to the decoder weights, we found that it suffices to project the weights of the decoder to the interval [0, 1] after every optimization step, one of the most common forms of projected gradient descent40.
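A sketch of one training step with this projection, reusing the hypothetical module from above (loss_fn and optimizer stand for the usual PyTorch loss and optimizer objects):

```python
optimizer.zero_grad()
x_rec, q = model(x)
loss = loss_fn(x_rec, x)  # binary cross-entropy; see equation (2) below
loss.backward()
optimizer.step()
with torch.no_grad():
    # Projection step: keep the decoder weights (that is, F) inside [0, 1].
    model.decoder.weight.clamp_(0.0, 1.0)
```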

The decoder must be linear and cannot be followed by a non-linearity, as this would break the interpretability of the F matrix; the equivalence between the decoder weights and cluster centroids would be lost. On the other hand, the encoder architecture is free from constraints, and it may be composed of several layers. The proposed architecture includes a 64-dimensional, non-linear layer with a GELU activation before the bottleneck and batch normalization acting directly on the input. The latter re-scales the data to have zero mean and unit variance. Since the mean for each SNP is its frequency p, and the standard deviation σ is \(\sqrt{p(1-p)}\) , the {0, 1} input gets encoded as \(\left\{{-\sqrt{\frac{p}{1-p}},\sqrt{\frac{1-p}{p}}}\right\}\) , thereby supplying more explicitly the information of the allele frequencies to the network.
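A quick numerical check of this encoding, for an example frequency p = 0.2 (so that \(-\sqrt{p/(1-p)}=-0.5\) and \(\sqrt{(1-p)/p}=2\)):

```python
import numpy as np

p = 0.2                             # example allele frequency
x = np.array([0.0, 1.0])            # the two possible binary inputs
z = (x - p) / np.sqrt(p * (1 - p))  # batch-norm style standardization
print(z)                            # [-0.5  2. ]
```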

The ADMIXTURE model does not precisely reconstruct the input data as a regular autoencoder would do, because the input SNP genotype sequences, nij ∈  {0, 1, 2}, and the reconstructions, pij ∈  [0, 1], do not have matching ranges. This can easily be remedied by dividing the genotype counts by two, so that the input data are \({x}_{ij}=\frac{{n}_{ij}}{2}\in \{0,0.5,1\}\). Moreover, instead of minimizing \(\mathcal{L}_C\) (equation (1)), we propose minimizing the binary cross-entropy with a penalty term on the Frobenius norm of the encoder weights, θ:

$$\mathcal{L}(X,\tilde{X})=-\sum_{i=1}^{N}\sum_{j=1}^{M}\left[x_{ij}\log(\tilde{x}_{ij})+(1-x_{ij})\log(1-\tilde{x}_{ij})\right]+\lambda\|\theta\|_{F}^{2}\quad(2)$$

where \(\tilde{X}=QF\) is the reconstruction.

This regularization term avoids hard assignments in the bottleneck, which helps during the training process and reduces overfitting. In equation (3) we show that the proposed optimization problem and the ADMIXTURE one are equivalent (excluding the regularization term) by using equations (1) and (2):

$$2\,\mathcal{L}(X,QF)\Big|_{\lambda=0}=-\sum_{i=1}^{N}\sum_{j=1}^{M}\left[n_{ij}\log(p_{ij})+(2-n_{ij})\log(1-p_{ij})\right]=\mathcal{L}_C(Q,F)\quad(3)$$

since \(x_{ij}=\frac{n_{ij}}{2}\) and \(\tilde{x}_{ij}=p_{ij}=\sum_{k}q_{ik}f_{kj}\).

A perfect reconstruction can of course be obtained by setting the number of clusters (K) equal to the number of training samples or to the dimension of the input (number of SNPs). However, the bottleneck should ideally capture elementary information about the population structure of the given sequences; therefore, we make use of low-dimensional bottlenecks.

In ADMIXTURE, cross-validation must be performed to choose the number of population clusters (K), unless specific prior information about the number of population ancestries is known. Furthermore, in many applications, practitioners desire to observe how cluster assignments change as the number of clusters increases. As the number of both sequenced individuals and variants increases, the feasible number of different cluster numbers that can be run for cross-validation rapidly decreases due to the additional computational cost. As a solution, Multi-head Neural ADMIXTURE allows all cluster numbers to be run simultaneously by taking advantage of the 64-dimensional latent representation computed by the encoder. This shared representation is jointly learnt for the different values of K, {K1, …, KH}.

Figure 1b shows how the shared representation is split into H different heads in the multi-head architecture. The ith head consists of a non-linear projection to a Ki-dimensional vector, which corresponds to an assignment that assumes there are Ki different genetic clusters in the data. Although every head’s output could be concatenated and fed through a single decoder, this would cause the decoder weights F to no longer be interpretable. Therefore, every head needs its own decoder and, thus, H different reconstructions of the input are retrieved.

As we have H reconstructions, we now have H different loss values. We can train this architecture by minimizing equation (4):

$$\mathcal{L}_{\mathrm{multi}}=\sum_{h=1}^{H}\mathcal{L}\left(X,Q_{K_h}F_{K_h}\right)\quad(4)$$

where \(Q_{K_h}\) and \(F_{K_h}\) are, respectively, the cluster assignments and the SNP frequencies per population for the hth head. The restrictions of the ADMIXTURE optimization problem (equation (1)) must be satisfied by \(Q_{K_h}\) and \(F_{K_h}\) for all \(h\in \{1,\ldots ,H\}\).

The multi-head architecture allows H different cluster assignments, corresponding to H different values of K, to be computed efficiently in a single forward pass. The results can then be analyzed quantitatively and qualitatively by the practitioner to decide which value of K is most suitable for the data.
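A multi-head sketch following Fig. 1b, again with illustrative names only: a shared trunk, one softmax head per K and one decoder per head, trained by summing the per-head losses as in equation (4).

```python
import torch
import torch.nn as nn

class MultiHeadSketch(nn.Module):
    def __init__(self, num_snps: int, ks: list):
        super().__init__()
        self.trunk = nn.Sequential(nn.BatchNorm1d(num_snps),
                                   nn.Linear(num_snps, 64), nn.GELU())
        self.heads = nn.ModuleList(
            nn.Sequential(nn.Linear(64, k), nn.Softmax(dim=1)) for k in ks)
        self.decoders = nn.ModuleList(
            nn.Linear(k, num_snps, bias=False) for k in ks)

    def forward(self, x: torch.Tensor):
        h = self.trunk(x)                       # shared 64-d representation
        qs = [head(h) for head in self.heads]   # one Q per value of K
        recs = [dec(q) for dec, q in zip(self.decoders, qs)]
        return recs, qs

# Training then minimizes the sum of the per-head losses:
# loss = sum(loss_fn(rec, x) for rec in recs)
```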

Let N denote the number of samples and M the number of variants (SNPs). To assess the performance of the Q estimates, we match the assignments with the known labels and report the RMSE between them,

$$\mathrm{RMSE}(Q,Q^{\mathrm{GT}})=\sqrt{\frac{1}{NK}\sum_{i=1}^{N}\sum_{k=1}^{K}\left(q_{ik}-q_{ik}^{\mathrm{GT}}\right)^{2}},$$

and the RMSE between the known allele frequencies (\(F^{\mathrm{GT}}\)) and the estimated frequencies (F),

$$\mathrm{RMSE}(F,F^{\mathrm{GT}})=\sqrt{\frac{1}{KM}\sum_{k=1}^{K}\sum_{j=1}^{M}\left(f_{kj}-f_{kj}^{\mathrm{GT}}\right)^{2}}.$$

We also use a new metric, Δ, defined as

$$\Delta=\frac{1}{N^{2}}\left\Vert QQ^{\top}-Q^{\mathrm{GT}}\left(Q^{\mathrm{GT}}\right)^{\top}\right\Vert_{F}^{2},$$
which is equivalent to the mean squared difference between the covariance matrices of the estimated and the target populations. If the Q estimates completely agree with QGT (up to permutation), Δ will be zero; the larger the disagreement, the higher the value of Δ. We are interested in these metrics because they are more easily interpreted than the loss function value itself. We are aware that these pseudo-supervised metrics, when applied to datasets simulated from real individuals, do not yield the true quality of the models’ predictions, since the biogeographic labels assigned to the real sequences used to simulate datasets might not reflect the true genomic clusters and variation within the populations. To further investigate this issue, we also used fully simulated population clusters to evaluate the methods.
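As a sketch, these metrics can be computed as follows, assuming the columns of Q have already been permuted to best match the ground truth; the 1/N² normalization of Δ is our reading of the ‘mean squared difference’ description above:

```python
import numpy as np

def rmse(A, B):
    return np.sqrt(np.mean((A - B) ** 2))  # used for both Q and F estimates

def delta(Q, Q_gt):
    """Mean squared difference between the N x N covariance structures."""
    N = Q.shape[0]
    return np.sum((Q @ Q.T - Q_gt @ Q_gt.T) ** 2) / N**2
```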

For reproducibility, we used a comprehensive set of publicly available, labeled human whole-genome sequences from diverse populations across the world, combining the 1000 Genomes Project41, the Simons Genome Diversity Project42, and the Human Genome Diversity Project43, as well as data simulated from these samples using PyAdmix14 and data simulated de novo using the Balding–Nichols Pritchard–Stephens–Donnelly model8,23. The populations within the combined real datasets can be found in Supplementary Table 2. Each subpopulation is aggregated into a continental-level label according to its geographical location (Supplementary section ‘Dataset description’). Additionally, we used the entire UK Biobank genotype dataset.

We compared Neural ADMIXTURE’s computational time and clustering quality with those of ADMIXTURE, fastSTRUCTURE24, AlStructure22, and TeraStructure23. fastSTRUCTURE assumes the STRUCTURE model but uses accelerated variational methods instead of MCMC, yielding speedups of more than two orders of magnitude over STRUCTURE. TeraStructure iteratively computes Q and F while avoiding a high computational load by subsampling SNPs at every iteration, which makes the algorithm faster. AlStructure first estimates a low-dimensional linear subspace of the admixture components and then searches for a model in that subspace that satisfies the modeling constraints, yielding a fast alternative to the iterative or maximum likelihood schemes followed by most algorithms. Furthermore, we also compared against HaploNet26, a variational autoencoder that maps parts of the sequence (windows) to a low-dimensional latent space, on which clustering is then performed using Gaussian mixture priors. Although the global structure of the data is preserved in the low-dimensional space, the direct interpretability of the allele frequencies (available in Neural ADMIXTURE) is not.

All models were optimized using 16 threads on a 64-core AMD EPYC 7742 (x86_64) processor with 512 GB of RAM. We restricted the number of threads to 16, although more cores were available, so that several executions could be run in parallel. To assess the GPU performance of Neural ADMIXTURE, all networks were trained on an NVIDIA Tesla V100 SXM2 with 32 GB of memory. The same GPUs were used to run inference on the trained models.

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

The samples used in the ‘Experiments’ section were compiled from public datasets: 1000 Genomes Project (https://www.internationalgenome.org/data/)41, the Simons Genome Diversity Project (https://www.simonsfoundation.org/simons-genome-diversity-project/)42, and the Human Genome Diversity Project (https://www.internationalgenome.org/data-portal/data-collection/hgdp)43. The compiled datasets (All-Chms, Chm-22 and Chm-22-Sim) are available on figshare44. The UK Biobank has approval from the North West Multi-centre Research Ethics Committee as a Research Tissue Bank. This dataset is available to researchers through an open application via https://www.ukbiobank.ac.uk/register-apply/. The entire dataset of genotypes available to download from the UK Biobank portal was used. Source data are provided with this paper.

The software is available as an installable package in the PyPI repository under the name ‘neural-admixture’. The source code can be found at https://github.com/ai-sandbox/neural-admixture (ref. 45).

Martin, A. R. et al. Clinical use of current polygenic risk scores may exacerbate health disparities. Nat. Genet. 51, 584–591 (2019).

Bycroft, C. et al. The UK Biobank resource with deep phenotyping and genomic data. Nature 562, 203–209 (2018).

Privé, F. Using the UK Biobank as a global reference of worldwide populations: application to measuring ancestry diversity from GWAS summary statistics. Bioinformatics 38, 3477–3480 (2022).

Morales, J. et al. A standardized framework for representation of ancestry data in genomics studies, with application to the NHGRI-EBI GWAS Catalog. Genome Biol. 19, 1–10 (2018).

Mathieson, I. & Scally, A. What is ancestry? PLoS Genet. 16, e1008624 (2020).

Alexander, D. H., Novembre, J. & Lange, K. Fast model-based estimation of ancestry in unrelated individuals. Genome Res. 19, 1655–1664 (2009).

Alexander, D. H. & Lange, K. Enhancements to the ADMIXTURE algorithm for individual ancestry estimation. BMC Bioinform. 12, 246 (2011).

Pritchard, J. K., Stephens, M. & Donnelly, P. Inference of population structure using multilocus genotype data. Genetics 155, 945–959 (2000).

Hendrycks, D. & Gimpel, K. Gaussian error linear units (GELUs). Preprint at https://doi.org/10.48550/arXiv.1606.08415 (2020).

Novembre, J. et al. Genes mirror geography within Europe. Nature 456, 98–101 (2008).

Patterson, N., Price, A. L. & Reich, D. Population structure and eigenanalysis. PLoS Genet. 2, e190 (2006).

Price, A. L. et al. Principal components analysis corrects for stratification in genome-wide association studies. Nat. Genet. 38, 904–909 (2006).

Cutler, A. & Breiman, L. Archetypal analysis. Technometrics 36, 338–347 (1994).


Kumar, A., Montserrat, D. M., Bustamante, C. & Ioannidis, A. XGMix: local-ancestry inference with stacked XGBoost. Preprint at bioRxiv https://doi.org/10.1101/2020.04.21.053876 (2020).

Maples, B. K., Gravel, S., Kenny, E. E. & Bustamante, C. D. RFMix: a discriminative modeling approach for rapid and robust local-ancestry inference. Am. J. Hum. Genet. 93, 278–288 (2013).

Karavani, E. et al. Screening human embryos for polygenic traits has limited utility. Cell 179, 1424–1435.e8 (2019).

Chiu, A., Molloy, E., Tan, Z., Talwalkar, A. & Sankararaman, S. Inferring population structure in biobank-scale genomic data. Am. J. Hum. Genet. 109, 727–737 (2022).

Behr, A. A., Liu, K. Z., Liu-Fang, G., Nakka, P. & Ramachandran, S. Pong: fast analysis and visualization of latent clusters in population genetic data. Bioinformatics 32, 2817–2823 (2016).

Sudlow, C. et al. UK Biobank: an open access resource for identifying the causes of a wide range of complex diseases of middle and old age. PLoS Med. 12, e1001779 (2015).

Bradburd, G. S., Coop, G. M. & Ralph, P. L. Inferring continuous and discrete population genetic structure across space. Genetics 210, 33–52 (2018).

Tang, H., Peng, J., Wang, P. & Risch, N. J. Estimation of individual admixture: analytical and study design considerations. Genet. Epidemiol. 28, 289–301 (2005).

Cabreros, I. & Storey, J. D. A likelihood-free estimator of population structure bridging admixture models and principal components analysis. Genetics 212, 1009–1029 (2019).

Gopalan, P., Hao, W., Blei, D. & Storey, J. Scaling probabilistic models of genetic variation to millions of humans. Nat. Genet. 48, 1587–1590 (2016).

Raj, A., Stephens, M. & Pritchard, J. K. fastSTRUCTURE: variational inference of population structure in large SNP data sets. Genetics 197, 573–589 (2014).

Gimbernat-Mayol, J., Dominguez Mantes, A., Bustamante, C. D., Mas Montserrat, D. & Ioannidis, A. G. Archetypal analysis for population genetics. PLoS Comput. Biol. 18, e1010301 (2022).

Meisner, J. & Albrechtsen, A. Haplotype and population structure inference using neural networks in whole-genome sequencing data. Genome Res. 32, 1542–1552 (2022).

Joo, W., Lee, W., Park, S. & Moon, I.-C. Dirichlet variational autoencoder. Pattern Recognit. 107, 107514 (2020).

Keller, S. M., Samarin, M., Torres, F. A., Wieser, M. & Roth, V. Learning extremal representations with deep archetypal analysis. Int. J. Comput. Vis. 129, 805–820 (2021).

Ausmees, K. & Nettelblad, C. A deep learning framework for characterization of genotype data. G3 12, jkac020 (2022).

Svensson, V., Gayoso, A., Yosef, N. & Pachter, L. Interpretable factor models of single-cell RNA-seq via variational autoencoders. Bioinformatics 36, 3418–3421 (2020).

Battey, C., Coffing, G. C. & Kern, A. D. Visualizing population structure with variational autoencoders. G3 11, jkaa036 (2021).

Montserrat, D. M., Bustamante, C. & Ioannidis, A. LAI-Net: local-ancestry inference with neural networks. In 2020 IEEE International Conference on Acoustics, Speech and Signal Processing 1314–1318 (IEEE, 2020).

Oriol Sabat, B., Mas Montserrat, D., Giro-i Nieto, X. & Ioannidis, A. G. SALAI-Net: a species-agnostic local ancestry inference network. Bioinformatics 38, ii27–ii33 (2022).

Romero, A. et al. Diet networks: thin parameters for fat genomics. In 5th International Conference on Learning Representations (OpenReview.net, 2017).

Battey, C. J., Ralph, P. L. & Kern, A. D. Predicting geographic location from genetic variation with deep neural networks. eLife 9, e54507 (2020).

Toyama, K. S., Crochet, P.-A. & Leblois, R. Sampling schemes and drift can bias admixture proportions inferred by structure. Mol. Ecol. Resour. 20, 1769–1785 (2020).

Elhaik, E. Principal component analyses (PCA)-based findings in population genetic studies are highly biased and must be reevaluated. Sci. Rep. 12, 14683 (2022).

Chari, T., Banerjee, J. & Pachter, L. The specious art of single-cell genomics. Preprint at bioRxiv https://doi.org/10.1101/2021.08.25.457696 (2021).

Montserrat, D. M. & Ioannidis, A. G. Adversarial attacks on genotype sequences. In 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (IEEE, 2023).

Lin, C.-J. Projected gradient methods for nonnegative matrix factorization. Neural Comput. 19, 2756–2779 (2007).


1000 Genomes Project Consortium. A global reference for human genetic variation. Nature 526, 68–74 (2015).

Mallick, S. et al. The Simons Genome Diversity Project: 300 genomes from 142 diverse populations. Nature 538, 201–206 (2016).

Bergström, A. et al. Insights into human genetic variation and population history from 929 diverse genomes. Science 367, eaay5012 (2020).

Dominguez Mantes, A. et al. Neural ADMIXTURE - datasets. figshare https://doi.org/10.6084/m9.figshare.19387538.v1 (2022).

Dominguez Mantes, A., Ioannidis, A. G. & Mas Montserrat, D. ai-sandbox/neural-admixture: stable release. Zenodo https://doi.org/10.5281/zenodo.7938892 (2023).

This work was partially supported by a grant from the Stanford Institute for Human-Centered Artificial Intelligence (HAI), NIH grants 7U01HG009080 and R01HG010140, and project PID2020-117142GB-I00 funded by MCIN/AEI/10.13039/501100011033. This research was conducted using the UK Biobank Resource under Application Number 89006.

Department of Biomedical Data Science, Stanford Medical School, Stanford, CA, US

Albert Dominguez Mantes, Daniel Mas Montserrat & Alexander G. Ioannidis

Signal Theory and Communications Department, Polytechnic University of Catalonia, Barcelona, Catalonia, Spain

Albert Dominguez Mantes & Xavier Giró-i-Nieto

School of Life Sciences, École Polytechnique Fédérale de Lausanne, Lausanne, Vaud, Switzerland

Galatea Bio, Hialeah, FL, US

Institute for Computational and Mathematical Engineering, Stanford University, Stanford, CA, US


A.G.I. and D.M.M. designed the research. A.D.M. performed the research and wrote the software. A.D.M., D.M.M., X.G.N., and A.G.I. interpreted the results. C.D.B. contributed data. A.D.M., D.M.M., and A.G.I. wrote the manuscript.

Correspondence to Alexander G. Ioannidis.

C.D.B. is the chief executive officer of Galatea Bio, and A.G.I. also holds shares. The remaining authors declare no competing interests.

Nature Computational Science thanks the anonymous reviewers for their contribution to the peer review of this work. Primary Handling Editor: Ananya Rastogi, in collaboration with the Nature Computational Science team. Peer reviewer reports are available.

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Algorithms appearing closer in the MDS projection have more similar estimates than those farther away. To use MDS, a distance matrix of the Q results of the different algorithms (including the ground-truth matrix) was computed using the Frobenius norm between the different Q matrices. The average of the normalized distances was taken across all datasets to retrieve a single distance matrix.

For K=3, European (EUR), West Asian (WAS), and South Asian (SAS) individuals are combined within the same cluster, while American (AMR), Oceanian (OCE), and East Asian (EAS) individuals are clustered together, and African (AFR) individuals have their own cluster. These results reflect the genetic similarity between the respective groups due to their Out-of-Africa migration patterns and subsequent gene flow. After increasing to K=5, OCE obtains its own cluster, reflecting the ancient divergence of that population, which in our study consists of the Australo-Papuan groups: Native Australian (SGDP), Papuan Highlands (HGDP), Papuan Sepik (HGDP), Bougainville (HGDP), and Dusun (HGDP). As more clusters are incorporated, AMR and EAS obtain their own clusters and OCE is divided between a component found predominantly in OCE and a component characteristic of EAS. The latter likely reflects the later migration of Austronesian speakers from East Asia out into the Pacific Islands, where they contributed their ancestry to the Oceanian inhabitants. A shared component between EUR, SAS, and WAS is maintained, independent of the cluster number K. This could be linked to early farmer expansions out of West Asia into both Europe and South Asia following the birth of agriculture, or to the much later expansion of the Indo-European language family across all of these regions. Other genetic exchanges between these neighboring regions doubtless also played a role. With a sufficiently high number of clusters, a shared component between WAS and some AFR populations appears, perhaps reflecting North African gene flow.

To qualitatively assess the performance of Neural ADMIXTURE on related groups, we ran multi-head Neural ADMIXTURE on a subset of the dataset All-Chms containing 504 East Asian (EAS) individuals from neighboring regions. The self-reported ancestries of these individuals are Chinese Dai in Xishuangbanna, China (CDX, 93); Han Chinese in Beijing, China (CHB, 103); Han Chinese South (CHS, 105); Japanese in Tokyo, Japan (JPT, 104); and Kinh in Ho Chi Minh City, Vietnam (KHV, 99). The network was trained in its multi-head version from K=3 to K=7 using the PCK-Means initialization. The Japanese samples (JPT) are differentiated and clearly assigned their own cluster (blue), which is present only marginally in other populations. CDX (Chinese Dai) and KHV (Vietnamese Kinh) initially share the same cluster (K=3, green), reflecting their common Southeast Asian lineage, but are split into different groups at K=4 (purple and green). As expected, CHB (Han Chinese in Beijing) and CHS (Han Chinese from South China) samples share the same cluster at first (red) and are only differentiated last (at K=5, red and orange). Further structure (yellow and brown) is seen within some populations at higher K.

To assess the performance of the model on real admixed samples, we trained a multi-head Neural ADMIXTURE model (from K=2 to K=5) with samples whose self-reported ancestries are African Caribbean in Barbados (ACB, 96); African Ancestry in Southwest US (ASW, 61); Colombian in Medellin, Colombia (CLM, 94); Mexican Ancestry in Los Angeles, California (MXL, 64); Peruvian in Lima, Peru (PEL, 85); and Puerto Rican in Puerto Rico (PUR, 104). The groups were selected from the 1000 Genomes Project. The variants used (839,629) are the same as in the dataset All-Chms. The network was trained using the PCK-Means initialization (Supplementary section ‘Decoder initialization’). At K=2, ACB and ASW are assigned predominantly to their own cluster, separating their mostly African origins from the remaining out-of-Africa components. When introducing the next new cluster (K=3), admixed individuals in CLM, MXL, and PEL are assigned some fraction to it, differentiating an Indigenous American component in them from their European component. At K=4, the individuals in the PUR population are assigned some fraction of the new cluster, and this cluster is also present in small amounts in CLM and smaller amounts in some MXL. This component, which does not decrease the Indigenous American component fraction in the samples, likely represents an early colonial-era Spanish (European-ancestry) founder effect on the island of Puerto Rico, perhaps reflecting the subsequent early colonial expansion from the Spanish Caribbean to coastal Colombia and Mexico. Structure in the European component appears at K=5.

(a) K=2 (b) K=3 (c) K=4 (d) K=5 (e) K=6. Because the majority of the dataset is composed of individuals with white British ancestry, we only plot the cluster assignments of individuals who reported a country of birth outside the British and Irish Isles. K=2 approximately divides samples between European and non-European populations. With K=3, European, South-and-East Asian, and African ancestry clusters emerge. At K=4, a finer-grained clustering emerges, dividing East and South Asian populations. K=5 adds a fifth cluster shared in common (with different proportions) between Southern European (Mediterranean) and West Asian (Near Eastern) populations. Finally, K=6 introduces a cluster mostly present in Northern and Eastern European populations.

Supplementary Text 1–7, Figs. 1–5 and Tables 1–3.

Q estimates of different methods on benchmarking training datasets.

Q estimates of ADMIXTURE and Neural ADMIXTURE on benchmarking test datasets.

Q estimates for ADMIXTURE and Q and F estimates for Neural ADMIXTURE on Chm-22-Sim and on admixed datasets.

Q estimates of Multi-head Neural ADMIXTURE (K = 2 to K = 6) on the UK Biobank dataset.

Runtimes of different methods on datasets of different numbers of samples and variants.

Q estimates of Multi-head Neural ADMIXTURE (K = 3 to K = 8) on the test data of Chm-22-Sim.

Q estimates of Multi-head Neural ADMIXTURE (K = 3 to K = 7) trained on samples from East Asia.

Q estimates of Multi-head Neural ADMIXTURE (K = 2 to K = 5) trained only on admixed samples.

Q estimates of Multi-head Neural ADMIXTURE (K = 2 to K = 6) trained on the UK Biobank dataset.

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.

Dominguez Mantes, A., Mas Montserrat, D., Bustamante, C. D. et al. Neural ADMIXTURE for rapid genomic clustering. Nat. Comput. Sci. 3, 621–629 (2023). https://doi.org/10.1038/s43588-023-00482-7

