We analyzed data from a total of 2,733 participants from the GALA II16 and SAGE17 asthma case–control studies who self-identified as African American (AA; n = 757), Puerto Rican (PR; n = 893), Mexican American (MX; n = 784) or other Latino American (LA; n = 299) (Table 1 and Supplementary Table 1). The median age of the participants varied from 13.2 (PR) to 16.0 years (AA). Genome-wide, or global, genetic ancestry proportions were estimated for all participants (Fig. 1). The median global African ancestry was highest in AA (82.6%), followed by PR (19.7%), and lowest in MX (3.5%).
Heritability of gene expression in admixed populations
We compared the heritability (h2) and genetic variance (VG) of whole-blood gene expression attributed to common genetic variation (minor allele frequency (MAF) ≥ 0.01) within the cis-region across self-identified race/ethnicity groups and subpopulations defined based on genetic ancestry (see Methods). Across 17,657 genes, cis-heritability (Fig. 2a) was significantly higher in AA (median h2 = 0.097) compared with PR (h2 = 0.072; PWilcoxon = 2.2 × 10−50) and MX participants (h2 = 0.059; P = 3.3 × 10−134) and in PR compared with MX participants (PWilcoxon = 2.2 × 10−25) (Supplementary Table 2). Genetic variance (Fig. 2b) of whole-blood transcript levels in AA participants (median VG = 0.022) was higher than in PR participants (VG = 0.018; PWilcoxon = 4.0 × 10−19) and in MX participants (VG = 0.013; PWilcoxon = 5.6 × 10−135). The results remained unchanged when the sample size was fixed to n = 600 in all populations (Extended Data Fig. 1).
Next, we compared the distribution of h2 (Fig. 2c) and VG (Fig. 2d) between participants grouped based on proportions of global genetic ancestry (Supplementary Table 3 and Supplementary Fig. 1). Among participants with >50% African ancestry (AFRhigh; n = 721), cis-heritability (h2 = 0.098) and genetic variance (VG = 0.022) were higher than in n = 1,011 participants with <10% global African ancestry (AFRlow; h2 = 0.060 (PWilcoxon = 9.6 × 10−126) and VG = 0.013 (PWilcoxon = 7.6 × 10−106)). Among individuals with >50% Indigenous American (IAM) ancestry (IAMhigh; n = 610), cis-heritability (h2 = 0.059) and genetic variance (VG = 0.012) were lower than in participants with <10% IAM ancestry (IAMlow; h2 = 0.084 (P = 3.1 × 10−103) and VG = 0.020 (PWilcoxon = 3.1 × 10−158)). The results remained consistent when partitioning h2 and VG by coarse MAF bins, with larger differences in h2 and VG among 0.01 ≤ MAF ≤ 0.10 variants (Extended Data Fig. 2).
We also investigated the impact of ancestry at the locus level, defined as the number of alleles (zero, one or two) derived from each ancestral population at the transcription start site. Heritability was higher in individuals with homozygous local African ancestry (AFR/AFR) compared with AFR/European (EUR) ancestry (h2 = 0.096 versus h2 = 0.084, respectively; PWilcoxon = 1.4 × 10−14) and lower in participants with homozygous Indigenous American ancestry (IAM/IAM) compared with IAM/EUR ancestry (h2 = 0.055 versus 0.064, respectively; P = 1.6 × 10−7) (Fig. 2e). Differences in VG by local ancestry were statistically significant for AFR/AFR versus AFR/EUR (PWilcoxon = 2.0 × 10−7) and IAM/IAM versus IAM/EUR (PWilcoxon = 1.6 × 10−8) (Fig. 2f). The results were also consistent for VG comparisons within self-identified race/ethnicity groups (Supplementary Table 4).
We calculated heritability using linkage disequilibrium adjusted kinships (LDAKs). This method assumes that SNP-specific variance is inversely proportional not only to the MAF, but also to linkage disequilibrium tagging18. Estimates obtained using the LDAK-Thin model and genome-wide complex trait analysis (GCTA) were nearly identical across populations based on self-identified race/ethnicity (h2 = 0.094 for AA, 0.071 for PR and 0.059 for MX) and genetic ancestry (h2 = 0.104 for AFRhigh, 0.066 for AFRlow, 0.062 for IAMhigh and 0.093 for IAMlow; Supplementary Table 5).
Lastly, we tabulated the number of heritable genes for which global and/or local ancestry was significantly associated (false discovery rate (FDR) < 0.05) with transcript levels (Supplementary Fig. 2 and Supplementary Table 6). Global AFR ancestry was associated with the expression of 326 (2.4%) and 589 (4.5%) heritable genes in AA and PR, respectively. Associations with local, but not global, AFR ancestry were more common (8.9% in AA and 10.9% in PR). Local IAM ancestry was associated with the expression of 9.8% of genes in MX, compared with 2.8% for global IAM ancestry.
Assessment of ancestry-specific eQTLs
To understand the control of gene expression at a more granular level, we performed cis-eQTL analysis. A total of 19,567 genes with at least one cis-eQTL (eGenes) were found in the pooled sample. The largest number of eGenes was detected in AA (n = 17,336), followed by PR (n = 16,975) and MX participants (n = 15,938) (Supplementary Table 7 and Supplementary Fig. 3). The number of eGenes was similar in AFRhigh (n = 17,123) and AFRlow (n = 17,146) groups. When the sample size was fixed to n = 600 for all populations, the number of eGenes observed in AFRhigh (n = 16,100) was higher than in AFRlow (n = 14,344). The numbers of eGenes detected in the IAMlow (n = 14,866) and IAMhigh groups (n = 14,419) were similar. The number of linkage disequilibrium-independent (r2 < 0.10) cis-eQTLs per gene was significantly higher in the AFRhigh group than the AFRlow group (PWilcoxon = 2.7 × 10−246) and lower in the IAMhigh group compared with the IAMlow group (PWilcoxon = 2.8 × 10−33) (Extended Data Fig. 3).
To characterize ancestry-related differences in the regulation of gene expression, we developed a framework for identifying ancestry-specific eQTLs (anc-eQTLs) (see Methods, Fig. 3 and Supplementary Tables 8 and 9). For heritable protein-coding genes, we first compared the overlap in 95% credible sets of cis-eQTLs identified in participants with >50% global ancestry (AFRhigh and IAMhigh) and those with <10% of the same global ancestry (AFRlow and IAMlow). For genes with nonoverlapping 95% credible sets, we distinguished between population differences in MAF (tier 1) and linkage disequilibrium (tier 2). For genes with overlapping 95% credible sets, eQTLs were further examined for evidence of effect size heterogeneity between ancestry groups (tier 3).
Tier 1 anc-eQTLs were only common (MAF ≥ 0.01) in individuals with >50% AFR or IAM ancestry and were thus considered to be the most ancestry specific. Over 28% (n = 2,695) of genes contained at least one tier 1 AFRhigh anc-eQTL, while 7% (n = 562) of genes contained a tier 1 IAMhigh anc-eQTL (Supplementary Table 9). A representative example of a tier 1 AFRhigh anc-eQTL is rs3211938 (CD36), which has an MAF of 0.077 in the AFRhigh group and an MAF of 0.002 in the AFRlow group (Fig. 4a). This variant has been linked to high-density lipoprotein (HDL) cholesterol levels in several multi-ancestry GWASs that included African Americans19,20,21. Tier 2 anc-eQTLs, with ancestry-specific linkage disequilibrium patterning, had an MAF of 0.01 in both high (>50%) and low (<10%) global ancestry groups and were further fine-mapped using PESCA22. There were 109 genes (1.1%) that contained eQTLs with a posterior probability (PP) of being specific to AFRhigh of >0.80 and 33 genes (0.4%) matching the same criteria for IAMhigh (Supplementary Table 9). For instance, two lead eQTLs in low linkage disequilibrium were detected for TRAPPC6A in AFRhigh (rs12460041) and AFRlow (rs7247764) populations (Fig. 4b). These variants belonged to nonoverlapping credible sets and PESCA estimated that rs12460041 was specific to AFRhigh with a PP of 0.87 (Fig. 4c). Over 50% of heritable protein-coding genes (n = 5,058 for AFR and 5,355 for IAM) had overlapping 95% credible sets of eQTLs between high and low ancestry groups. Among these shared signals, a small proportion of eQTLs exhibited significant effect size heterogeneity (tier 3; 2.0% for AFRhigh and 1.0% for IAMhigh). For example, the KCNK17 eQTLs rs34247110 and rs3734618 were detected in the AFRhigh and AFRlow groups, but with significantly different effect sizes (Cochran’s Q P value = 1.8 × 10−10) in each population (Fig. 4d). Consistent with tier 3 eQTLs being observed in multiple ancestries, rs34247110 has been associated with type 2 diabetes in Japanese and multi-ancestry (European, African American, Hispanic and Asian) populations23,24.
The prevalence of any tier 1, 2 or 3 anc-eQTL was 30% (n = 2,961) for AFR ancestry and 8% (n = 679) for IAM ancestry. Overall, 3,333 genes had anc-eQTLs for either ancestry. The remaining genes (n = 6,648 for AFR and 7,836 for IAM) did not contain eQTLs with ancestry-related differences in MAF, linkage disequilibrium or effect size as outlined above. Increasing the global ancestry cut-off to >70% did not have an appreciable impact on anc-eQTLs in the AFRhigh group (28.1% overall and 27.3% for tier 1), but decreased the number of anc-eQTLs in the IAMhigh group (3.3% overall and 3.3% for tier 1), probably due to a greater reduction in sample size in this group (n = 212 versus n = 610, respectively; Supplementary Table 10). Considering all protein-coding genes without filtering based on heritability (n = 13,535), the prevalence of anc-eQTLs was 22% for the AFRhigh group, 5% for the IAMhigh group and 25% overall. The observation that anc-eQTLs were more common in participants with >50% AFR ancestry aligns with the higher h2 and VG values in this population and a greater number of linkage disequilibrium-independent cis-eQTLs (Extended Data Fig. 3). Among genes with tier 1 and 2 anc-eQTLs, 83% had higher h2 estimates in the AFRhigh group than in the AFRlow group, while this was observed for 57% of genes without any ancestry-specific eQTLs (Extended Data Fig. 4).
We detected 70 unique anc-eQTLs associated with 84 phenotypes from the NHGRI-EBI GWAS catalog25, most of which were tier 3 anc-eQTLs (59%) that mapped to blood cell traits, lipids and plasma protein levels (Supplementary Table 11). Colocalization with GWAS results from the multi-ancestry Population Architecture Using Genomics and Epidemiology (PAGE) study21 identified 78 eQTL–trait pairs with strong evidence of a shared genetic signal (PP4 > 0.80), 16 of which were anc-eQTLs (Supplementary Table 12). One illustrative example is rs7200153, an AFRhigh tier 1 anc-eQTL for the haptoglobin (HP) gene, which colocalized with total cholesterol (PP4 = 0.997; Extended Data Fig. 5). The 95% credible set included rs7200153 (PPSNP = 0.519) and rs5471 (PPSNP = 0.481; linkage disequilibrium r2 = 0.75), with rs5471 probably being the true causal variant given its proximity to the HP promoter, stronger effect of HP expression and decreased transcriptional activity of rs5471-C in West African populations26,27,28, which is supported by previous literature20,29,30.
We also performed trans-eQTL analyses that largely recapitulated our cis-eQTL results, with a larger number of trans-eGenes and independent (linkage disequilibrium r2 < 0.10) trans-eQTLs in populations with greater levels of AFR ancestry and lower levels of IAM ancestry (see Methods and Supplementary Table 13).
Performance of TWAS models trained in admixed populations
Following the PrediXcan approach7, we developed gene expression imputation models from the pooled GALA II and SAGE population (n = 2,733) for 11,830 heritable genes with a mean cross-validation of 0.157 (Supplementary Table 13 and Supplementary Fig. 4). We also generated population-specific models for African Americans (10,090 genes; cross-validation r2 = 0.180), Puerto Ricans (9,611 genes; cross-validation r2 = 0.163) and Mexican Americans (9,084 genes; cross-validation r2 = 0.167). Adjusting for local ancestry did not improve predictive performance (Supplementary Table 14).
We validated GALA II/SAGE TWAS models and compared with GTEx v8 in the Study of Asthma Phenotypes and Pharmacogenomic Interactions by Race-Ethnicity (SAPPHIRE)31—a study of 598 African American adults (Supplementary Fig. 5). The prediction accuracy was proportional to the degree of alignment in ancestry between the training and testing study samples. Across 5,254 genes with models available in all studies, the median Pearson’s correlation between genetically predicted and observed transcript levels was highest for pooled (r = 0.086) and AA (r = 0.083) models and lowest for GTEx (r = 0.049).
To evaluate the performance of the GALA II/SAGE TWAS models for gene discovery in admixed populations, we applied them to GWAS summary statistics for 28 traits from the PAGE study21 and compared them with TWASs using GTEx v8 (refs. 2,7) and the Multi-Ethnic Study of Atherosclerosis (MESA)3. GTEx v8 whole-blood models are based on 670 participants of predominantly European ancestry (85%)2, while MESA models impute monocyte gene expression3 from African American and Hispanic/Latino individuals (MESAAFHI; n = 585). The number of genes with available TWAS models was 39–82% higher in GALA II/SAGE compared with GTEx (n = 7,249) and MESAAFHI (n = 5,555). Restricting to 3,143 genes shared across all three studies, the cross-validation r2 was significantly higher in GALA II/SAGE compared with GTEx (PWilcoxon = 4.6 × 10−159) and MESAAFHI (PWilcoxon = 1.1 × 10−64) (Fig. 5a). TWAS models generated in GALA II/SAGE AA (n = 757) attained a higher cross-validation r2 than GTEx (PWilcoxon = 2.2 × 10−103), which had a comparable training sample size (Fig. 5b).
TWASs using GALA II/SAGE models across 28 PAGE traits identified a larger number of significant gene–trait pairs (n = 380; FDR < 0.05) than MESAAFHI (n = 303) and GTEx (n = 268), with only 35 significant gene–trait pairs detected in all three analyses (Fig. 5c). GALA II/SAGE models yielded a larger number of associated genes than MESA in 80% of analyses (binomial test; P = 0.012) and a larger number than GTEx in 79% of analyses (binomial test; P = 0.019). Of the 330 genes with an FDR of <0.05 in GALA II/SAGE, 143 (43%) were not present in GTEx and 199 (60%) were not present in MESAAFHI. For genes that were significant in at least one TWAS, z scores in GALA II/SAGE were highly correlated with GTEx (r = 0.74; P = 3.5 × 10−64; Fig. 5c) and MESAAFHI (r = 0.55; P = 8.5 × 10−27; Fig. 5d), suggesting that most genes have concordant effects even if they are not significant in both analyses.
HDL cholesterol exhibited one of the largest differences in TWAS associations (Fig. 6a), with over 60% more significant genes identified using GALA II/SAGE models (n = 29) than GTEx predictions (n = 11). TWAS models for several associated genes, including those with established effects on cholesterol transport and metabolism, such as CETP, were not available in GTEx. The top HDL-associated gene, CD36 (z score = −10.52; PTWAS = 6.9 × 10−26) had tier 1 AFRhigh anc-eQTLs (rs3211938) that were rare in European ancestry populations (MAF = 0.00013). The difference in MAF may explain why CD36 was not detected using GTEx (z score = 0.057; PTWAS = 0.95), even though all 43 variants from the GTEx model were available in PAGE summary statistics.
Although GALA II/SAGE multi-ancestry TWAS models showed robust performance, in some cases population-specific models may be preferred. For instance, benign neutropenia is a well-described phenomenon in persons of African ancestry and is attributed to variation in the 1q23.2 region. Applying GALA II/SAGE AA models to a meta-analysis of 13,476 individuals of African ancestry32 identified ACKR1 (PTWAS = 1.5 × 10−234), the atypical chemokine receptor gene that is the basis of the Duffy blood group system (Fig. 6b). This causal gene was missed by GTEx and MESAAFA. After conditioning on the Duffy-null rs2814778-CC genotype, no statistically significant TWAS associations remained on chromosome 1 (Supplementary Fig. 6). GALA II/SAGE AA models also detected seven genes outside of 1q23.2 that were not previously reported in GWASs of neutrophil counts: CREB5 (PTWAS = 1.5 × 10−14), DARS (PTWAS = 2.9 × 10−8), CD36 (PTWAS = 1.1 × 10−5), PPT2 (PTWAS = 1.3 × 10−5), SSH2 (PTWAS = 4.7 × 10−5), TOMM5 (PTWAS = 2.9 × 10−4) and ARF6 (PTWAS = 3.4 × 10−4).
Next, we performed TWASs of 22 blood-based biomarkers and quantitative traits using summary statistics from the UK Biobank (UKB). Ancestry-matched TWASs of UKB AFR (median GWAS n = 6,190) identified 56 gene–trait associations (FDR < 0.05), whereas GTEx detected only five genes (Extended Data Fig. 6). TWAS z scores for associated genes were modesty correlated (r = 0.37; 95% confidence interval (CI) = −0.01–0.66). TWASs in UKB EUR (median GWAS n = 400,223) also illustrated the advantage of ancestry-matched analyses, but the difference was less dramatic, with a 15% decrease in the number of genes that reached an FDR of <0.05 using GALA II/SAGE AA models and strong correlation between z scores (r = 0.77; 95% CI = 0.76–0.78). Concordance between significant associations across all traits was 28%, ranging from 32.7% for height to 7.6% for hemoglobin.