A deep catalogue of protein-coding variation in 983,578 individuals

20 May 2024 | Kathie Y. Sun, Xiaodong Bai, Siying Chen, Suying Bao, Chuanyi Zhang, Manav Kapoor, Joshua Backman, Tyler Joseph, Evan Maxwell, George Mitra, Alexander Gorovits, Adam Mansfield, Boris Boutkov, Sujit Gokhale, Lukas Habegger, Anthony Marcketta, Adam E. Locke, Liron Ganel, Alicia Hawes, Michael D. Kessler, Deepika Sharma, Jeffrey Staples, Jonas Bovijn, Sahar Gelfman, Alessandro Di Gioia, Veera M. Rajagopal, Alexander Lopez, Jennifer Rico Varela, Jesús Alegre-Díaz, Jaime Berumen, Roberto Tapia-Conyer, Pablo Kuri-Morales, Jason Torres, Jonathan Emberson, Rory Collins, Regeneron Genetics Center, RGC-ME Cohort Partners, Michael Cantor, Timothy Thornton, Hyun Min Kang, John D. Overton, Alan R. Shuldiner, M. Laura Cremona, Mona Nafde, Aris Baras, Gonçalo Abecasis, Jonathan Marchini, Jeffrey G. Reid, William Salerno & Suganthi Balasubramanian
This study presents a comprehensive catalogue of human protein-coding variation derived from exome sequencing of 983,578 individuals from diverse populations. The catalogue includes over 10.4 million missense and 1.1 million predicted loss-of-function (pLOF) variants. Key findings include:

1. **Gene-Level Constraint**: The study estimated the indispensability of 16,710 protein-coding genes using a heterozygous selection coefficient (s_het), identifying 3,988 highly constrained genes. These genes are likely to be functionally important, even though some lack known disease associations.
2. **Regional Constraint**: The missense tolerance ratio (MTR) was used to identify functionally important regions within genes. The study identified 41,114 missense-constrained regions in 12,349 genes, highlighting key functional regions such as DNA-binding sites and active sites.
3. **Human Knockouts**: The dataset included 4,848 genes with rare biallelic pLOF variants, providing insights into gene function through natural 'human knockouts'. These genes were significantly less constrained than other genes, suggesting they are under weaker heterozygous selective pressure.
4. **Splice-Affecting Variants**: Predicted deleterious splice-affecting variants (SAVs) were identified using splice prediction tools. The study found that 296,696 predicted deleterious coding SAVs exceeded the MAPS-derived splicing thresholds, 43.5% of which were cryptic splice sites.
5. **Clinical Utility**: The prevalence of disease-associated alleles was assessed, with 3.06% of individuals carrying pathogenic or likely pathogenic variants. The study also highlighted the importance of diverse populations in identifying rare variation and the need for caution in interpreting variant annotations.

The RGC-ME dataset, publicly available, provides a valuable resource for interpreting rare variants and advancing precision medicine.
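
To make the regional constraint finding more concrete, the sketch below computes a missense tolerance ratio for a single window of a gene. It uses the commonly published definition of MTR: the observed fraction of missense variants among missense plus synonymous variants, divided by the fraction expected if every possible single-nucleotide change occurred. The summary above does not describe the study's exact implementation, so the function name, its parameters, and the example counts are all illustrative assumptions rather than the authors' pipeline or data.

```python
from typing import Optional


def missense_tolerance_ratio(
    obs_missense: int,
    obs_synonymous: int,
    possible_missense: int,
    possible_synonymous: int,
) -> Optional[float]:
    """Return the MTR for one window, or None when it is undefined.

    Hypothetical helper: the counts are assumed to come from a fixed-size
    window of codons. Values below 1 indicate fewer missense variants than
    expected, i.e. a window that is relatively intolerant of missense change.
    """
    observed_total = obs_missense + obs_synonymous
    # Undefined if the window has no observed variants or no possible
    # missense changes (the expected fraction would be zero).
    if observed_total == 0 or possible_missense == 0:
        return None

    possible_total = possible_missense + possible_synonymous
    observed_fraction = obs_missense / observed_total
    expected_fraction = possible_missense / possible_total
    return observed_fraction / expected_fraction


# Made-up counts for illustration only (not taken from the study).
constrained = missense_tolerance_ratio(10, 30, 70, 30)
neutral = missense_tolerance_ratio(7, 3, 70, 30)
```

With these made-up counts, the first window has a ratio well below 1 (missense variants are depleted), while the second window's observed missense fraction matches the expected fraction, giving a ratio of 1.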