Rapid and sensitive detection of genome contamination at scale with FCS-GX

2024 | Alexander Astashyn, Eric S. Tvedte, Deacon Sweeney, Victor Sapojnikov, Nathan Bouk, Victor Joukov, Eyal Mozes, Pooja K. Strope, Pape M. Sylla, Lukas Wagner, Shelby L. Bidwell, Larissa C. Brown, Karen Clark, Emily W. Davis, Brian Smith-White, Wratko Hlavina, Kim D. Pruitt, Valerie A. Schneider and Terence D. Murphy
FCS-GX is a tool developed by the National Center for Biotechnology Information (NCBI) to detect and remove contamination in new genome assemblies. It is part of the Foreign Contamination Screen (FCS) tool suite and is optimized for rapid and sensitive identification of contaminant sequences. FCS-GX screens most genomes in 0.1–10 minutes, with high sensitivity and specificity for diverse contaminant species. When tested on 1.6 million GenBank assemblies, FCS-GX identified 36.8 Gbp of contamination, comprising 0.16% of total bases, with half of it coming from 161 assemblies. Assemblies in NCBI RefSeq were updated to reduce detected contamination to 0.01% of bases. FCS-GX is available at https://github.com/ncbi/fcs/ or https://doi.org/10.5281/zenodo.10651084.

Genome contamination is a significant problem in genome assemblies because it can lead to incorrect conclusions in biological inference. Contaminants arise from various sources, including symbionts, infection, gut and surface microbes, and diet. FCS-GX uses a genome cross-species aligner to identify contamination from foreign organisms using hashed k-mer (h-mer) matches against a curated reference database. It is designed to detect contaminants with high sensitivity and specificity, and to automate the identification and removal of contaminant sequences with minimal user interaction.

FCS-GX has been shown to detect contaminants accurately with few false positives. It demonstrated high sensitivity (Sn) across diverse samples from six tested kingdom groups, with 76% of prokaryote and 91% of eukaryote datasets achieving Sn above 95% with 1 kbp fragments. It also showed high specificity (Sp), with Sp above 99.98% in all scenarios. FCS-GX is scalable and can process large numbers of genomes efficiently, with a throughput of 1.94 s/genome for prokaryote genomes.
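
The sensitivity (Sn) and specificity (Sp) figures quoted above can be made concrete with a small calculation. The sketch below assumes the standard definitions, Sn = TP / (TP + FN) and Sp = TN / (TN + FP), applied to counts of classified fragments. The `FragmentCounts` container and the numbers in it are hypothetical, chosen only for illustration; they are not results from the paper and this is not code from FCS-GX.

```python
from dataclasses import dataclass


@dataclass
class FragmentCounts:
    """Confusion-matrix counts for classified fragments (hypothetical container)."""
    true_pos: int   # contaminant fragments flagged as contaminant
    false_neg: int  # contaminant fragments that were missed
    true_neg: int   # host fragments correctly left unflagged
    false_pos: int  # host fragments wrongly flagged as contaminant


def sensitivity(c: FragmentCounts) -> float:
    """Fraction of true contaminant fragments that were detected."""
    total = c.true_pos + c.false_neg
    if total == 0:
        raise ValueError("no contaminant fragments to evaluate")
    return c.true_pos / total


def specificity(c: FragmentCounts) -> float:
    """Fraction of host fragments that were correctly left unflagged."""
    total = c.true_neg + c.false_pos
    if total == 0:
        raise ValueError("no host fragments to evaluate")
    return c.true_neg / total


# Made-up counts for illustration only.
counts = FragmentCounts(true_pos=960, false_neg=40, true_neg=99_990, false_pos=10)
print(f"Sn = {sensitivity(counts):.2%}")
print(f"Sp = {specificity(counts):.2%}")
```

With these invented counts, the dataset would clear the Sn above 95% bar mentioned above, while ten false positives in 100,000 host fragments would leave Sp just under 99.99%.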

FCS-GX has been used to detect extensive contamination in NCBI databases, identifying 36.8 Gbp of suspected contamination in 1.6 million assemblies. It has also been used to clean RefSeq genomes, reducing contaminant bases in eukaryote and prokaryote genomes by 90% and 53%, respectively, compared to their peaks in 2020. FCS-GX is not adversely affected by lateral gene transfer, as it can classify chimeric sequences with a mix of correct and contaminant spans. FCS-GX is a fast and accurate tool for detecting and removing genome contamination, with high specificity and sensitivity. Genome submitters are encouraged to use it to screen sequences early in the assembly process.
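
To picture what handling a chimeric sequence with a mix of correct and contaminant spans can involve, here is a minimal sketch that excises flagged spans from a sequence and keeps the remaining pieces. It is an illustration under stated assumptions, not the cleaning logic of FCS-GX itself: the function names, the 0-based half-open span convention, and the toy sequence are all hypothetical.

```python
from typing import List, Tuple

# Half-open [start, end) interval, 0-based. This convention is an assumption
# made for the sketch, not the coordinate system of any FCS-GX report.
Span = Tuple[int, int]


def merge_spans(spans: List[Span]) -> List[Span]:
    """Sort spans and merge any that overlap or touch."""
    merged: List[Span] = []
    for start, end in sorted(spans):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def retained_pieces(sequence: str, contaminant_spans: List[Span]) -> List[str]:
    """Return the parts of `sequence` that lie outside all contaminant spans."""
    pieces: List[str] = []
    cursor = 0
    for start, end in merge_spans(contaminant_spans):
        if start > cursor:
            pieces.append(sequence[cursor:start])
        cursor = max(cursor, end)
    if cursor < len(sequence):
        pieces.append(sequence[cursor:])
    return pieces


# Toy example: two overlapping flagged spans in the middle of a sequence.
seq = "AAAACCCCGGGGTTTT"
pieces = retained_pieces(seq, [(4, 8), (6, 10)])
print(pieces)
```

Merging the spans first means overlapping or adjacent contaminant calls are removed as one region, so the sequence is never split into empty pieces between them.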