[slides and audio] Ten common issues with reference sequence databases and how to mitigate them

This review highlights ten common issues with reference sequence databases and how to mitigate them. Metagenomic sequencing has revolutionized microbiology, but the reference sequence databases it depends on are often flawed, leading to inaccurate classification. The issues include taxonomic misannotation, unspecific labeling, underrepresentation or overrepresentation of taxa, contamination, poor-quality sequences, unmasked low-complexity sequences, and database maintenance challenges.

Taxonomic misannotation can lead to false positives or incorrect classifications. Unspecific labeling may prevent accurate identification at the species level. Underrepresentation of certain taxa can result in missed detections, while overrepresentation can skew results. Contamination, whether from multiple organisms in the same assembly or from chimeric sequences, also reduces accuracy. Poor-quality sequences, such as fragmented or incomplete genomes, hinder classification. Low-complexity sequences can cause false positives if they are not masked. Database maintenance is complex and resource-intensive, requiring regular updates and curation.

Mitigation strategies include using high-quality databases, applying strict inclusion criteria, masking low-complexity sequences, and ensuring regular updates. Tools like CheckM, BUSCO, and QUAST help assess the completeness, contamination, and assembly quality of candidate genomes, and so help improve database quality. Long-read sequencing and improved taxonomic frameworks are expected to enhance reference sequence accuracy and reduce these issues in metagenomic analysis.
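To make the low-complexity masking step more concrete, the following is a minimal Python sketch that replaces bases in low-entropy windows with N. It is an illustration under stated assumptions, not the method used in the review: the function names, the window size of 20, and the entropy threshold of 1.0 bits are all hypothetical choices, and the sliding-window Shannon entropy used here is a simple stand-in for dedicated maskers such as NCBI's dustmasker.

```python
import math
from collections import Counter


def shannon_entropy(window: str) -> float:
    """Shannon entropy (bits per base) of the characters in a window."""
    counts = Counter(window)
    total = len(window)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def mask_low_complexity(seq: str, window: int = 20, min_entropy: float = 1.0) -> str:
    """Replace every base covered by a low-entropy window with 'N'.

    window and min_entropy are illustrative defaults (assumptions),
    not values taken from the review. Sequences shorter than the
    window are returned unmasked (only upper-cased).
    """
    seq = seq.upper()
    masked = list(seq)
    for start in range(len(seq) - window + 1):
        if shannon_entropy(seq[start:start + window]) < min_entropy:
            # Mask the whole window, including any flanking bases it covers.
            masked[start:start + window] = "N" * window
    return "".join(masked)


if __name__ == "__main__":
    # A homopolymer run flanked by more complex sequence.
    example = "ACGT" * 5 + "A" * 30 + "ACGT" * 5
    print(mask_low_complexity(example))
```

Because every window below the threshold is masked in full, a few bases flanking a repeat are masked as well. Whether that is acceptable depends on the classifier and on how aggressively false positives need to be suppressed.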

15 March 2024 | Samuel D. Chorlton