Understanding the NCBI prokaryotic genome annotation pipeline

The NCBI Prokaryotic Genome Annotation Pipeline (PGAP) is a system for automatically annotating prokaryotic genomes. It combines alignment-based methods with direct, sequence-based prediction of protein-coding and RNA genes. The pipeline takes a pan-genome approach, defining the core genes common to most genomes in a clade. It incorporates specialized tools for identifying non-protein-coding elements such as CRISPR regions, and it uses a two-pass approach to detect frameshifted genes and pseudogenes. GeneMarkS+ is a key component that integrates extrinsic and intrinsic information for gene prediction.

PGAP runs on a modular framework (GPipe) for efficient processing, with a system for tracking and managing annotation data. Its output includes protein and RNA gene predictions supported by multiple evidence types, and it adopts a new protein data model to reduce redundancy. PGAP has been used to annotate over 8,000 GenBank genomes and re-annotate over 30,000 RefSeq genomes. The pipeline is continuously updated with new algorithms and features to improve accuracy and efficiency, which helps it keep pace with the growing volume of prokaryotic genome data while maintaining annotation quality. It is widely used by the scientific community and has been integrated into NCBI's submission systems. Its robustness and flexibility make it a key tool for prokaryotic genome analysis.
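To make the pan-genome idea concrete, the following is a minimal sketch of how "core genes common to most genomes in a clade" could be computed from per-genome gene-cluster sets. It is an illustration only, not PGAP's implementation: the function name `core_genes`, the 0.8 presence threshold, and the toy genome and gene names are all assumptions introduced here.

```python
from collections import Counter


def core_genes(genomes, min_fraction=0.8):
    """Return gene clusters present in at least `min_fraction` of genomes.

    `genomes` maps a genome name to the set of gene-cluster identifiers
    found in it. The 0.8 default is an illustrative assumption, not a
    value taken from the summary above.
    """
    if not genomes:
        return set()
    counts = Counter()
    for clusters in genomes.values():
        # Count each cluster at most once per genome.
        counts.update(set(clusters))
    total = len(genomes)
    return {c for c, n in counts.items() if n / total >= min_fraction}


# Hypothetical clade of five genomes (toy data).
clade = {
    "genome_A": {"dnaA", "recA", "gyrB", "toxX"},
    "genome_B": {"dnaA", "recA", "gyrB"},
    "genome_C": {"dnaA", "recA", "phageY"},
    "genome_D": {"dnaA", "recA", "gyrB"},
    "genome_E": {"dnaA", "recA", "gyrB"},
}

print(sorted(core_genes(clade)))  # ['dnaA', 'gyrB', 'recA']
```

In this toy clade, `gyrB` is missing from one genome but still counts as core because it appears in four of five genomes, which is the sense of "most" rather than "all". Raising `min_fraction` to 1.0 would restrict the core set to genes found in every genome.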

NCBI prokaryotic genome annotation pipeline

2016 | Tatiana Tatusova, Michael DiCuccio, Azat Badretdin, Vyacheslav Chetvernin, Eric P. Nawrocki, Leonid Zaslavsky, Alexandre Lomsadze, Kim D. Pruitt, Mark Borodovsky and James Ostell