2016, Vol. 44, No. 14 | Tatiana Tatusova, Michael DiCuccio, Azat Badretdin, Vyacheslav Chetvernin, Eric P. Nawrocki, Leonid Zaslavsky, Alexandre Lomsadze, Kim D. Pruitt, Mark Borodovsky, and James Ostell
The NCBI Prokaryotic Genome Annotation Pipeline (PGAP) is a tool for automatic annotation of prokaryotic genomes that combines alignment-based methods with ab initio gene prediction. The pipeline takes a pan-genome approach: homologous protein clusters are used to generate initial annotations, which are then refined with statistical predictions. Key features include:
1. **Pan-genome Approach**: Utilizes homologous protein clusters to define core genes and proteins, which serve as a map for annotation.
2. **Multiple Evidence Types**: Integrates alignment-based hints and ab initio predictions to improve accuracy.
3. **Two-Pass Annotation**: Aims to detect frameshifted genes and pseudogenes, enhancing the robustness of annotations.
4. **GeneMarkS+**: A self-training gene finder that integrates external evidence to predict protein-coding regions.
5. **High-Performance Execution**: Uses a modular framework (GPipe) for distributed computing and robust tracking of tasks.
6. **Quality Control**: Incorporates validation procedures to ensure biologically valid and consistently formatted data.
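To make the pan-genome idea in item 1 concrete, here is a minimal sketch of how a set of core clusters could be derived from per-genome cluster membership. It is an illustration only, not PGAP's implementation: the function name `core_clusters`, the input layout, the cluster and genome IDs, and the 0.8 presence threshold are all assumptions introduced for this example.

```python
from collections import defaultdict


def core_clusters(genome_to_clusters, min_fraction=0.8):
    """Return cluster IDs found in at least `min_fraction` of the genomes.

    `genome_to_clusters` maps a genome ID to the set of protein-cluster IDs
    that have a member in that genome. Both the layout and the default
    threshold are hypothetical, chosen for illustration.
    """
    if not genome_to_clusters:
        return set()

    # Count, for each cluster, the number of genomes that contain it.
    genomes_per_cluster = defaultdict(int)
    for clusters in genome_to_clusters.values():
        for cluster_id in set(clusters):
            genomes_per_cluster[cluster_id] += 1

    total = len(genome_to_clusters)
    return {
        cluster_id
        for cluster_id, count in genomes_per_cluster.items()
        if count / total >= min_fraction
    }


# Toy clade of five genomes with made-up cluster IDs.
genomes = {
    "genome_A": {"clu1", "clu2", "clu3"},
    "genome_B": {"clu1", "clu2"},
    "genome_C": {"clu1", "clu3"},
    "genome_D": {"clu1", "clu2", "clu4"},
    "genome_E": {"clu1", "clu2"},
}

print(sorted(core_clusters(genomes)))  # ['clu1', 'clu2']
```

In this toy clade, `clu1` occurs in all five genomes and `clu2` in four of five, so both pass the assumed threshold; `clu3` and `clu4` do not. A core set of this kind is what could then serve as the "map" against which a newly submitted genome is aligned.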
The PGAP pipeline has been shown to match GenBank annotations in over 98% of cases and has been integrated into the GenBank submission system. It has also been used to re-annotate RefSeq genomes, improving consistency and accuracy. The pipeline is designed to handle both well-studied and novel taxonomic lineages, making it a flexible and extensible tool for prokaryotic genome annotation.