Ab initio gene identification in metagenomic sequences

This paper presents an improved method for ab initio gene identification in metagenomic sequences. The method estimates model parameters from evolutionary dependencies between oligonucleotide frequencies in protein-coding regions and genome nucleotide composition. The approach was originally developed in 1999 and has been used for gene finding in viral genomes and for initializing gene finding algorithms. Gene finding relies on a hidden Markov model (HMM) whose parameters are derived from genome-specific codon frequencies: codon frequencies are estimated from global nucleotide frequencies and then used to derive the parameters of second-order Markov models. Linear regression was first used to approximate codon frequencies as a function of genome GC content. The method was then refined with direct non-linear approximations of oligonucleotide frequencies (polynomial and logistic regression) and with two distinct heuristic models, one for bacterial and one for archaeal sequences, or one for mesophilic and one for thermophilic species. Tested on known prokaryotic genomes split into short sequences of various lengths, the refined method predicted genes more accurately than previous methods. Applied to human and mouse gut metagenomes, it revealed thousands of new genes and improved annotation. It was also used to infer the origin of genes and sequence fragments, identifying bacterial and archaeal origins with high accuracy.
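The regression step described above, approximating codon frequencies from genome GC content, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the function names `fit_line` and `estimate_codon_frequencies`, the two-codon toy alphabet, the made-up training values, and the final renormalization step. It shows only the linear variant and is not the authors' implementation.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def estimate_codon_frequencies(gc, fits):
    """Predict each codon's frequency at a given GC content.

    Negative predictions are clipped to zero and the result is
    renormalized so the frequencies sum to 1 (an assumption of this sketch).
    """
    raw = {codon: max(slope * gc + intercept, 0.0)
           for codon, (slope, intercept) in fits.items()}
    total = sum(raw.values())
    return {codon: value / total for codon, value in raw.items()}


# Made-up training data: GC content of three hypothetical genomes and the
# frequency of two codons in each (a toy alphabet, not real measurements).
gc_train = [0.3, 0.5, 0.7]
freq_train = {
    "GCC": [0.2, 0.5, 0.8],  # rises with GC content
    "AAA": [0.8, 0.5, 0.2],  # falls with GC content
}

fits = {codon: fit_line(gc_train, freqs) for codon, freqs in freq_train.items()}
estimate = estimate_codon_frequencies(0.6, fits)
```

In this toy setup a fragment with 60% GC content gets a higher estimated frequency for the GC-rich codon than for the AT-rich one. A polynomial or logistic fit would replace `fit_line` while leaving the rest of the sketch unchanged.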
A web interface for gene prediction in metagenomic sequences was also developed on top of the method. The results show that gene prediction in fragmented sequences of prokaryotic genomes has the same rate of success as in complete genomes. The method was found to be more accurate than previous methods, including MetaGene and MetaGeneAnnotator.
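The summary does not say how the origin of a fragment is inferred from the two heuristic models. One plausible reading, shown here purely as an assumption, is to score the fragment under each second-order Markov model and keep the model with the higher log-likelihood. The function names, the add-one smoothing, and the two toy training strings are hypothetical; in the method itself the two models would be the bacterial and archaeal heuristic models rather than models trained on toy strings.

```python
import math
from collections import defaultdict

ALPHABET = "ACGT"


def train_second_order(seq, pseudocount=1.0):
    """Estimate P(next nucleotide | previous two) with additive smoothing."""
    counts = defaultdict(lambda: defaultdict(float))
    for i in range(2, len(seq)):
        counts[seq[i - 2:i]][seq[i]] += 1
    model = {}
    for a in ALPHABET:
        for b in ALPHABET:
            ctx = a + b
            total = sum(counts[ctx].values()) + pseudocount * len(ALPHABET)
            model[ctx] = {n: (counts[ctx][n] + pseudocount) / total
                          for n in ALPHABET}
    return model


def log_likelihood(seq, model):
    """Sum of log transition probabilities, starting at the third nucleotide."""
    return sum(math.log(model[seq[i - 2:i]][seq[i]])
               for i in range(2, len(seq)))


def pick_model(fragment, models):
    """Return the name of the model under which the fragment scores highest."""
    return max(models, key=lambda name: log_likelihood(fragment, models[name]))


# Toy stand-ins for the two heuristic models (made-up training strings).
models = {
    "toy_gc_rich": train_second_order("GCGCGGCCGCGCCGGC" * 4),
    "toy_at_rich": train_second_order("ATATTAATATTAAATT" * 4),
}

gc_choice = pick_model("GCCGCGGC", models)
at_choice = pick_model("ATTAATAT", models)
```

Comparing log-likelihoods rather than raw probabilities avoids numerical underflow on longer fragments, which is why the sketch sums logarithms.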

Ab initio gene identification in metagenomic sequences

2010 | Wenhan Zhu, Alexandre Lomsadze and Mark Borodovsky