This paper introduces an automated method for de novo identification and classification of repeat sequence families in sequenced genomes. The approach extends traditional single linkage clustering by incorporating multiple alignment information to define repeat boundaries and distinguish homologous but distinct families. The method, called RECON, was tested on the human genome and successfully identified known transposable elements. It is useful for initial classification of repeats in newly sequenced genomes.
Repetitive sequences are a major component of eukaryotic genomes and are classified into three main types: local repeats, dispersed repeats, and segmental duplications. Computational tools like RepeatMasker are effective for known repeat families, but they depend on precompiled repeat libraries and therefore cannot be applied directly to a newly sequenced genome whose repeats have not yet been characterized. RECON addresses this by using multiple alignment data to infer element boundaries and define biologically meaningful families.
The paper discusses challenges in defining repeat boundaries, particularly when dealing with partial or fragmented repeats, segmental duplications, and related but distinct elements. Existing methods, such as single linkage clustering, often fail to accurately define these boundaries. RECON improves upon these methods by using multiple alignment information to distinguish between different biological scenarios, such as partial elements and segmental duplications.
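To make the baseline concrete, here is a minimal sketch of single linkage clustering over pairwise alignments. It is an illustration only, not the paper's implementation: the function name `single_linkage_families`, the `(sequence, start, end)` interval representation, and the toy coordinates are all assumptions. Two intervals are linked if they are aligned to each other or if they overlap on the same sequence, and each connected component is reported as one family.

```python
from collections import defaultdict


def single_linkage_families(alignments):
    """Group aligned intervals into families by single linkage.

    alignments: list of pairs of half-open intervals (seq, start, end).
    This representation is a hypothetical choice made for the sketch.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    # Link the two sides of every pairwise alignment.
    for a, b in alignments:
        union(a, b)

    # Link intervals that overlap on the same sequence.
    by_seq = defaultdict(list)
    for iv in list(parent):
        by_seq[iv[0]].append(iv)
    for ivs in by_seq.values():
        ivs.sort(key=lambda iv: iv[1])
        current, reach = ivs[0], ivs[0][2]
        for iv in ivs[1:]:
            if iv[1] < reach:
                union(current, iv)
                reach = max(reach, iv[2])
            else:
                current, reach = iv, iv[2]

    families = defaultdict(set)
    for iv in list(parent):
        families[find(iv)].add(iv)
    return sorted((sorted(f) for f in families.values()), key=lambda f: f[0])


# Toy input: two unrelated elements sit next to each other on chr1 and
# their alignments overlap by 10 bases, so single linkage merges them.
alignments = [
    (("chr1", 0, 100), ("chr2", 0, 100)),
    (("chr1", 90, 200), ("chr3", 0, 110)),
    (("chr4", 0, 50), ("chr5", 0, 50)),
]
families = single_linkage_families(alignments)
for family in families:
    print(family)
```

In this toy input the chr1, chr2, and chr3 intervals all end up in one family even though they come from two different elements. That is the kind of boundary error the paragraph above attributes to single linkage clustering.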
The RECON algorithm is described in detail, including steps for defining elements, reevaluating their boundaries, and clustering them into families based on sequence similarity. The algorithm was implemented as a set of C programs and Perl scripts, and tested on a sample of the human genome. It was compared to RepeatMasker and showed improved performance in accurately identifying repeat families, particularly for the Alu element.
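The boundary reevaluation step can be pictured with a small sketch. This summary does not give RECON's actual rule, so the code below is a simplified illustration under stated assumptions: the function `candidate_boundaries`, the `tolerance` and `min_fraction` parameters, and the example coordinates are hypothetical. The idea shown is that when many alignments covering a region start or stop at nearly the same position, that position is a plausible element boundary, whereas an isolated alignment end more likely reflects a partial copy.

```python
def candidate_boundaries(intervals, tolerance=10, min_fraction=0.5):
    """Return positions where alignment ends aggregate on one sequence.

    intervals: list of (start, end) alignment coordinates on that sequence.
    tolerance and min_fraction are hypothetical tuning parameters.
    """
    # Pool every start and end coordinate, then chain nearby ones together.
    ends = sorted(p for iv in intervals for p in iv)
    clusters = []
    for p in ends:
        if clusters and p - clusters[-1][-1] <= tolerance:
            clusters[-1].append(p)
        else:
            clusters.append([p])

    boundaries = []
    for cluster in clusters:
        lo, hi = cluster[0], cluster[-1]
        # Alignments that touch or span the window occupied by this cluster.
        involved = sum(1 for s, e in intervals if s <= hi and e >= lo)
        # Accept the cluster if enough of those alignments end inside it.
        if len(cluster) / involved >= min_fraction:
            boundaries.append(cluster[len(cluster) // 2])
    return boundaries


# Three alignments agree on an element spanning roughly 100-400;
# a fourth is a partial copy that stops at 250.
example = [(100, 400), (103, 398), (98, 402), (101, 250)]
print(candidate_boundaries(example))  # [101, 400]
```

The lone end at position 250 is rejected because only one of the four alignments spanning it stops there, while the ends near 100 and 400 are supported by every alignment that reaches them. A rule of this kind depends on the assumption that true boundaries produce clustered alignment ends, so the thresholds would need tuning for other repeat compositions.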
The paper also discusses the limitations of RECON, including its inability to recover highly fragmented families in one piece and its sensitivity to alignment end clustering assumptions. It suggests that parameter tuning may be necessary for different repeat compositions. Overall, RECON provides a valuable tool for de novo identification of repeat families in sequenced genomes, improving upon existing methods by incorporating multiple alignment information and addressing challenges in defining repeat boundaries.