UMI-tools: Modelling sequencing errors in Unique Molecular Identifiers to improve quantification accuracy

UMI-tools is a software package that improves quantification accuracy in high-throughput sequencing by modeling sequencing errors in Unique Molecular Identifiers (UMIs). UMIs are random barcodes used to distinguish reads that derive from distinct molecules from PCR duplicates of the same molecule. However, sequencing errors in UMIs are often ignored, leading to inaccurate quantification. UMI-tools introduces network-based methods that account for these errors when identifying PCR duplicates, improving accuracy in both simulated and real data, including iCLIP and single-cell RNA-seq datasets.

The study shows that errors in the UMI sequence, such as nucleotide substitutions, base miscalling, and insertions/deletions (indels), can create artificial UMIs and so inflate the estimated number of unique molecules. Substitutions and miscalls are more common than UMI indels, which affect alignment coordinates. The study proposes three network-based methods to identify unique molecules: cluster, adjacency, and directional. The directional method is the most accurate and robust because it considers both the counts of UMIs and the relationships between them, reducing the impact of sequencing errors.

The methods were tested on simulated data and on real iCLIP and single-cell RNA-seq datasets. The directional method was more accurate than the other methods and reduced variability. It improved reproducibility between iCLIP replicates and improved clustering in single-cell RNA-seq data. The study also found that longer UMIs can reduce accuracy if UMI errors are not accounted for, and that the directional method performs better with longer UMIs.

UMI-tools is implemented as an open-source command-line tool with two commands: extract and dedup. Extract appends UMIs to read identifiers, while dedup removes PCR duplicates using UMI sequences. The software can be integrated into existing sequencing analysis pipelines.
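The directional method summarized above can be illustrated with a minimal sketch. It is not the UMI-tools source code. The edge rule used here (a UMI may absorb a neighbour one mismatch away when its count is at least twice the neighbour's count minus one) is an assumption based on the published description of the method, not on this summary. The function names `hamming` and `directional_groups` and the toy counts are hypothetical.

```python
from collections import deque


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def directional_groups(umi_counts, max_dist=1):
    """Group UMIs observed at one alignment position.

    Assumed rule: a directed edge a -> b exists when the UMIs differ by at
    most max_dist bases and count(a) >= 2 * count(b) - 1. Each group is
    everything reachable from the most abundant UMI not yet assigned.
    """
    # Visit UMIs from most to least abundant (ties broken alphabetically).
    umis = sorted(umi_counts, key=lambda u: (-umi_counts[u], u))

    children = {u: [] for u in umis}
    for a in umis:
        for b in umis:
            if a == b or len(a) != len(b):
                continue
            if (hamming(a, b) <= max_dist
                    and umi_counts[a] >= 2 * umi_counts[b] - 1):
                children[a].append(b)

    seen = set()
    groups = []
    for root in umis:
        if root in seen:
            continue
        seen.add(root)
        group = [root]
        queue = deque([root])
        while queue:  # breadth-first walk along directed edges
            node = queue.popleft()
            for child in children[node]:
                if child not in seen:
                    seen.add(child)
                    group.append(child)
                    queue.append(child)
        groups.append(group)
    return groups


# Toy counts for illustration only.
toy_counts = {"ACGT": 456, "AAAT": 90, "ACAT": 72,
              "TCGT": 2, "CCGT": 2, "ACAG": 1}
groups = directional_groups(toy_counts)
print(len(groups), groups)
```

Under these assumptions each group is counted as one original molecule, so low-count UMIs that sit one mismatch away from an abundant UMI are treated as errors rather than as separate molecules.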
The study recommends using UMIs of at least 8 bp in length and longer UMIs for higher sequencing depth experiments. The results demonstrate the importance of modeling UMI errors to improve quantification accuracy and reproducibility in sequencing experiments.
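One way to see the trade-off behind the UMI length recommendation is a small calculation. A longer UMI offers more possible barcodes, but it also gives more positions at which an error can occur. The sketch below assumes independent errors with a fixed per-base error rate. The 1% rate and the function names `umi_space` and `prob_umi_error` are illustrative assumptions, not values or code from the study.

```python
def umi_space(length):
    """Number of distinct UMIs of a given length over the alphabet A, C, G, T."""
    return 4 ** length


def prob_umi_error(length, per_base_error):
    """Probability that a UMI contains at least one erroneous base,
    assuming independent errors at each position."""
    return 1.0 - (1.0 - per_base_error) ** length


# Hypothetical per-base error rate of 1%, for illustration only.
for length in (6, 8, 10, 12):
    print(length, umi_space(length), round(prob_umi_error(length, 0.01), 4))
```

Under this simple model the fraction of UMIs carrying an error grows with length, which is consistent with the observation that longer UMIs can reduce accuracy when UMI errors are not accounted for.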

UMI-tools: modeling sequencing errors in Unique Molecular Identifiers to improve quantification accuracy

2017 | Tom Smith, Andreas Heger, and Ian Sudbery