2020 | Shanika L. Amarasinghe, Shian Su, Xueyi Dong, Luke Zappia, Matthew E. Ritchie, Quentin Gouil
The article reviews the current landscape of long-read sequencing technologies and their applications in genomics, focusing on the challenges and opportunities in data analysis. Long-read sequencing, chiefly Pacific Biosciences' (PacBio) single-molecule real-time (SMRT) sequencing and Oxford Nanopore Technologies' nanopore sequencing, offers advantages over short-read sequencing in read length and in the ability to sequence native molecules without amplification bias, although its raw reads have historically carried higher error rates. However, the rapid development of these technologies and the need for specialized analysis tools pose significant challenges.
The authors introduce long-read-tools.org, an online interactive database that compiles and categorizes tools for long-read data analysis, facilitating user access and exploration. They discuss key principles in long-read data analysis, including basecalling, error correction, base modification detection, and transcriptomics. Basecalling, the conversion of raw data to nucleic acid sequences, is a critical step that varies between SMRT and nanopore technologies. Error correction methods, both non-hybrid and hybrid (using short-read data), are essential for improving the accuracy of long-read assemblies. Base modification detection, particularly in RNA, is another area of focus, with advancements in both SMRT and nanopore sequencing technologies. Long-read transcriptomics is highlighted as a rapidly growing field, but it faces challenges such as high error rates and the need for comprehensive annotations.
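As a toy illustration of the idea behind non-hybrid (self) correction, the sketch below takes a majority vote at each position across several noisy reads of the same region. It is only a minimal sketch under strong assumptions: the reads are taken to be already aligned to one another and padded to equal length with `-` for gaps, and the function name `consensus` and the example reads are hypothetical. It is not the method of any tool discussed in the review; real correctors must compute the alignments themselves and handle insertions, deletions, and uneven coverage.

```python
from collections import Counter


def consensus(aligned_reads):
    """Column-wise majority vote over pre-aligned, equal-length reads.

    Gaps are written as '-'. A column whose most common symbol is a gap
    contributes nothing to the consensus. Ties go to the symbol seen first.
    """
    if not aligned_reads:
        return ""
    length = len(aligned_reads[0])
    if any(len(read) != length for read in aligned_reads):
        raise ValueError("reads must be pre-aligned to the same length")

    result = []
    for column in zip(*aligned_reads):
        # Pick the most frequent symbol in this alignment column.
        base, _count = Counter(column).most_common(1)[0]
        if base != "-":
            result.append(base)
    return "".join(result)


# Hypothetical noisy reads of the same region, each with at most one error.
reads = [
    "ACGTTAGC",
    "ACGTAAGC",  # substitution at position 5
    "ACCTTAGC",  # substitution at position 3
    "ACGTTAGC",
    "ACGTT-GC",  # deletion at position 6
]
print(consensus(reads))
```

Because each error in this example occurs in only one read, the vote at every position is dominated by the correct base. Hybrid correction follows the same intuition but draws the supporting evidence from more accurate short reads instead of from other long reads.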
The article also emphasizes the importance of benchmarking and the development of best practices to ensure the reliability and efficiency of long-read analysis tools. Despite the progress, scalability, data integration, and the need for more accurate and comprehensive annotations remain significant hurdles in the field. The authors conclude by highlighting the potential of long-read sequencing in genomics and the ongoing efforts to overcome the challenges in data analysis.