[slides and audio] Assignment of epidemiological lineages in an emerging pandemic using the pangolin tool

The paper introduces pangolin, a computational tool that assigns the most likely lineage to a given SARS-CoV-2 genome sequence under the Pango dynamic lineage nomenclature system. Pangolin has been widely used to analyze SARS-CoV-2 genomic data: over 1.8 million genomes have been assigned to lineages through the pangolin web application. The tool gives researchers access to actionable information about the pandemic's transmission lineages.
The Pango nomenclature is hierarchical and fine-scaled, designed to capture the leading edge of pandemic transmission, and each Pango lineage aims to define an epidemiologically relevant phylogenetic cluster. Pangolin assigns lineages using a combination of manual curation and machine learning. Its machine learning model, pangoLEARN, is trained on sequence data from GISAID and updated regularly to reflect new lineage designations.
The paper describes the development and testing of pangolin, including its performance in cases of excess diversity, under varying levels of ambiguity, and in the face of novel recombinants. Tests on simulated data showed that it can accurately assign lineages even when ambiguity is high or when sequences are highly divergent from the training data. The tool nevertheless has limitations: it cannot detect novel recombinants, it depends on regular updates to the list of Pango designated sequences, and ambiguous or incomplete query data can lead to misassignments. The authors note that, because the Pango nomenclature is hierarchical, a more rootward assignment can be interpreted as a lower-resolution classification rather than an incorrect one.
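
The point about rootward assignments can be illustrated with a small sketch. It is not pangolin's own code: the function names `is_rootward_or_same` and `classify_assignment` are hypothetical, and the sketch assumes both lineage names are written in full, unaliased dotted form (Pango shortens deep names with aliases, which would need to be expanded first).

```python
def is_rootward_or_same(assigned: str, true: str) -> bool:
    """Return True if `assigned` equals `true` or is an ancestor of it.

    Assumption: both names are unaliased dotted Pango names, so ancestry
    can be read off as a prefix match on the dot-separated components.
    """
    assigned_parts = assigned.split(".")
    true_parts = true.split(".")
    return (
        len(assigned_parts) <= len(true_parts)
        and true_parts[: len(assigned_parts)] == assigned_parts
    )


def classify_assignment(assigned: str, true: str) -> str:
    """Label an assignment as exact, lower resolution, or a mismatch."""
    if assigned == true:
        return "exact"
    if is_rootward_or_same(assigned, true):
        # A parent lineage is a coarser answer, not a wrong one.
        return "lower resolution"
    return "mismatch"
```

Under these assumptions, `classify_assignment("B.1", "B.1.1.7")` returns `"lower resolution"`, while `classify_assignment("B.1.351", "B.1.1.7")` returns `"mismatch"`. Comparing whole components rather than raw string prefixes keeps `B.1.1.70` from being treated as related to `B.1.1.7`.
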
The authors conclude that pangolin is a responsive, scalable tool for lineage assignment, and that the framework it implements could be adapted for use in future outbreaks involving other viruses. The tool is publicly available on GitHub and is open-source, allowing the broader community to contribute to the growing dynamic list of SARS-CoV-2 lineages. The paper highlights the importance of genomic surveillance in understanding the spread and evolution of SARS-CoV-2, and the role of tools like pangolin in enabling real-time analysis of genomic data.