31 May 2024 | Qizhi Pei, Lijun Wu, Kaiyuan Gao, Xiaozhuhan Liang, Yin Fang, Jinhua Zhu, Shufang Xie, Tao Qin, Rui Yan
The paper introduces BioT5+, an advanced extension of the BioT5 framework designed to enhance biological research and drug discovery. BioT5+ incorporates several novel features to improve its capabilities:
1. **IUPAC Integration**: The model integrates IUPAC names for molecular understanding, allowing it to interpret chemical names as they appear in scientific literature, bridging the gap between formal molecular representations and textual descriptions.
2. **Expanded Data Sources**: BioT5+ includes extensive bio-text and molecule data from sources like bioRxiv and PubChem, broadening the knowledge base and enriching the contextual understanding of biological entities.
3. **Multi-task Instruction Tuning**: The model employs multi-task instruction tuning, which allows it to seamlessly integrate knowledge from diverse tasks, enhancing its predictive power and generalization capabilities across different biological and chemical domains.
4. **Advanced Numerical Tokenization**: An advanced character-based numerical tokenization technique is implemented to overcome the limitations of the original T5 dictionary, ensuring consistent and nuanced representation of numerical values.
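The character-based numerical tokenization in item 4 can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `split_numbers`, the `NUMBER` pattern, and the whitespace handling of non-numeric text are all assumptions made for illustration. The only idea taken from the summary is that each number is broken into single characters, so that every digit, sign, and decimal point is represented consistently.

```python
import re

# Hypothetical pattern: optionally signed integer or decimal literals.
NUMBER = re.compile(r"[-+]?\d+(?:\.\d+)?")

def split_numbers(text):
    """Split numeric literals into single characters.

    Non-numeric text is split on whitespace only; a real tokenizer
    would hand those pieces to its subword vocabulary instead.
    """
    tokens = []
    pos = 0
    for match in NUMBER.finditer(text):
        # Keep the text before the number as whitespace-separated words.
        tokens.extend(text[pos:match.start()].split())
        # Emit the number one character at a time (sign, digits, point).
        tokens.extend(match.group())
        pos = match.end()
    # Keep whatever follows the last number.
    tokens.extend(text[pos:].split())
    return tokens
```

Under this scheme `"logP is -2.54"` becomes `["logP", "is", "-", "2", ".", "5", "4"]`, so a value never depends on which multi-digit fragments happen to exist in the vocabulary.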
BioT5+ is evaluated on 21 benchmark datasets across 3 types of problems (classification, regression, generation) and 15 different tasks, demonstrating state-of-the-art performance in most cases. The model's ability to capture intricate relationships in biological data makes it a significant contribution to computational biology and bioinformatics.