Fine-tuning large language models for chemical text mining

2024 | Wei Zhang, Qinggong Wang, Xiangtai Kong, Jiacheng Xiong, Shengkun Ni, Duanhua Cao, Buying Niu, Mingan Chen, Yameng Li, Runze Zhang, Yitian Wang, Lehan Zhang, Xutong Li, Zhaoqing Xiong, Qian Shi, Ziming Huang, Zunyun Fu, Mingyue Zheng
This study evaluates the effectiveness of fine-tuning large language models (LLMs) for chemical text mining, focusing on five intricate tasks: compound entity recognition, reaction role labeling, metal-organic framework (MOF) synthesis information extraction, nuclear magnetic resonance (NMR) data extraction, and converting reaction paragraphs to action sequences. The fine-tuned LLMs, particularly GPT-3.5-turbo, demonstrated impressive performance, achieving exact-match accuracies of 69% to 95% with minimal annotated data and outperforming models that were task-adaptively pre-trained and fine-tuned on larger in-domain datasets. Notably, fine-tuned open models such as Mistral and Llama3 were also competitive. The study highlights fine-tuned LLMs as versatile, robust, and low-code toolkits for automated data acquisition in chemical knowledge extraction. This approach can transform chemical text mining by streamlining labor-intensive, time-consuming data collection workflows, accelerating the discovery and creation of novel substances.
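To make the fine-tuning setup concrete, the sketch below shows how annotated paragraph/label pairs for one such task (compound entity recognition) could be serialized into the chat-style JSONL format that OpenAI's fine-tuning endpoint accepts. This is a minimal illustration, not the authors' actual pipeline: the example paragraph, entity labels, and system prompt are hypothetical, and only the JSONL record structure follows the documented fine-tuning format.

import json

# Hypothetical annotated examples: each pairs a source paragraph
# with the expected structured output for entity extraction.
EXAMPLES = [
    {
        "paragraph": (
            "The mixture was treated with 2.0 equiv of "
            "sodium borohydride in methanol at 0 \u00b0C."
        ),
        "entities": ["sodium borohydride", "methanol"],
    },
]

# Illustrative task instruction (an assumption, not the paper's prompt).
SYSTEM_PROMPT = (
    "Extract every chemical compound mentioned in the paragraph "
    "and return them as a JSON list of strings."
)

def to_chat_record(example: dict) -> dict:
    """Wrap one annotated example in the chat-message JSONL record
    structure expected by OpenAI's fine-tuning endpoint."""
    return {
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": example["paragraph"]},
            {"role": "assistant", "content": json.dumps(example["entities"])},
        ]
    }

# Write one JSON record per line, ready for upload as training data.
with open("train.jsonl", "w", encoding="utf-8") as f:
    for ex in EXAMPLES:
        f.write(json.dumps(to_chat_record(ex)) + "\n")

The same record structure generalizes to the other four tasks by swapping in a task-specific instruction and target schema (e.g., a JSON object of MOF synthesis conditions instead of an entity list), which is what makes a single fine-tuning workflow reusable across extraction problems.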