1 Feb 2024 | Wenxin Jiang, Jerin Yasmin, Jason Jones, Nicholas Synovic, Jiashen Kuo, Nathaniel Bielanski, Yuan Tian, George K. Thiruvathukal, James C. Davis
The paper introduces *PeaTMOSS*, a comprehensive dataset that documents the metadata and usage of pre-trained models (PTMs) in open-source software. The dataset includes 281,638 PTMs, detailed snapshots of the 14,296 PTMs with over 50 monthly downloads, and 28,575 GitHub repositories that use these models. It also contains 44,337 mappings from 15,129 downstream GitHub repositories to the 2,530 PTMs they use. To make the dataset more complete, the authors used a large language model (LLM) to automatically extract model metadata, including training datasets, parameters, and evaluation metrics.
The analysis of *PeaTMOSS* provides the first summary statistics on the PTM supply chain, revealing trends in PTM development and common shortcomings in package documentation. The dataset also highlights inconsistencies in software licenses across PTMs and their dependent projects. *PeaTMOSS* lays the foundation for future research on PTM supply chain dynamics, offering rich opportunities for investigating PTM reuse and downstream applications. The paper outlines mining opportunities and potential directions for future work, emphasizing the importance of structured metadata in understanding and improving the PTM ecosystem.