PeaTMOSS: A Dataset and Initial Analysis of Pre-Trained Models in Open-Source Software

April 2024 | Wenxin Jiang, Jerin Yasmin, Jason Jones, Nicholas Synovic, Jiashen Kuo, Nathaniel Bielanski, Yuan Tian, George K. Thiruvathukal, James C. Davis
This paper presents the PeaTMOSS dataset, which includes metadata for 281,638 pre-trained models (PTMs) and detailed snapshots for all PTMs with over 50 monthly downloads (14,296 PTMs), along with 28,575 open-source software repositories from GitHub that utilize these models. Additionally, the dataset includes 44,337 mappings from 15,129 downstream GitHub repositories to the 2,530 PTMs they use. To enhance the dataset's comprehensiveness, we developed prompts for a large language model to automatically extract model metadata, including the model's training datasets, parameters, and evaluation metrics. Our analysis of this dataset provides the first summary statistics for the PTM supply chain, showing the trend of PTM development and common shortcomings of PTM package documentation. Our example application reveals inconsistencies in software licenses across PTMs and their dependent projects. PeaTMOSS lays the foundation for future research, offering rich opportunities to investigate the PTM supply chain. We outline mining opportunities on PTMs, their downstream usage, and cross-cutting questions.

The PeaTMOSS dataset includes 281,638 PTM packages and 28,575 downstream GitHub repositories. We tackled the issue of unstructured attributes by developing an LLM-based tool for metadata extraction, which enhances our dataset by adding structured data in JSON format. We provide the first summary statistics of this PTM supply chain, encompassing distributions of PTMs and their downstream repositories across various problem domains. Our analysis also includes trends in model size and the quantity of PTM packages, along with an overview of the proportion of available metadata. We show the proportion of missing data in each PTM metadata category.

We applied our dataset to assess the license compatibility between PTMs and the downstream GitHub repositories that use them. Our findings reveal that 0.24% of these licenses are inconsistent, potentially causing community confusion and hindering collaboration.

PeaTMOSS is a comprehensive dataset of PTMs in open-source software. It offers an extensive mapping between PTM packages and downstream GitHub repositories, along with a large amount of queryable metadata. Using PeaTMOSS, researchers can study the PTM supply chain and the reuse modes of PTM packages. Engineering tools can be developed for PTM reuse, e.g., for model search and comparison.

The PeaTMOSS dataset is available at https://github.com/PurdueDualityLab/PeaTMOSS-Artifact and at https://transfer.rcac.purdue.edu/file-manager?origin_id=ff978999-16c2-4b50-ac7a-947ffc3eb1d&origin_path=%2F
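
As a rough illustration of the license check described above, the sketch below compares the license of each PTM with the license of each downstream repository that uses it and reports the share of flagged mappings. Everything in it is an assumption made for illustration: the `Mapping` record, its field names, the sample values, and the naive rule that any difference between the two license identifiers counts as an inconsistency. It does not reflect the actual PeaTMOSS schema or the compatibility criteria used in the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mapping:
    """One hypothetical PTM-to-repository mapping (field names are assumptions)."""
    ptm: str
    ptm_license: str
    repo: str
    repo_license: str


def inconsistent_mappings(mappings):
    """Return mappings whose PTM and repository license identifiers differ.

    Naive stand-in rule: any case-insensitive difference is flagged.
    A real analysis would need an actual license compatibility model.
    """
    return [m for m in mappings if m.ptm_license.lower() != m.repo_license.lower()]


def inconsistency_rate(mappings):
    """Fraction of mappings flagged as inconsistent (0.0 for empty input)."""
    if not mappings:
        return 0.0
    return len(inconsistent_mappings(mappings)) / len(mappings)


# Made-up sample data, not taken from the dataset.
SAMPLE = [
    Mapping("org/model-a", "apache-2.0", "user/app-1", "Apache-2.0"),
    Mapping("org/model-a", "apache-2.0", "user/app-2", "MIT"),
    Mapping("org/model-b", "mit", "user/app-3", "MIT"),
    Mapping("org/model-c", "cc-by-nc-4.0", "user/app-4", "MIT"),
]

if __name__ == "__main__":
    for m in inconsistent_mappings(SAMPLE):
        print(f"{m.repo} ({m.repo_license}) uses {m.ptm} ({m.ptm_license})")
    print(f"rate: {inconsistency_rate(SAMPLE):.2%}")
```

Comparing raw identifiers keeps the example short, but it overcounts: two different licenses can still be compatible, so a stricter study would replace the equality test with a lookup into a compatibility table.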