This paper presents a comprehensive analysis of dataset documentation on Hugging Face, a prominent platform for sharing and collaborating on ML models and datasets. The study examines 7,433 dataset cards to understand the current practices and challenges in dataset documentation. Key findings include:
1. **Heterogeneity in Dataset Card Completion**: The completion rate of dataset cards varies significantly based on dataset popularity. While 86.0% of the top 100 downloaded datasets have all sections filled out, only 7.9% of less popular datasets do so.
2. **Section Prioritization**: Practitioners prioritize the *Dataset Description* and *Dataset Structure* sections, which account for 36.2% and 33.6% of the total card length, respectively. In contrast, the *Considerations for Using the Data* section receives the least attention, with only 2.1% of the content.
3. **Content Dynamics**: The *Considerations for Using the Data* section, though often overlooked, covers important topics such as social impact, biases, and limitations. Topic modeling reveals that this section discusses technical and social aspects of dataset limitations and impact.
4. **Importance of Usage Sections**: The inclusion of a *Usage* section significantly impacts dataset popularity, with a counterfactual analysis showing a 1.85% decrease in downloads when this section is removed.
5. **Human Evaluation**: Human annotations emphasize the importance of comprehensive dataset content in shaping perceptions of dataset card quality. Content comprehensiveness is strongly correlated with overall quality, highlighting the need for detailed documentation.
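The section-prioritization finding rests on measuring what fraction of each card's length falls under each top-level section. A minimal sketch of that kind of measurement, assuming cards are markdown files with `##` section headers (the helper name `section_shares` and the toy card below are illustrative, not from the paper):

```python
import re

def section_shares(card_text: str) -> dict:
    """Return {section title: fraction of card length} for a markdown card.

    Assumes top-level sections are marked with '## ' headers; text before
    the first header is ignored.
    """
    sections = {}
    current = None
    for line in card_text.splitlines():
        m = re.match(r"^##\s+(.*)", line)
        if m:
            current = m.group(1).strip()
            sections[current] = 0
        elif current is not None:
            sections[current] += len(line)
    total = sum(sections.values()) or 1  # avoid division by zero on empty cards
    return {title: length / total for title, length in sections.items()}

# Toy card mirroring the standard dataset-card section names.
card = """## Dataset Description
A corpus of annotated examples collected from public sources.

## Dataset Structure
Each record has `text` and `label` fields.

## Considerations for Using the Data
May contain bias.
"""
shares = section_shares(card)
```

Aggregating such per-card shares across many cards is what makes the 36.2% / 33.6% / 2.1% comparison possible.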
The study underscores the need for more thorough and comprehensive dataset documentation to enhance transparency, reproducibility, and accessibility in machine learning research.