Wikibench: Community-Driven Data Curation for AI Evaluation on Wikipedia

May 11–16, 2024 | Tzu-Sheng Kuo, Aaron Halfaker, Zirui Cheng, Jiwoo Kim, Meng-Hsin Wu, Kenneth Holstein, Tongshuang Wu, Haiyi Zhu
Wikibench is a system that enables community-driven data curation for AI evaluation on Wikipedia. It allows community members to collaboratively curate AI evaluation datasets, navigating ambiguities and differences in perspective through discussion. A field study on Wikipedia shows that datasets curated with Wikibench can effectively capture community consensus, disagreement, and uncertainty. Study participants also used Wikibench to shape the overall data curation process, including refining label definitions, determining data inclusion criteria, and authoring data statements. Based on these findings, we propose future directions for systems that support community-driven data curation.

Wikibench supports community members in selecting data points for inclusion in datasets, labeling them with "individual" labels based on their own initial judgments, and then discussing their perspectives to collectively decide on a "primary" label for each data point. Through discussion, participants may resolve disagreements or clarify ambiguities in labeling, leading to changes in their individual labels. The primary labels that community members settle on together form a consensus-based decision boundary, while the resulting datasets preserve information about disagreement among community members.

To capture consensus, disagreement, and uncertainty, Wikibench records two types of labels for each data point: individual labels, which reflect individual perspectives, and primary labels, which reflect the consensus view. The system also records each labeler's self-reported confidence in their individual label; in aggregate, these confidence indications provide a signal of the uncertainty associated with a data point.
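The labeling scheme above can be sketched as a simple data model. This is a hypothetical illustration, not Wikibench's actual implementation; the class and field names are invented:

```python
from dataclasses import dataclass, field
from statistics import mean
from typing import Optional

@dataclass
class IndividualLabel:
    labeler: str       # community member's username
    value: str         # e.g. "damaging" / "not damaging"
    confidence: float  # self-reported confidence, 0.0-1.0

@dataclass
class DataPoint:
    edit_id: int
    individual_labels: list = field(default_factory=list)
    primary_label: Optional[str] = None  # consensus label decided via discussion

    def agreement(self) -> float:
        """Fraction of individual labels that match the primary label,
        a simple way to surface preserved disagreement."""
        if not self.individual_labels or self.primary_label is None:
            return 0.0
        matches = sum(l.value == self.primary_label for l in self.individual_labels)
        return matches / len(self.individual_labels)

    def mean_confidence(self) -> float:
        """Aggregate self-reported confidence as an uncertainty signal."""
        return mean(l.confidence for l in self.individual_labels)
```

A data point with two "damaging" labels and one "not damaging" label, given a primary label of "damaging", would report an agreement of 2/3, illustrating how disagreement survives in the dataset even after a consensus label is chosen.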
Wikibench's design includes three user interfaces: a plug-in for selecting and labeling new data points; an entity page for labeling and discussing collected data points; and a campaign page for selecting collected data points for labeling and discussion, and for discussing the overall curation process. The plug-in allows Wikipedians to label edits while they are already in the midst of assessing them. The entity page publicly shows the labels of individual edits and facilitates discussion and (re-)labeling. The campaign page publicly shows the entire dataset and surfaces edits that could benefit from additional attention.

The current implementation of Wikibench is built on Wikipedia's infrastructure so that its user interfaces and norms are familiar to Wikipedians. The entity and campaign pages are themselves wiki pages, re-rendered on the front end using Wikipedia's user script feature; the plug-in is likewise a front-end element embedded in Wikipedia's existing diff page. Wikibench uses OOUI, Wikipedia's design system, for visual consistency with Wikipedia's existing interface, and labels are created and revised through the MediaWiki API. Importantly, Wikibench's campaign and entity pages are kept within an author's user sandbox to minimize disruption to the site.

A field study was conducted to observe how Wikipedians use Wikibench in the course of their regular activities on Wikipedia. Participants were asked to submit a minimum of 10 labels and to engage in at least 3 discussions per day using Wikibench.
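The text notes that label creation and revision go through the MediaWiki API, but does not specify the exact calls. As a minimal sketch, a label write using the standard MediaWiki Action API would be an `action=edit` request authorized by a CSRF token; the page title and label wikitext below are hypothetical, not Wikibench's actual schema:

```python
# Sketch of a label write via the MediaWiki Action API.
# The helpers only build request parameters; an authenticated HTTP
# session would send them (see the comment at the bottom).

API_ENDPOINT = "https://en.wikipedia.org/w/api.php"

def csrf_token_params() -> dict:
    """Query parameters for fetching the CSRF token that all write actions require."""
    return {"action": "query", "meta": "tokens", "format": "json"}

def label_edit_payload(page_title: str, label_wikitext: str, token: str) -> dict:
    """POST payload for `action=edit`, appending a label entry to a wiki page
    (e.g. an entity page kept in a user sandbox)."""
    return {
        "action": "edit",
        "title": page_title,
        "appendtext": "\n" + label_wikitext,
        "token": token,
        "format": "json",
    }

# With an authenticated `requests.Session`, the two steps would be roughly:
#   token = session.get(API_ENDPOINT, params=csrf_token_params()).json() \
#               ["query"]["tokens"]["csrftoken"]
#   session.post(API_ENDPOINT, data=label_edit_payload(page, text, token))
```

Building edits on the standard API like this is what lets Wikibench's labels live as ordinary wiki content, with the site's existing history, watchlist, and talk-page norms applying to them.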