If in a Crowdsourced Data Annotation Pipeline, a GPT-4

May 11–16, 2024 | Zeyu He, Chieh-Yang Huang, Chien-Kuang Cornelia Ding, Shaurya Rohatgi, Ting-Hao 'Kenneth' Huang
This paper compares the data-labeling accuracy of GPT-4 with that of a well-executed, ethical MTurk pipeline. In the study, 415 MTurk workers labeled 3,177 sentence segments from 200 scholarly articles using the CODA-19 scheme. Two worker interfaces yielded 127,080 labels, from which final labels were inferred using eight label-aggregation algorithms. Despite following best crowdsourcing practices, the MTurk pipeline's highest accuracy was 81.5%, whereas GPT-4 reached 83.6%. Notably, when GPT-4's labels were combined with crowd labels collected via the advanced worker interface and aggregated together, two of the eight algorithms achieved even higher accuracy (87.5% and 87.0%) than GPT-4 alone.
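The combination step described above can be pictured as treating GPT-4 as one additional annotator whose single label per segment is pooled with the crowd labels before aggregation. The following is a minimal Python sketch of that idea using plain majority voting; the data structures, tie-breaking rule, and function names are illustrative assumptions, not the paper's implementation.

```python
from collections import Counter

def majority_vote(labels):
    """Return the most frequent label; ties are broken arbitrarily here (illustrative)."""
    return Counter(labels).most_common(1)[0][0]

def aggregate_with_gpt4(crowd_labels, gpt4_labels):
    """
    crowd_labels: dict mapping segment_id -> list of labels from MTurk workers
    gpt4_labels:  dict mapping segment_id -> a single GPT-4 label
    Pools GPT-4's label with the crowd labels, then majority-votes per segment.
    """
    final = {}
    for seg_id, labels in crowd_labels.items():
        pooled = list(labels)
        if seg_id in gpt4_labels:
            pooled.append(gpt4_labels[seg_id])   # GPT-4 acts as one extra annotator
        final[seg_id] = majority_vote(pooled)
    return final

# Toy example with CODA-19-style classes (hypothetical data)
crowd = {"s1": ["finding", "method", "finding"], "s2": ["background", "background", "purpose"]}
gpt4  = {"s1": "finding", "s2": "background"}
print(aggregate_with_gpt4(crowd, gpt4))  # {'s1': 'finding', 's2': 'background'}
```

More sophisticated aggregators replace the majority_vote step with models that weight annotators by estimated reliability, as in the Dawid-Skene family discussed below.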
Further analysis suggested that when the crowd's and GPT-4's labeling strengths are complementary, aggregating them can increase labeling accuracy. The study therefore highlights the value of crowdsourced labels even in scenarios where GPT-4 generally outperforms crowd workers, and it sheds light on the evolving role and best practices of crowdsourcing in the era of Large Language Models (LLMs).

Final labels were inferred with a range of aggregation techniques, including Majority Voting and Dawid-Skene, and the resulting accuracies were compared against GPT-4's. The study also examined several label-cleaning strategies and found Exclude-By-Worker to be the most effective. While aggregating more crowd labels improved accuracy, even aggregating 20 crowd labels per segment could not surpass GPT-4 on its own; however, combining GPT-4 with crowd labels collected via the Advanced Interface could exceed GPT-4's solo performance, suggesting that even a handful of crowd labels can be beneficial.

T-test analyses compared the pure GPT-4 setting against various alternatives spanning different aggregation methods, cleaning strategies, and worker interfaces. Integrating GPT-4 with the Advanced Interface, particularly under the One-Coin Dawid-Skene and MACE methods, significantly outperformed the pure GPT-4 setting. The study also analyzed cases where crowd labels enhanced GPT-4's label accuracy and found that the "Finding" class
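One-Coin Dawid-Skene, one of the aggregation methods named above, gives each annotator a single accuracy parameter and alternates between estimating the true labels and re-estimating those accuracies. The sketch below is a simplified EM loop under that one-coin assumption; the uniform class prior, initialization, and iteration count are illustrative choices rather than the authors' configuration.

```python
from collections import defaultdict

def one_coin_dawid_skene(labels, classes, iters=20):
    """
    labels: list of (item_id, worker_id, label) tuples
    classes: list of possible labels (e.g., the CODA-19 classes)
    Returns dict item_id -> estimated label.
    One-coin model: each worker is described by a single accuracy parameter.
    """
    K = len(classes)
    workers = {w for _, w, _ in labels}
    by_item = defaultdict(list)
    for item, worker, lab in labels:
        by_item[item].append((worker, lab))

    acc = {w: 0.8 for w in workers}          # initial worker accuracies (illustrative)
    posteriors = {}

    for _ in range(iters):
        # E-step: posterior over the true label of each item
        for item, votes in by_item.items():
            scores = {}
            for c in classes:
                p = 1.0 / K                  # uniform class prior (assumption)
                for worker, lab in votes:
                    p *= acc[worker] if lab == c else (1 - acc[worker]) / (K - 1)
                scores[c] = p
            z = sum(scores.values()) or 1.0
            posteriors[item] = {c: s / z for c, s in scores.items()}

        # M-step: each worker's accuracy = expected fraction of correct labels
        correct, total = defaultdict(float), defaultdict(float)
        for item, votes in by_item.items():
            for worker, lab in votes:
                correct[worker] += posteriors[item].get(lab, 0.0)
                total[worker] += 1.0
        acc = {w: min(1 - 1e-3, max(1e-3, correct[w] / total[w])) for w in workers}

    return {item: max(post, key=post.get) for item, post in posteriors.items()}
```

Including GPT-4 in this scheme amounts to adding its labels under one more worker_id, which is one way to read the combined GPT-4-plus-crowd settings reported above.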
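The summary names Exclude-By-Worker as the best label-cleaning strategy. A straightforward reading, sketched below, is that all labels from workers judged unreliable (for example, by their accuracy on embedded check questions) are dropped before aggregation; the accuracy source and threshold here are assumptions for illustration.

```python
def exclude_by_worker(labels, worker_accuracy, threshold=0.5):
    """
    labels: list of (item_id, worker_id, label) tuples
    worker_accuracy: dict worker_id -> accuracy on check/screening items (assumed available)
    threshold: illustrative cutoff below which a worker's labels are discarded
    Drops every label contributed by workers below the threshold.
    """
    keep = {w for w, a in worker_accuracy.items() if a >= threshold}
    return [(item, w, lab) for item, w, lab in labels if w in keep]
```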
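The t-test comparisons between the pure GPT-4 setting and the combined settings can likewise be illustrated with a paired test over repeated accuracy measurements, such as per-article accuracies; the numbers below are placeholders, not the study's results.

```python
from scipy.stats import ttest_rel

# Hypothetical per-run accuracies for two settings (placeholder values, not the paper's data)
gpt4_only       = [0.83, 0.84, 0.82, 0.85, 0.83]
gpt4_plus_crowd = [0.86, 0.88, 0.87, 0.87, 0.88]

t_stat, p_value = ttest_rel(gpt4_plus_crowd, gpt4_only)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")  # a small p-value suggests a reliable difference
```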