This paper compares the data-labeling accuracy of GPT-4 with that of an ethical, well-executed Amazon Mechanical Turk (MTurk) pipeline. In the study, 415 MTurk workers labeled 3,177 sentence segments from 200 scholarly articles using the CODA-19 scheme. Two worker interfaces were used, yielding 127,080 labels in total, which were then combined using eight label-aggregation algorithms. The best-performing MTurk configuration reached 81.5% accuracy, while GPT-4 alone reached 83.6%. Notably, when GPT-4's labels were aggregated with crowd labels collected via an advanced worker interface, two of the eight algorithms achieved even higher accuracy (87.5% and 87.0%). Further analysis suggests that when the crowd's and GPT-4's labeling strengths are complementary, aggregating them can increase labeling accuracy. The study highlights the importance of holistic crowdsourcing practices and the potential benefits of integrating GPT-4 into MTurk pipelines to enhance labeling accuracy.
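The abstract does not enumerate the eight aggregation algorithms, so the following is only a minimal sketch assuming simple majority voting, a standard baseline among such algorithms, with GPT-4's label folded in as one additional vote per segment. The function and variable names are hypothetical, not from the paper; the five category names reflect the published CODA-19 scheme.

```python
from collections import Counter

# Five CODA-19 categories (Background, Purpose, Method,
# Finding/Contribution, Other), per the published scheme.
CODA19_LABELS = ["background", "purpose", "method", "finding", "other"]

def aggregate_segment(crowd_labels, gpt4_label=None):
    """Majority-vote aggregation for one sentence segment (illustrative
    baseline only; not necessarily one of the paper's eight algorithms).

    crowd_labels: labels from individual MTurk workers for this segment.
    gpt4_label:   optional GPT-4 label, counted as one extra vote, which
                  approximates combining GPT-4 with the crowd under
                  majority voting.
    """
    votes = Counter(crowd_labels)
    if gpt4_label is not None:
        votes[gpt4_label] += 1
    # Counter.most_common breaks ties by insertion order; a real
    # pipeline would need an explicit tie-breaking rule.
    return votes.most_common(1)[0][0]

# Hypothetical usage: three worker labels plus GPT-4 on one segment.
print(aggregate_segment(["method", "finding", "method"], gpt4_label="method"))
# -> "method"
```

More sophisticated aggregators (e.g., Dawid-Skene-style worker-quality models) weight each annotator by estimated reliability rather than counting all votes equally, which is one plausible reason some algorithms benefit more than others from adding GPT-4's labels.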