LLM as Dataset Analyst: Subpopulation Structure Discovery with Large Language Model


24 Jul 2024 | Yulin Luo, Ruichuan An, Bocheng Zou, Yiming Tang, Jiaming Liu, and Shanghang Zhang
This paper introduces Subpopulation Structure Discovery with Large Language Models (SSD-LLM), a framework that automatically uncovers subpopulation structures within image datasets. It leverages the world knowledge and instruction-following capabilities of Large Language Models (LLMs) to linguistically analyze informative image captions and summarize subpopulation structures. SSD-LLM follows a two-step process: a Multimodal Large Language Model (MLLM) first generates informative captions for the images, and an LLM then analyzes those captions to summarize the dataset's subpopulation structure.

The framework relies on two carefully designed prompt-engineering components. Criteria Initialization uses a generate-and-select paradigm to summarize dimensions and their attributes sequentially, while Criteria Refinement uses self-consistency as an indicator to evaluate and refine the criteria. Once the criteria are complete, each image is assigned to the corresponding attributes according to its caption. The resulting subpopulation structures support various downstream tasks, including dataset subpopulation organization, subpopulation shift, and slice discovery.

On the subpopulation shift benchmarks Waterbirds, MetaShift, and NICO++, the framework improves worst-group accuracy by +3.3% across the three datasets compared to state-of-the-art methods, and on the slice discovery task for ImageNet it identifies more consistent slice topics with a higher model error rate of 3.95%. The code will be available at https://llm-as-dataset-analyst.github.io/. The paper also reviews related work on the hierarchical structure of image datasets, extracting information from image captions, and LLM prompt engineering.
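The caption-then-summarize pipeline can be sketched in code. This is a minimal illustration only, not the paper's implementation: the function names, the two hard-coded dimensions, and the string-matching "LLM" stand-ins are all hypothetical placeholders for real MLLM/LLM calls.

```python
# Hypothetical sketch of the SSD-LLM workflow: caption images, initialize
# criteria (dimensions), refine them, then assign each image to attributes.

def caption_image(image):
    # Placeholder for an MLLM captioning call.
    return f"a photo of a {image['subject']} on a {image['background']}"

def initialize_criteria(captions):
    # Generate-and-select paradigm: an LLM would propose candidate
    # dimensions from the captions and select the informative ones.
    # Two dimensions are hard-coded here for illustration.
    return {"subject": set(), "background": set()}

def refine_criteria(criteria, captions):
    # Self-consistency stand-in: keep attribute values that actually
    # occur in the captions (a real system would re-query the LLM).
    for cap in captions:
        words = cap.split()
        criteria["subject"].add(words[4])
        criteria["background"].add(words[-1])
    return criteria

def assign_attributes(caption, criteria):
    # Assign the image (via its caption) to one attribute per dimension.
    return {dim: next(a for a in attrs if a in caption)
            for dim, attrs in criteria.items()}

images = [{"subject": "sparrow", "background": "lake"},
          {"subject": "duck", "background": "forest"}]
captions = [caption_image(img) for img in images]
criteria = refine_criteria(initialize_criteria(captions), captions)
groups = [assign_attributes(cap, criteria) for cap in captions]
```

Here `groups` maps each image to its subpopulation cell (e.g., subject × background), mirroring how the discovered structure partitions the dataset.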
The method is evaluated on three tasks: Dataset Subpopulation Organization, Subpopulation Shift, and Slice Discovery, demonstrating its effectiveness in identifying and analyzing subgroups and its utility for these related challenges. Across these tasks, SSD-LLM outperforms previous methods in accuracy and error rate. The paper concludes that SSD-LLM provides a systematic exploration of subpopulation structure discovery and has the potential to guide dataset construction toward better fairness and to support the construction of unbiased datasets.
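For context on the headline metric, worst-group accuracy is the standard subpopulation-shift measure: per-subgroup accuracy is computed and the minimum over groups is reported. A small self-contained sketch (the data below is illustrative, not from the paper):

```python
# Worst-group accuracy: accuracy within each subgroup, then the minimum.
from collections import defaultdict

def worst_group_accuracy(preds, labels, group_ids):
    correct = defaultdict(int)
    total = defaultdict(int)
    for p, y, g in zip(preds, labels, group_ids):
        total[g] += 1
        correct[g] += int(p == y)
    return min(correct[g] / total[g] for g in total)

preds     = [1, 1, 0, 0, 1, 0]
labels    = [1, 0, 0, 0, 1, 1]
group_ids = ["a", "a", "a", "b", "b", "b"]
# each group is 2/3 correct, so the worst-group accuracy is 2/3
wga = worst_group_accuracy(preds, labels, group_ids)
```

Reporting the minimum rather than the mean is what makes the metric sensitive to underperforming subpopulations, which is exactly what discovered subgroup structure is meant to expose.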