24 Jan 2024 | KATY ILONKA GERO, CHELSE SWOOPES, ZIWEI GU, JONATHAN K. KUMMERFELD, ELENA L. GLASSMAN
The paper "Supporting Sensemaking of Large Language Model Outputs at Scale" by Katy Ilonka Gero, Chelse Swoopes, Ziwei Gu, Jonathan K. Kummerfeld, and Elena L. Glassman explores methods to help users effectively manage and understand multiple outputs from large language models (LLMs). The authors identify a need for tools that can support users in making sense of the vast number of responses generated by LLMs, which is often challenging due to the complexity and variety of these outputs. They design and evaluate five features, including both existing and novel methods for text analysis and rendering, to enhance the user experience.
The features include:
1. **Exact Matches**: Highlighting common substrings across responses.
2. **Unique Words**: Highlighting words that are unique to each response (both highlighting features are sketched in code after this list).
3. **Positional Diction Clustering (PDC)**: Grouping sentences from different responses that are similar in content and appear at similar positions within their responses (a rough grouping sketch also follows the list).
4. **Grid Layout**: Presenting responses in a grid format, allowing users to compare multiple responses side by side.
5. **Interleaved Layout**: Grouping similar sentences and interleaving them, making it easier to identify and remix elements from different responses.
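To make the two highlighting features concrete, here is a minimal illustrative sketch (not the paper's implementation): it approximates Exact Matches as word n-grams shared by every response, and Unique Words as words that occur in only one response. The function names are hypothetical.

```python
# Illustrative sketch only: approximates the Exact Matches and Unique Words
# features with word n-grams and vocabulary counts (not the paper's code).
from collections import Counter

def ngrams(words, n):
    """All word n-grams in a token list, as a set of tuples."""
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def exact_matches(responses, n=3):
    """Word n-grams that occur in every response (candidate common substrings)."""
    token_lists = [r.lower().split() for r in responses]
    shared = ngrams(token_lists[0], n)
    for words in token_lists[1:]:
        shared &= ngrams(words, n)
    return {" ".join(g) for g in shared}

def unique_words(responses):
    """For each response, the words that appear in no other response."""
    vocabs = [set(r.lower().split()) for r in responses]
    counts = Counter(w for v in vocabs for w in v)
    return [{w for w in v if counts[w] == 1} for v in vocabs]

responses = [
    "Thanks for your note. I will send the report by Friday.",
    "Thanks for your note. I can share the draft report tomorrow.",
    "Thanks for your note. I will forward the figures by Friday.",
]
print(exact_matches(responses))  # word 3-grams shared by all three responses
print(unique_words(responses))   # words specific to each individual response
```

A real interface would map these spans back to character offsets and highlight them in place; the point here is only the underlying text analysis.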
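The grouping behind PDC and the interleaved layout can be approximated, very roughly, by combining lexical similarity with relative sentence position. The sketch below is a hypothetical stand-in, not the paper's Positional Diction Clustering algorithm: it greedily merges sentences whose relative positions are close and whose word overlap (Jaccard similarity) passes a threshold.

```python
# Rough, hypothetical sketch of position-aware sentence grouping.
# This is NOT the paper's Positional Diction Clustering; it only illustrates
# grouping by content similarity plus position within each response.
import re

def split_sentences(text):
    # Naive splitter on terminal punctuation followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def jaccard(a, b):
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0

def group_sentences(responses, sim=0.5, max_pos_gap=0.25):
    """Greedily group sentences that are lexically similar and sit at similar
    relative positions within their respective responses."""
    items = []
    for r_idx, resp in enumerate(responses):
        sents = split_sentences(resp)
        for s_idx, sent in enumerate(sents):
            rel_pos = s_idx / max(len(sents) - 1, 1)  # 0.0 = first, 1.0 = last
            items.append((r_idx, rel_pos, sent))

    groups = []  # each group: {"pos": anchor position, "members": [(resp_idx, sentence)]}
    for r_idx, pos, sent in items:
        for group in groups:
            anchor = group["members"][0][1]
            if abs(pos - group["pos"]) <= max_pos_gap and jaccard(sent, anchor) >= sim:
                group["members"].append((r_idx, sent))
                break
        else:
            groups.append({"pos": pos, "members": [(r_idx, sent)]})
    return groups
```

An interleaved rendering would then display each group's members together, so variants of roughly the same sentence from different responses appear side by side for comparison and remixing.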
A controlled user study (n=24) and eight case studies were conducted to evaluate these features. The results show that the features significantly support various sensemaking tasks, such as email rewriting, model comparison, and identifying differences between responses. Participants found the grid layout and highlighting features particularly helpful, while some found the exploratory interface overwhelming due to the amount of information presented at once.
The paper also discusses the theoretical foundations of the features, drawing on Variation Theory and Analogical Learning Theory to explain how these features help users form mental models of LLM behavior. The authors provide design guidelines for future work on LLM response inspectors, emphasizing the importance of scaling up human inspection and supporting a wide range of tasks.