22 Feb 2024 | Dimitri von Rütte, Sotiris Anagnostidis, Gregor Bachmann, Thomas Hofmann
This paper explores concept guidance in large language models (LLMs): controlling a model's behavior by manipulating its hidden representations. While previous work has concentrated primarily on *truthfulness*, this paper extends the framework to other concepts such as *appropriateness*, *humor*, *creativity*, and *quality*. The authors develop a novel metric, *perplexity-normalized effect size* (PNES), to evaluate both the success of concept elicitation and the potential degradation in fluency of the guided model. Extensive experiments reveal that while some concepts like *truthfulness* are robustly guidable, others like *appropriateness* remain difficult to elicit or require extensive tuning. Surprisingly, probes with optimal detection accuracies do not necessarily make for the best guides, contradicting previous findings for *truthfulness*. The study highlights the need for a deeper investigation into the interplay between detectability, guidability, and the nature of the concept, and provides a rich experimental framework for future research in concept guidance.
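
To make the idea of guiding a model through its hidden representations concrete, the following is a minimal sketch and not the authors' implementation. It assumes a difference-of-means linear probe and additive guidance with a strength parameter `alpha`; the function names (`concept_direction`, `probe_score`, `guide`) and the toy two-dimensional vectors are hypothetical and chosen only for illustration. The summary above does not specify the probe types or the PNES formula, so neither is reproduced here.

```python
from typing import List

Vector = List[float]


def mean_vector(rows: List[Vector]) -> Vector:
    """Column-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def concept_direction(pos: List[Vector], neg: List[Vector]) -> Vector:
    """Unit-length difference-of-means direction (an assumed probe type).

    `pos` holds hidden states from examples that express the concept,
    `neg` holds hidden states from examples that do not.
    """
    mu_pos, mu_neg = mean_vector(pos), mean_vector(neg)
    diff = [p - q for p, q in zip(mu_pos, mu_neg)]
    norm = sum(d * d for d in diff) ** 0.5
    if norm == 0.0:
        raise ValueError("class means coincide; no direction to guide along")
    return [d / norm for d in diff]


def probe_score(hidden: Vector, direction: Vector) -> float:
    """Detection: project a hidden state onto the concept direction."""
    return sum(h * d for h, d in zip(hidden, direction))


def guide(hidden: Vector, direction: Vector, alpha: float) -> Vector:
    """Guidance: shift a hidden state along the concept direction.

    `alpha` is an assumed guidance strength; larger values push harder
    toward the concept and may cost fluency in a real model.
    """
    return [h + alpha * d for h, d in zip(hidden, direction)]


# Toy usage with made-up 2-d "hidden states".
positive = [[2.0, 0.0], [4.0, 0.0]]
negative = [[0.0, 0.0], [0.0, 0.0]]
direction = concept_direction(positive, negative)
steered = guide([1.0, 1.0], direction, alpha=2.0)
```

Because the direction is normalized, the probe score of the steered state rises by exactly `alpha` in this sketch. In a real model the shift would be applied to intermediate activations during generation, and a probe that separates the two classes well need not be the direction that guides best, which is the gap between detectability and guidability that the paper reports.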