24 Apr 2024 | Jacob Pfau, William Merrill & Samuel R. Bowman
The paper "Let’s Think Dot by Dot: Hidden Computation in Transformer Language Models" by Jacob Pfau, William Merrill, and Samuel R. Bowman explores the role of filler tokens in language models (LMs) and their impact on performance. The authors investigate whether the improved performance observed in chain-of-thought (CoT) responses can be attributed to human-like task decomposition or simply the additional computation provided by intermediate tokens. They find that transformers can solve complex algorithmic tasks using meaningless filler tokens, such as repeated dots ('...'), even when no CoT is provided. However, learning to use filler tokens is challenging and requires dense supervision. The study also provides a theoretical characterization of problems where filler tokens are useful, based on the quantifier depth of first-order formulas. The results show that additional tokens can provide computational benefits independent of token choice, raising concerns about large language models engaging in hidden computations that are not transparent to users. The paper contributes to understanding the expressive power of transformers and the conditions under which they can benefit from filler tokens.The paper "Let’s Think Dot by Dot: Hidden Computation in Transformer Language Models" by Jacob Pfau, William Merrill, and Samuel R. Bowman explores the role of filler tokens in language models (LMs) and their impact on performance. The authors investigate whether the improved performance observed in chain-of-thought (CoT) responses can be attributed to human-like task decomposition or simply the additional computation provided by intermediate tokens. They find that transformers can solve complex algorithmic tasks using meaningless filler tokens, such as repeated dots ('...'), even when no CoT is provided. However, learning to use filler tokens is challenging and requires dense supervision. The study also provides a theoretical characterization of problems where filler tokens are useful, based on the quantifier depth of first-order formulas. The results show that additional tokens can provide computational benefits independent of token choice, raising concerns about large language models engaging in hidden computations that are not transparent to users. The paper contributes to understanding the expressive power of transformers and the conditions under which they can benefit from filler tokens.