This paper investigates the internal representations and world models formed by language models when trained on real-world data, specifically chess games. Building on previous work by Li et al. (2023a), which trained a language model on synthetic Othello games, the author extends this approach to chess, a more complex domain. The study uses linear probes and contrastive activation techniques to examine the model's internal representations of the board state and of latent variables such as player skill.
Key findings include:
1. **Internal Board State Representation**: The model learns to track the board state: linear probes trained on its activations classify the piece occupying each square with high accuracy (a minimal probe sketch follows this list).
2. **Latent Variable Estimation**: The model also estimates the skill level of the players in a game, and conditioning it on higher skill improves its play: adding a "player skill" vector to its activations significantly increases its win rate against Stockfish, a strong open-source chess engine (see the activation-addition sketch after this list).
3. **Model Interventions**: The author performs causal interventions on the model's activations, showing that editing the internal board state and skill representations changes the model's move predictions and playing strength accordingly.
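The following is a minimal sketch of the linear-probe idea from finding 1. It assumes residual-stream activations have already been extracted from the chess model; the hidden size, the label construction, and the random placeholder data are illustrative assumptions, not the paper's actual pipeline.

```python
# Linear probe for board-state recovery from model activations (sketch).
import torch
import torch.nn as nn

D_MODEL = 512          # assumed hidden size of the chess GPT
N_SQUARES = 64         # one prediction per board square
N_PIECE_CLASSES = 13   # 6 white pieces, 6 black pieces, empty

# Placeholder activations: one residual-stream vector per move position.
# In practice these come from forward passes over PGN move sequences.
activations = torch.randn(1024, D_MODEL)
# Placeholder labels: the piece class occupying each square after each move.
labels = torch.randint(0, N_PIECE_CLASSES, (1024, N_SQUARES))

# A linear probe is a single affine map from activations to per-square piece
# logits; with no nonlinearity, high accuracy implies the board state is
# (approximately) linearly encoded in the activations.
probe = nn.Linear(D_MODEL, N_SQUARES * N_PIECE_CLASSES)
optimizer = torch.optim.Adam(probe.parameters(), lr=1e-3)

for _ in range(100):
    logits = probe(activations).view(-1, N_SQUARES, N_PIECE_CLASSES)
    loss = nn.functional.cross_entropy(
        logits.reshape(-1, N_PIECE_CLASSES), labels.reshape(-1)
    )
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Probe accuracy: fraction of squares whose piece class is predicted correctly.
with torch.no_grad():
    preds = probe(activations).view(-1, N_SQUARES, N_PIECE_CLASSES).argmax(-1)
    accuracy = (preds == labels).float().mean().item()
```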
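And here is a minimal sketch of the skill-vector intervention from findings 2 and 3, in the spirit of contrastive activation addition: average activations from high-skill and low-skill games, take the difference, and add it back during generation. The layer choice, scale, and hook mechanics are assumptions for illustration, not the paper's exact method or API.

```python
# Contrastive "skill vector" derivation and activation-addition hook (sketch).
import torch

def skill_vector(high_skill_acts: torch.Tensor,
                 low_skill_acts: torch.Tensor) -> torch.Tensor:
    """Difference of mean activations at a chosen layer between games played
    by strong and weak players; returns a (d_model,) steering direction."""
    return high_skill_acts.mean(dim=0) - low_skill_acts.mean(dim=0)

def add_skill_hook(vector: torch.Tensor, scale: float = 1.0):
    """Forward hook that adds the scaled skill vector to a layer's output,
    nudging the model toward its 'strong player' behavior. Assumes the hooked
    module returns a plain tensor; blocks returning tuples need unpacking."""
    def hook(module, inputs, output):
        return output + scale * vector
    return hook

# Usage sketch (model structure and layer index are hypothetical):
# vec = skill_vector(high_acts, low_acts)
# handle = model.transformer.h[6].register_forward_hook(add_skill_hook(vec, 2.0))
# ... sample moves and evaluate against Stockfish ...
# handle.remove()
```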
The study provides valuable insights into how language models can form world models and extract high-level concepts from complex data, even in a constrained setting like chess. The findings suggest that language models can develop sophisticated internal representations, and that targeted interventions on those representations are possible, with potential applications in areas such as interpretability and model improvement.