Prover-Verifier Games Improve Legibility of LLM Outputs

1 Aug 2024 | Jan Hendrik Kirchner, Yining Chen, Harri Edwards, Jan Leike, Nat McAleese, Yuri Burda
Prover-verifier games improve the legibility of large language model (LLM) outputs. The study trains LLMs with a prover-verifier game in which a verifier checks the correctness of solutions generated by a prover, with the goal of producing solutions that are not only correct but also easy for humans to check and verify. The authors show that optimizing for correctness alone can reduce legibility, while training with the prover-verifier game improves both.

In the game, a verifier is trained to distinguish correct from incorrect solutions, a helpful prover is trained to generate correct solutions that the verifier accepts, and a sneaky prover is trained to generate incorrect solutions that fool the verifier. Over the course of training, the helpful prover's accuracy and the verifier's robustness to adversarial attacks both increase. The legibility training also transfers to humans: human accuracy rises when checking the helpful prover's solutions and falls when checking the sneaky prover's solutions.

These results suggest that legibility training against small verifiers is a practical avenue for making large LLMs more legible to humans, which matters for alignment with superhuman models, and that the prover-verifier game is a promising candidate for a scalable oversight method.
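As a rough illustration of the training loop described above, here is a minimal Python sketch of one checkability-training round. The `Verifier`, `Prover`, `grade`, and `prover_reward` names are hypothetical stand-ins introduced for this example: the actual study trains LLM provers and verifiers with reinforcement learning on grade-school math problems, whereas this toy version only mimics the role-conditioned reward structure (helpful wants correct-and-convincing, sneaky wants incorrect-but-convincing).

```python
# Toy sketch of one prover-verifier round (assumed structure, not the paper's code).
import random

random.seed(0)

def grade(solution: dict) -> bool:
    """Ground-truth correctness check (stand-in for answer matching)."""
    return solution["correct"]

class Verifier:
    """Toy verifier: its ability to separate correct from incorrect
    solutions grows with the amount of labeled data it has seen."""
    def __init__(self) -> None:
        self.skill = 0.5

    def score(self, solution: dict) -> float:
        # A real verifier is a smaller LM scoring the solution text; here the
        # score is just noisy ground truth, sharpened as skill grows.
        signal = 1.0 if solution["correct"] else -1.0
        return self.skill * signal + random.gauss(0.0, 0.5)

    def train(self, solutions: list[dict]) -> None:
        # Supervised step on balanced correct/incorrect samples (simplified).
        self.skill = min(2.0, self.skill + 0.001 * len(solutions))

class Prover:
    """Toy prover: samples solutions whose correctness depends on its role."""
    def sample(self, role: str) -> dict:
        p_correct = 0.8 if role == "helpful" else 0.2
        return {"role": role, "correct": random.random() < p_correct}

def prover_reward(solution: dict, verifier: Verifier) -> float:
    # Role-conditioned reward: the helpful prover is rewarded for correct
    # solutions the verifier accepts; the sneaky prover for incorrect ones.
    convincing = verifier.score(solution)
    if solution["role"] == "helpful":
        return convincing if grade(solution) else -1.0
    return convincing if not grade(solution) else -1.0

# One round: collect prover samples, train the verifier on them, then
# compute the reward the prover would be optimized against.
verifier, prover = Verifier(), Prover()
samples = [prover.sample(random.choice(["helpful", "sneaky"])) for _ in range(200)]
verifier.train(samples)
rewards = [prover_reward(s, verifier) for s in samples]
print(f"mean prover reward this round: {sum(rewards) / len(rewards):.2f}")
```

In the paper's setup this loop is repeated for several rounds, with the verifier retrained each round on solutions from earlier provers, which is what drives the reported increase in verifier robustness over time.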