1 Aug 2024 | Jan Hendrik Kirchner*, Yining Chen*, Harri Edwards†, Jan Leike†, Nat McAleese, Yuri Burda†
The paper "Prover-Verifier Games Improve Legibility of LLM Outputs" by Jan Hendrik Kirchner, Yining Chen, Harri Edwards, Jan Leike, Nat McAleese, and Yuri Burda from OpenAI explores methods to enhance the legibility of outputs from Large Language Models (LLMs). The authors propose a training algorithm inspired by the Prover-Verifier Game (PVG) to improve the clarity and checkability of LLM solutions. They find that optimizing solutions for correctness alone can lead to less legible outputs, which are difficult for humans to evaluate within time constraints. To address this, they introduce checkability training, which involves iteratively training small verifiers to predict solution correctness, "helpful" provers to produce correct solutions accepted by the verifier, and "sneaky" provers to produce incorrect solutions that fool the verifier. The results show that the helpful prover's accuracy and the verifier's robustness increase over training, and the legibility of solutions transfers to humans tasked with verifying correctness. The study suggests that training LLMs to be more legible can help with alignment and human oversight, particularly in high-stakes applications.The paper "Prover-Verifier Games Improve Legibility of LLM Outputs" by Jan Hendrik Kirchner, Yining Chen, Harri Edwards, Jan Leike, Nat McAleese, and Yuri Burda from OpenAI explores methods to enhance the legibility of outputs from Large Language Models (LLMs). The authors propose a training algorithm inspired by the Prover-Verifier Game (PVG) to improve the clarity and checkability of LLM solutions. They find that optimizing solutions for correctness alone can lead to less legible outputs, which are difficult for humans to evaluate within time constraints. To address this, they introduce checkability training, which involves iteratively training small verifiers to predict solution correctness, "helpful" provers to produce correct solutions accepted by the verifier, and "sneaky" provers to produce incorrect solutions that fool the verifier. The results show that the helpful prover's accuracy and the verifier's robustness increase over training, and the legibility of solutions transfers to humans tasked with verifying correctness. The study suggests that training LLMs to be more legible can help with alignment and human oversight, particularly in high-stakes applications.