Learning Planning-based Reasoning by Trajectories Collection and Process Reward Synthesizing

15 Apr 2024 | Fangkai Jiao, Chengwei Qin, Zhengyuan Liu, Nancy F. Chen, Shafiq Joty
This paper proposes a framework for learning planning-based reasoning through trajectory collection and process reward synthesizing. The method first collects reasoning trajectories from large language models (LLMs), then estimates a reward for each intermediate step via offline simulation and uses these synthesized rewards to train a process reward model. The collected trajectories are ranked by their cumulative process rewards, and the policy model is optimized with Direct Preference Optimization (DPO) to increase the probability of higher-reward reasoning paths. The goal is to improve the reliability and faithfulness of the rationales generated by LLMs while reducing reliance on human annotation of intermediate steps.

On logical reasoning benchmarks, the framework outperforms strong baselines such as GPT-3.5-Turbo, and it also proves effective on mathematical reasoning tasks. Beyond answer accuracy, the paper includes an auto-evaluation of rationale quality using GPT-4 and an analysis of the predicted process rewards, showing that the method produces higher-quality, more concise rationales. The training procedure is efficient and reduces annotation and training costs, making it a promising approach for improving the reasoning capabilities of LLMs.
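As a rough illustration of the offline-simulation idea, the sketch below scores each intermediate step by the fraction of sampled continuations that reach the correct final answer. The helpers rollout and extract_answer, and the specific scoring rule, are illustrative assumptions rather than the paper's exact procedure.

from typing import Callable, List

def synthesize_process_rewards(
    question: str,
    steps: List[str],                      # intermediate reasoning steps of one collected trajectory
    gold_answer: str,
    rollout: Callable[[str], str],         # hypothetical helper: samples a completion from a partial trajectory
    extract_answer: Callable[[str], str],  # hypothetical helper: parses the final answer from a completion
    n_rollouts: int = 8,
) -> List[float]:
    """Assign each step prefix a reward equal to the fraction of sampled
    continuations that reach the correct final answer (an outcome-based
    proxy for the quality of the partial reasoning path)."""
    rewards = []
    for t in range(len(steps)):
        prefix = question + "\n" + "\n".join(steps[: t + 1])
        hits = sum(
            1 for _ in range(n_rollouts)
            if extract_answer(rollout(prefix)) == gold_answer
        )
        rewards.append(hits / n_rollouts)
    return rewards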
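Similarly, the preference-optimization stage can be pictured with the standard DPO loss applied to trajectory pairs ranked by cumulative synthesized reward. The pairing heuristic (a fixed reward margin) and the function names are assumptions for illustration; only the general DPO formulation is taken from the literature.

import torch
import torch.nn.functional as F

def dpo_loss(
    policy_logp_chosen: torch.Tensor,    # log pi_theta(y_w | x), summed over tokens
    policy_logp_rejected: torch.Tensor,  # log pi_theta(y_l | x)
    ref_logp_chosen: torch.Tensor,       # log pi_ref(y_w | x), frozen reference model
    ref_logp_rejected: torch.Tensor,     # log pi_ref(y_l | x)
    beta: float = 0.1,
) -> torch.Tensor:
    """Standard DPO objective: increase the likelihood of the higher-reward
    trajectory y_w relative to the lower-reward trajectory y_l."""
    chosen_logratio = policy_logp_chosen - ref_logp_chosen
    rejected_logratio = policy_logp_rejected - ref_logp_rejected
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()

def make_preference_pairs(trajectories, cumulative_rewards, margin=0.2):
    """Pair trajectories for the same question whenever their cumulative
    synthesized rewards differ by at least `margin` (an illustrative heuristic)."""
    scored = sorted(zip(trajectories, cumulative_rewards), key=lambda x: x[1], reverse=True)
    pairs = []
    for i in range(len(scored)):
        for j in range(i + 1, len(scored)):
            if scored[i][1] - scored[j][1] >= margin:
                pairs.append((scored[i][0], scored[j][0]))  # (chosen, rejected)
    return pairs

Here the chosen and rejected log-probabilities would be obtained by summing per-token log-probabilities of each full trajectory under the current policy and a frozen reference copy of the model.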