Learning Planning-based Reasoning by Trajectories Collection and Process Reward Synthesizing

15 Apr 2024 | Fangkai Jiao, Chengwei Qin, Zhengyuan Liu, Nancy F. Chen, Shafiq Joty
This paper proposes a framework for learning planning-based reasoning through trajectory collection and process reward synthesizing. The method first collects reasoning trajectories from large language models (LLMs), then estimates a reward for each intermediate step via offline simulation and uses these synthesized rewards to train a process reward model. The collected trajectories are ranked by their cumulative process rewards, and the policy model is optimized with Direct Preference Optimization (DPO) to increase the probability of higher-reward reasoning paths. The goal is to improve the reliability and faithfulness of the rationales generated by LLMs while reducing reliance on human annotation of intermediate steps.

On logical reasoning benchmarks, the framework outperforms strong baselines such as GPT-3.5-Turbo, and it also proves effective on mathematical reasoning tasks. Beyond answer accuracy, the paper includes an auto-evaluation of rationale quality using GPT-4 and an analysis of the predicted process rewards, showing that the method produces higher-quality, more concise rationales. The training procedure is efficient and reduces annotation and training costs, making it a promising approach for improving the reasoning capabilities of LLMs.
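As a rough illustration of the offline-simulation idea, the sketch below scores each intermediate step by the fraction of sampled continuations that reach the correct final answer. The helpers rollout and extract_answer, and the specific scoring rule, are illustrative assumptions rather than the paper's exact procedure.

from typing import Callable, List

def synthesize_process_rewards(
    question: str,
    steps: List[str],                      # intermediate reasoning steps of one collected trajectory
    gold_answer: str,
    rollout: Callable[[str], str],         # hypothetical helper: samples a completion from a partial trajectory
    extract_answer: Callable[[str], str],  # hypothetical helper: parses the final answer from a completion
    n_rollouts: int = 8,
) -> List[float]:
    """Assign each step prefix a reward equal to the fraction of sampled
    continuations that reach the correct final answer (an outcome-based
    proxy for the quality of the partial reasoning path)."""
    rewards = []
    for t in range(len(steps)):
        prefix = question + "\n" + "\n".join(steps[: t + 1])
        hits = sum(
            1 for _ in range(n_rollouts)
            if extract_answer(rollout(prefix)) == gold_answer
        )
        rewards.append(hits / n_rollouts)
    return rewards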
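Similarly, the preference-optimization stage can be pictured with the standard DPO loss applied to trajectory pairs ranked by cumulative synthesized reward. The pairing heuristic (a fixed reward margin) and the function names are assumptions for illustration; only the general DPO formulation is taken from the literature.

import torch
import torch.nn.functional as F

def dpo_loss(
    policy_logp_chosen: torch.Tensor,    # log pi_theta(y_w | x), summed over tokens
    policy_logp_rejected: torch.Tensor,  # log pi_theta(y_l | x)
    ref_logp_chosen: torch.Tensor,       # log pi_ref(y_w | x), frozen reference model
    ref_logp_rejected: torch.Tensor,     # log pi_ref(y_l | x)
    beta: float = 0.1,
) -> torch.Tensor:
    """Standard DPO objective: increase the likelihood of the higher-reward
    trajectory y_w relative to the lower-reward trajectory y_l."""
    chosen_logratio = policy_logp_chosen - ref_logp_chosen
    rejected_logratio = policy_logp_rejected - ref_logp_rejected
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()

def make_preference_pairs(trajectories, cumulative_rewards, margin=0.2):
    """Pair trajectories for the same question whenever their cumulative
    synthesized rewards differ by at least `margin` (an illustrative heuristic)."""
    scored = sorted(zip(trajectories, cumulative_rewards), key=lambda x: x[1], reverse=True)
    pairs = []
    for i in range(len(scored)):
        for j in range(i + 1, len(scored)):
            if scored[i][1] - scored[j][1] >= margin:
                pairs.append((scored[i][0], scored[j][0]))  # (chosen, rejected)
    return pairs

Here the chosen and rejected log-probabilities would be obtained by summing per-token log-probabilities of each full trajectory under the current policy and a frozen reference copy of the model.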