23 Apr 2024 | Amir Saeidi, Shivanshu Verma, Chitta Baral
This paper evaluates the performance of Direct Preference Optimization (DPO) and its variants across multiple tasks, including dialogue systems, reasoning, mathematical problem-solving, question answering, truthfulness, and multi-task understanding. The study investigates the effectiveness of alignment methods under three scenarios: (1) using Supervised Fine-Tuning (SFT), (2) skipping SFT, and (3) skipping SFT and using an instruction-tuned model. The evaluation covers 13 benchmarks, including MT-Bench, Big Bench, and Open LLM Leaderboard.
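For readers unfamiliar with the core objective that these variants build on, here is a minimal PyTorch sketch of the pairwise DPO loss, computed from per-sequence log-probabilities under the policy and a frozen reference model. The function name, tensor shapes, and default beta are illustrative assumptions for this sketch, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Pairwise DPO loss from summed per-sequence log-probs (illustrative sketch).

    Each argument is a 1-D tensor of shape (batch,) holding log pi(y|x) summed
    over the response tokens; beta controls the strength of the implicit KL
    penalty that keeps the policy close to the reference model.
    """
    chosen_logratios = policy_chosen_logps - ref_chosen_logps
    rejected_logratios = policy_rejected_logps - ref_rejected_logps
    # -log sigmoid(beta * margin of chosen over rejected log-ratios)
    logits = beta * (chosen_logratios - rejected_logratios)
    return -F.logsigmoid(logits).mean()

# Toy usage with random per-sequence log-probs for a batch of 4 preference pairs
if __name__ == "__main__":
    b = 4
    loss = dpo_loss(torch.randn(b), torch.randn(b), torch.randn(b), torch.randn(b))
    print(loss.item())
```

The three scenarios studied in the paper do not change this loss; they change which checkpoint (base, SFT-tuned, or instruction-tuned) initializes the policy and the reference model before the preference objective is optimized.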
Key findings show that alignment methods perform best with smaller training data subsets and are strongest on mathematical problem-solving and truthfulness, but show limited effectiveness on reasoning tasks. KTO outperforms the other methods on most benchmarks, including MT-Bench, and is particularly strong at mathematical problem-solving. CPO and IPO also perform well, especially when trained on smaller data subsets. The study highlights that instruction-tuned models have a significant impact on truthfulness, while SFT remains effective for multi-task understanding.
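To make the differences among the variants mentioned above concrete, the sketch below writes out the IPO and CPO objectives in the same style as the DPO sketch. These are simplified renderings of the published objectives rather than the authors' code; function names, shapes, and default hyperparameters are assumptions, and KTO is omitted because it is trained on unpaired desirable/undesirable examples rather than preference pairs.

```python
import torch
import torch.nn.functional as F

def ipo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, tau=0.1):
    # IPO regresses the log-ratio margin toward 1/(2*tau) instead of pushing it
    # through a sigmoid, which avoids DPO's tendency to grow the margin without
    # bound when preferences are near-deterministic.
    margin = ((policy_chosen_logps - ref_chosen_logps)
              - (policy_rejected_logps - ref_rejected_logps))
    return ((margin - 1.0 / (2.0 * tau)) ** 2).mean()

def cpo_loss(policy_chosen_logps, policy_rejected_logps,
             beta=0.1, nll_weight=1.0):
    # CPO drops the reference model entirely and adds a negative log-likelihood
    # term on the preferred response to keep generations fluent.
    pref_term = -F.logsigmoid(beta * (policy_chosen_logps - policy_rejected_logps))
    nll_term = -policy_chosen_logps
    return (pref_term + nll_weight * nll_term).mean()
```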
The results indicate that skipping the SFT phase can improve performance on certain benchmarks, such as TruthfulQA and GSM8K, although SFT remains superior for multi-task understanding. The study also finds that alignment methods are sensitive to the volume of training data, with smaller training subsets yielding better results. Overall, the findings suggest that alignment methods can achieve strong performance across a variety of tasks, but their effectiveness depends on the specific task and the training data used. The study contributes to the understanding of alignment methods and their potential for improving model performance on alignment challenges.