This paper investigates the effectiveness of multiple-choice questions (MCQs) in evaluating large language models (LLMs) and compares them with long-form generation questions (LFGQs). The authors conduct experiments on four datasets in both Chinese and English to assess how the MCQ format affects measured LLM performance. Key findings include:
1. **Order Sensitivity**: LLMs exhibit a strong preference for specific answer positions in MCQs, particularly favoring the first position (a minimal way to check this bias, together with the correlation check in point 2, is sketched after this list).
2. **Low Correlation**: Answers from MCQs and LFGQs for identical questions show a low correlation, indicating that MCQs do not accurately reflect LFGQ performance.
3. **Consistency and Accuracy**: Higher consistency does not necessarily indicate better accuracy; an LLM can answer the same question consistently and still be consistently wrong (see the sketch at the end of this section).
4. **Misalignment in Embedding Space**: The misalignment between MCQs and LFGQs is evident not only in evaluation performance but also in the embedding space.
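To make the first two findings concrete, the following is a minimal sketch, not the paper's code: it assumes a small set of per-question records with a hypothetical `picked_position` field (which option slot the model chose under a fixed option order) and 0/1 correctness labels under each format. The mock records, field names, and baseline values are illustrative assumptions.

```python
# Minimal sketch (mock data, not the paper's code) of two checks:
# (1) positional bias in MCQ answers and (2) the per-question correlation
# between MCQ and LFGQ correctness.

from math import sqrt

# Each record describes one question: which option position (0-3) the model
# picked under a fixed option order, plus 0/1 correctness under both formats.
records = [
    {"picked_position": 0, "mcq_correct": 1, "lfgq_correct": 0},
    {"picked_position": 0, "mcq_correct": 0, "lfgq_correct": 1},
    {"picked_position": 1, "mcq_correct": 1, "lfgq_correct": 1},
    {"picked_position": 0, "mcq_correct": 1, "lfgq_correct": 0},
    {"picked_position": 2, "mcq_correct": 0, "lfgq_correct": 0},
    {"picked_position": 0, "mcq_correct": 1, "lfgq_correct": 1},
]

# (1) Positional bias: with four options and no bias, the first slot should be
# chosen about 25% of the time; a much higher rate indicates order sensitivity.
first_rate = sum(r["picked_position"] == 0 for r in records) / len(records)
print(f"first-position selection rate: {first_rate:.2f} (unbiased baseline: 0.25)")

# (2) Pearson correlation between the 0/1 correctness vectors of the two
# formats; a value near 0 means MCQ accuracy says little about LFGQ accuracy.
def pearson(xs, ys):
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

mcq = [r["mcq_correct"] for r in records]
lfgq = [r["lfgq_correct"] for r in records]
print(f"MCQ vs. LFGQ correctness correlation: {pearson(mcq, lfgq):.2f}")
```

With the mock data above, the first slot is chosen far more often than the unbiased 25% baseline and the correctness correlation comes out near zero, which is the pattern the first two findings describe.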
The authors recommend using LFGQs as the primary evaluation format for LLMs, as they better align with real-world use cases and provide more accurate assessments. They also suggest that the choice of evaluation format should be aligned with the type of knowledge being evaluated, and that consistency should not be the sole metric for evaluating LLMs.
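As a small illustration of the consistency point, the sketch below uses mock repeated samples per question; here "consistency" is taken to be the average agreement of samples with their majority answer, and "accuracy" the fraction of questions whose majority answer matches the gold label. Both definitions and the data are assumptions for illustration, not the paper's exact metrics.

```python
# Minimal sketch (mock data, not the paper's code): a model can be highly
# consistent across repeated samples and still have low accuracy.

from collections import Counter

# For each question: five sampled answers and the gold answer (mock data).
samples = [
    {"answers": ["B", "B", "B", "B", "B"], "gold": "A"},  # consistent but wrong
    {"answers": ["C", "C", "C", "C", "C"], "gold": "C"},  # consistent and right
    {"answers": ["A", "B", "A", "C", "A"], "gold": "A"},  # inconsistent, majority right
    {"answers": ["D", "D", "D", "D", "D"], "gold": "B"},  # consistent but wrong
]

def majority(answers):
    return Counter(answers).most_common(1)[0][0]

per_question_consistency = []
per_question_correct = []
for q in samples:
    maj = majority(q["answers"])
    per_question_consistency.append(q["answers"].count(maj) / len(q["answers"]))
    per_question_correct.append(maj == q["gold"])

consistency = sum(per_question_consistency) / len(per_question_consistency)
accuracy = sum(per_question_correct) / len(per_question_correct)
print(f"consistency: {consistency:.2f}, accuracy: {accuracy:.2f}")  # 0.90 vs 0.50
```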