7 Mar 2024 | Heydar Soudani, Evangelos Kanoulas, Faegheh Hasibi
This paper evaluates the effectiveness of Retrieval Augmented Generation (RAG) and fine-tuning (FT) in enhancing large language models (LLMs) for handling less popular or low-frequency entities in question answering (QA) tasks. The study focuses on the PopQA dataset, which includes questions covering a long-tail entity distribution, and measures the impact of these approaches on the performance of LLMs across different popularity levels of entities.
Key findings include:
1. **Fine-Tuning (FT)**: Significantly improves accuracy across all entity popularity levels, especially in the most and least popular categories.
2. **Retrieval Augmented Generation (RAG)**: Outperforms FT alone; combining RAG with FT helps most in smaller models, though this advantage diminishes in larger models.
3. **Model Size**: Larger models benefit less from FT than smaller models do.
4. **Data Augmentation**: The quality of synthetic data generated through data augmentation methods significantly affects performance.
5. **Retrieval Model Performance**: Higher performance of retrieval models leads to better overall QA accuracy.
The study concludes that while both RAG and FT are effective, RAG, especially when combined with FT, is more beneficial for handling less popular knowledge. The success of these approaches is also influenced by advancements in retrieval and data augmentation techniques. The code and data for the experiments are available at <https://github.com/informagi/RAGvsFT>.
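The RAG pattern discussed above can be illustrated with a minimal sketch: retrieve the passage most relevant to a question, then prepend it as context to the prompt handed to the LLM. The toy corpus, overlap-based scoring, and prompt template below are illustrative assumptions, not the paper's actual pipeline.

```python
import re

def retrieve(question: str, corpus: list[str], k: int = 1) -> list[str]:
    """Rank passages by content-word overlap with the question (toy retriever)."""
    def terms(text: str) -> set[str]:
        # Keep only words longer than 3 chars as a crude stopword filter.
        return {w for w in re.findall(r"\w+", text.lower()) if len(w) > 3}
    q_terms = terms(question)
    ranked = sorted(corpus, key=lambda p: len(q_terms & terms(p)), reverse=True)
    return ranked[:k]

def build_rag_prompt(question: str, corpus: list[str]) -> str:
    """Prepend retrieved passages so the model can answer from the context."""
    context = "\n".join(retrieve(question, corpus))
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"

corpus = [
    "George Rankin is a politician.",
    "The Eiffel Tower is located in Paris.",
]
prompt = build_rag_prompt("What is the occupation of George Rankin?", corpus)
print(prompt)
```

In a real system the retriever would be a sparse or dense ranker over a large corpus, and the prompt would be sent to an LLM; the point here is only the shape of the pipeline that the paper compares against fine-tuning.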