2024 | Sujoy Roychowdhury, Sumit Soman, H G Ranjani, Neeraj Gunda, Vansh Chhabra, Sai Krishna Bala
This paper evaluates Retrieval-Augmented Generation (RAG) metrics for Question Answering (QA) in the telecom domain. RAG, which combines a retriever with a generative model, is widely used for QA tasks, but evaluating RAG responses remains challenging, especially in specialized domains such as telecom. The RAGAS framework is a popular tool for evaluating RAG responses, but it lacks transparency in how its metrics are derived. This study modifies RAGAS to expose intermediate outputs for several metrics, including faithfulness, context relevance, answer relevance, answer similarity, factual correctness, and answer correctness.
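As a rough illustration of the kind of intermediate computation the modified framework exposes, a RAGAS-style faithfulness score can be approximated as the fraction of claims in the generated answer that are supported by the retrieved context. The function and example verdicts below are our own illustrative sketch, not the paper's code:

```python
def faithfulness_score(claims_supported: list[bool]) -> float:
    """Fraction of answer claims judged supported by the retrieved context.

    In RAGAS, an LLM decomposes the answer into claims and verifies each
    one against the context; here the per-claim verdicts are given directly.
    """
    if not claims_supported:
        return 0.0
    return sum(claims_supported) / len(claims_supported)

# Hypothetical verdicts for three claims extracted from a generated answer:
print(faithfulness_score([True, True, False]))  # → 0.6666666666666666
```

Exposing these per-claim verdicts, rather than only the final ratio, is what makes the metric auditable in a domain where experts can check each claim against the source documents.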
The study analyzes the performance of these metrics in the telecom domain, focusing on how they behave under correct and incorrect retrieval. It also examines the impact of domain adaptation on these metrics. The results show that some metrics, such as factual correctness and faithfulness, are good indicators of RAG response quality. However, other metrics, like answer relevance and context relevance, are less reliable and may not accurately reflect the quality of the response.
The study uses TeleQuAD, a telecom QA dataset, and evaluates different retriever and generator models. It finds that instruction fine-tuning the generator model improves scores on some metrics. The study also highlights the limitations of using cosine similarity to measure answer similarity, since a high cosine score between embeddings may not reflect true semantic similarity.
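The cosine-similarity caveat can be seen with a toy sketch: cosine similarity measures only the angle between embedding vectors, so two answers whose embeddings point in similar directions score highly even when their meanings differ (e.g. an answer that negates the reference). The vectors below are hypothetical stand-ins, not real sentence embeddings:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical embeddings: a reference answer and a generated answer that
# contradicts it can still land close together in embedding space when
# they share most of their surface terms.
ref_answer = [0.9, 0.1, 0.3]
gen_answer = [0.8, 0.2, 0.4]
print(round(cosine_similarity(ref_answer, gen_answer), 3))  # → 0.984
```

A score near 1.0 here would suggest near-identical answers, which is exactly the failure mode the paper flags for domain-specific terminology.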
The results indicate that faithfulness (FaiFul) and factual correctness (FacCor) are the metrics best aligned with human expert judgment, making them the more reliable choices for evaluating RAG responses in the telecom domain. The study concludes that while RAGAS metrics can be useful for evaluating RAG responses, they have limitations, particularly in handling domain-specific terminology, and suggests that further research is needed to improve the evaluation of RAG systems in specialized domains.