2024 | Sujoy Roychowdhury, Sumit Soman, H G Ranjani, Neeraj Gunda, Vansh Chhabra, Sai Krishna Bala
This paper evaluates Retrieval-Augmented Generation (RAG) metrics for Question Answering (QA) in the telecom domain. RAG, which combines a retriever with a generative model, is widely used for QA tasks, but evaluating RAG responses remains challenging, especially in specialized domains such as telecom. The RAGAS framework is a popular tool for evaluating RAG responses, but it lacks transparency in how its metrics are derived. This study modifies RAGAS to expose intermediate outputs for several metrics, including faithfulness, context relevance, answer relevance, answer similarity, factual correctness, and answer correctness.
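As a rough illustration of the kind of intermediate computation the modified framework exposes, a RAGAS-style faithfulness score can be approximated as the fraction of claims in the generated answer that are supported by the retrieved context. The function and example verdicts below are our own illustrative sketch, not the paper's code:

```python
def faithfulness_score(claims_supported: list[bool]) -> float:
    """Fraction of answer claims judged supported by the retrieved context.

    In RAGAS, an LLM decomposes the answer into claims and verifies each
    one against the context; here the per-claim verdicts are given directly.
    """
    if not claims_supported:
        return 0.0
    return sum(claims_supported) / len(claims_supported)

# Hypothetical verdicts for three claims extracted from a generated answer:
print(faithfulness_score([True, True, False]))  # → 0.6666666666666666
```

Exposing these per-claim verdicts, rather than only the final ratio, is what makes the metric auditable in a domain where experts can check each claim against the source documents.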
The study analyzes the performance of these metrics in the telecom domain, focusing on how they behave under correct and incorrect retrieval. It also examines the impact of domain adaptation on these metrics. The results show that some metrics, such as factual correctness and faithfulness, are good indicators of RAG response quality. However, other metrics, like answer relevance and context relevance, are less reliable and may not accurately reflect the quality of the response.
The study uses TeleQuAD, a telecom QA dataset, and evaluates different retriever and generator models. It finds that instruction fine-tuning the generator model improves scores on some metrics. The study also highlights the limitations of using cosine similarity to measure answer similarity, since a high cosine score between embeddings may not reflect true semantic similarity.
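The cosine-similarity caveat can be seen with a toy sketch: cosine similarity measures only the angle between embedding vectors, so two answers whose embeddings point in similar directions score highly even when their meanings differ (e.g. an answer that negates the reference). The vectors below are hypothetical stand-ins, not real sentence embeddings:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical embeddings: a reference answer and a generated answer that
# contradicts it can still land close together in embedding space when
# they share most of their surface terms.
ref_answer = [0.9, 0.1, 0.3]
gen_answer = [0.8, 0.2, 0.4]
print(round(cosine_similarity(ref_answer, gen_answer), 3))  # → 0.984
```

A score near 1.0 here would suggest near-identical answers, which is exactly the failure mode the paper flags for domain-specific terminology.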
The results indicate that faithfulness (FaiFul) and factual correctness (FacCor) are the metrics best aligned with human expert judgment, making them the more reliable choices for evaluating RAG responses in the telecom domain. The study concludes that while RAGAS metrics can be useful for evaluating RAG responses, they have limitations, particularly in handling domain-specific terminology, and suggests that further research is needed to improve the evaluation of RAG systems in specialized domains.