Understanding ChatQA: Surpassing GPT-4 on Conversational QA and RAG

ChatQA is a family of models that outperform GPT-4 on conversational question answering (QA) and retrieval-augmented generation (RAG). Generation is improved with a two-stage instruction tuning method, and retrieval is handled by a dense retriever optimized for conversational QA. The retriever performs comparably to state-of-the-art query rewriting models at a lower deployment cost. On CHATRAG BENCH, a benchmark of ten datasets covering RAG, table-related QA, arithmetic calculations, and unanswerable scenarios, ChatQA-1.0-70B (based on Llama2) slightly outperforms GPT-4-0613 and GPT-4-Turbo, and Llama3-ChatQA-1.5-70B surpasses GPT-4-Turbo by 4.4%.

The study also shows that incorporating unanswerable samples into training improves the model's ability to handle questions that the provided context cannot answer. These results are achieved without relying on synthetic data from OpenAI GPT models, and comparisons with other state-of-the-art models are reported across conversational QA, table reasoning, arithmetic calculations, and unanswerable scenarios. Overall, the study highlights the effectiveness of the two-stage instruction tuning method and the importance of high-quality data in improving model performance. The models are open-sourced, along with the training data, instruction tuning data, CHATRAG BENCH, and retrievers.
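To make the retrieval idea concrete, here is a minimal sketch of retrieving with the dialogue history instead of calling a separate query rewriting model. It assumes the retriever input is formed by concatenating recent turns with the current question, which this summary does not spell out. The names `build_query`, `embed`, `cosine`, and `retrieve` are hypothetical, and the bag-of-words `embed` is only a toy stand-in for a trained dense encoder.

```python
import math
from collections import Counter


def build_query(history, question, max_turns=3):
    """Concatenate the last few (role, text) turns with the current question.

    Assumption for illustration: the retriever consumes one flat string.
    """
    recent = history[-max_turns:] if max_turns > 0 else []
    parts = [f"{role}: {text}" for role, text in recent]
    parts.append(f"user: {question}")
    return " ".join(parts)


def embed(text):
    """Toy stand-in for a dense encoder: a bag-of-words count vector."""
    tokens = [t.strip(".,?!:").lower() for t in text.split()]
    return Counter(t for t in tokens if t)


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, passages, k=1):
    """Return the k passages most similar to the query."""
    q = embed(query)
    ranked = sorted(passages, key=lambda p: cosine(q, embed(p)), reverse=True)
    return ranked[:k]
```

With a follow-up question such as "How does it compare with GPT-4?", the question alone carries no mention of the model under discussion, while the concatenated query does. This single-pass design needs no extra rewriting call at inference time, which is one plausible reading of the lower deployment cost mentioned above.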

ChatQA: Surpassing GPT-4 on Conversational QA and RAG

22 May 2024 | Zihan Liu, Wei Ping, Rajarshi Roy, Peng Xu, Chankyu Lee, Mohammad Shoeybi, Bryan Catanzaro