Generalizing Conversational Dense Retrieval via LLM-Cognition Data Augmentation


4 Jun 2024 | Haonan Chen, Zhicheng Dou, Kelong Mao, Jiongnan Liu, Ziliang Zhao
This paper addresses the challenge of conversational dense retrieval, which aims to retrieve relevant passages from multi-turn natural language contexts. Existing models often struggle with data sparsity and the variability of conversational formats, leading to poor generalization in real-world scenarios. To tackle this, the authors propose CONVAUG, a framework that uses LLMs to generate multi-level augmented conversations. The framework includes a cognition-aware prompting process to mitigate false positives, false negatives, and hallucinations, and a difficulty-adaptive sample filter to select challenging samples for complex conversations. A contrastive learning objective is employed to train a robust conversational context encoder. Extensive experiments on four public datasets demonstrate the effectiveness, generalizability, and applicability of CONVAUG. The contributions of the work include an LLM-based multi-level data augmentation framework, a cognition-aware prompting process, and a difficulty-adaptive sample filter. The code for CONVAUG is available at <https://github.com/haon-chen/ConvAug>.
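The exact training objective is defined in the paper and its released code; as a rough, hypothetical illustration of how a contrastive objective over LLM-augmented conversation views might be set up, here is a minimal PyTorch sketch (the function name, tensor shapes, and temperature value are assumptions, not taken from ConvAug):

```python
import torch
import torch.nn.functional as F

def contrastive_loss(ctx_emb, pos_emb, neg_emb, temperature=0.05):
    """InfoNCE-style loss over conversational context embeddings.

    ctx_emb: (B, d)    embeddings of the original conversation contexts
    pos_emb: (B, d)    embeddings of LLM-augmented positive views
    neg_emb: (B, K, d) embeddings of hard (augmented) negative views

    Generic sketch only; ConvAug's actual loss and its
    difficulty-adaptive sample selection may differ.
    """
    ctx = F.normalize(ctx_emb, dim=-1)
    pos = F.normalize(pos_emb, dim=-1)
    neg = F.normalize(neg_emb, dim=-1)

    # Similarity of each context to its positive view: (B, 1)
    pos_sim = (ctx * pos).sum(dim=-1, keepdim=True) / temperature
    # Similarity of each context to its K negative views: (B, K)
    neg_sim = torch.einsum("bd,bkd->bk", ctx, neg) / temperature

    # The positive is placed at index 0 of each row of logits.
    logits = torch.cat([pos_sim, neg_sim], dim=1)
    labels = torch.zeros(ctx.size(0), dtype=torch.long, device=ctx.device)
    return F.cross_entropy(logits, labels)
```

In ConvAug, the positive and negative views in such a loss would come from the multi-level LLM-generated augmentations, with the difficulty-adaptive sample filter deciding which of the harder samples to keep for complex conversations.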