26 Feb 2024 | Sohee Yang, Elena Gribovskaya, Nora Kassner, Mor Geva, Sebastian Riedel
The paper investigates whether Large Language Models (LLMs) perform latent multi-hop reasoning when given complex prompts. Specifically, it examines whether an LLM can internally identify a "bridge entity" (e.g., Stevie Wonder) and use its knowledge of that entity to complete a two-hop prompt such as "The mother of the singer of 'Superstition' is." The authors introduce the TWOHOPFACT dataset, which consists of 45,595 two-hop prompts spanning 52 fact composition types, and propose metrics that measure internal entity recall and consistency. They find strong evidence of latent multi-hop reasoning for certain fact composition types, with the reasoning pathway used in more than 80% of the prompts; however, utilization of this pathway is highly contextual and varies across prompt types. Evidence for the first hop is substantial, while evidence for the second hop and for the full multi-hop traversal is only moderate. Moreover, there is a clear scaling trend with increasing model size for the first hop but not for the second. The findings point to both challenges and opportunities for the future development and application of LLMs.
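To make the setup concrete, below is a minimal sketch of a behavioral consistency check between a two-hop prompt and its corresponding one-hop prompt, assuming a generic Hugging Face causal LM. The model name, prompt strings, and exact-string comparison are illustrative stand-ins, not the paper's actual TWOHOPFACT pipeline or its metric definitions (which operate on the model's internal representations and output distributions rather than decoded strings).

```python
# A minimal sketch of a consistency-style check for latent two-hop reasoning.
# Assumptions: a generic Hugging Face causal LM ("gpt2" as a placeholder) and
# illustrative prompt templates; the paper's consistency score compares the
# model's output distributions, not greedy strings as done here.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"  # placeholder model for illustration
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

def greedy_completion(prompt: str, max_new_tokens: int = 8) -> str:
    """Greedily decode a short continuation of the prompt."""
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        output = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            do_sample=False,
            pad_token_id=tokenizer.eos_token_id,
        )
    new_tokens = output[0, inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True).strip()

# Two facts sharing a bridge entity: ("Superstition", singer, Stevie Wonder)
# and (Stevie Wonder, mother, Lula Mae Hardaway).
bridge_entity = "Stevie Wonder"
two_hop_prompt = "The mother of the singer of 'Superstition' is"
one_hop_prompt = f"The mother of {bridge_entity} is"

# If the model resolves the two-hop prompt through the bridge entity rather
# than a shortcut, its two completions should agree.
two_hop_answer = greedy_completion(two_hop_prompt)
one_hop_answer = greedy_completion(one_hop_prompt)
print(f"two-hop: {two_hop_answer!r}")
print(f"one-hop: {one_hop_answer!r}")
print("consistent:", two_hop_answer == one_hop_answer)
```

Agreement between the two completions is only weak behavioral evidence; the paper's internal entity recall metric additionally checks whether the bridge entity is recalled in the model's hidden states while processing the two-hop prompt.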