26 Feb 2024 | Sohee Yang, Elena Gribovskaya, Nora Kassner, Mor Geva, Sebastian Riedel
This paper investigates whether large language models (LLMs) perform latent multi-hop reasoning when given complex prompts. The study focuses on prompts that require two-step reasoning, such as "The mother of the singer of 'Superstition' is". The researchers analyze whether LLMs can identify the bridge entity (e.g., Stevie Wonder) in the first hop and then use knowledge about that entity to complete the prompt in the second hop.
The study introduces the TwoHopFact dataset, which contains 45,595 two-hop prompts spanning 52 fact composition types. The researchers test how often LLMs process these prompts via a latent two-hop reasoning pathway, using two metrics: an internal entity recall score and a consistency score. The internal entity recall score measures how strongly an LLM internally recalls the bridge entity when the prompt only describes it rather than naming it. The consistency score measures how consistently the LLM applies its knowledge about the bridge entity when completing the two-hop prompt, compared with a prompt that names the entity directly.
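The two metrics can be illustrated with a toy sketch. This is not the paper's exact procedure: the helper names are hypothetical, the recall score here is a simple logit-lens-style projection of an intermediate hidden state onto the vocabulary, and the consistency score is approximated as overlap between two answer distributions.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a logit vector.
    e = np.exp(x - x.max())
    return e / e.sum()

def internal_recall_score(hidden_state, unembedding, bridge_token_id):
    """Toy recall score: project an intermediate hidden state to the
    vocabulary (logit-lens style) and return the probability assigned
    to the bridge entity's first token. Hypothetical simplification."""
    logits = unembedding @ hidden_state
    return softmax(logits)[bridge_token_id]

def consistency_score(p_two_hop, p_direct):
    """Toy consistency score: overlap between the answer distribution for
    the two-hop prompt and for a prompt naming the bridge entity directly
    (1.0 = identical distributions). Hypothetical simplification."""
    return 1.0 - 0.5 * np.abs(p_two_hop - p_direct).sum()

# Example with random stand-in weights (no real model involved).
rng = np.random.default_rng(0)
vocab_size, d_model = 100, 16
unembedding = rng.normal(size=(vocab_size, d_model))
hidden = rng.normal(size=d_model)

recall = internal_recall_score(hidden, unembedding, bridge_token_id=7)
p = softmax(rng.normal(size=vocab_size))
print(f"recall={recall:.4f}, self-consistency={consistency_score(p, p):.1f}")
```

A high recall score at some layer would suggest the model resolved the bridge entity in the first hop; a high consistency score would suggest it used that entity's attributes in the second hop.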
The results show that LLMs do perform latent multi-hop reasoning for certain types of prompts, with the reasoning pathway used in more than 80% of cases for some fact composition types. However, utilization is highly contextual and varies widely across prompt types. Evidence is substantial for the first hop, but only moderate for the second hop and for the full multi-hop traversal. The study also finds a clear scaling trend with increasing model size for the first hop of reasoning, but not for the second hop.
The findings suggest potential challenges and opportunities for future development and applications of LLMs. The study highlights the importance of understanding how LLMs process complex prompts and the potential for improving their reasoning capabilities through better training and model design.