Examining Two Hop Reasoning Through Information Content Scaling

📅 2025-02-05
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work investigates reasoning bottlenecks in Transformer models for latent two-hop question answering (e.g., “Who is Bob’s mother’s boss?”). To isolate compositional reasoning from memorization, the authors construct a controlled two-hop QA dataset and employ scaling-law analysis coupled with knowledge-capacity modeling. They find that small models predominantly rely on answer memorization rather than functional composition, which forces them to learn the same fact redundantly, once per hop; in contrast, answering with an explicit chain of thought avoids this redundancy. Crucially, the study quantitatively characterizes the regime in which memorization remains cheaper than composition, showing that with appropriate dataset parameters very small models can be trapped in memorization even when composition would serve them better. These results establish capacity scaling as a complement to existing interpretability methods: it identifies scale-dependent transitions in reasoning competence and offers empirical grounding for understanding how language models shift from memorization-dominated to composition-capable behavior.
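
The controlled setup the summary describes can be sketched with a toy generator. The entity names, relations, and question/answer formats below are illustrative assumptions, not the authors' actual data-generation code; the point is only the contrast between the latent format (intermediate entity hidden) and the chain-of-thought format (intermediate entity written out).

```python
# Toy two-hop QA dataset in the spirit of the paper's controlled setup.
# All names and formats are illustrative assumptions.
import random

random.seed(0)

people = [f"person{i}" for i in range(100)]
# First-hop facts: person -> mother; second-hop facts: person -> boss.
mother = {p: random.choice(people) for p in people}
boss = {p: random.choice(people) for p in people}

def latent_example(p):
    # Latent two-hop: the intermediate entity never appears in the target,
    # so the model must compose the two facts internally.
    return (f"Who is {p}'s mother's boss?", boss[mother[p]])

def cot_example(p):
    # Chain of thought: the intermediate entity is stated before the answer,
    # so each fact can be retrieved one hop at a time.
    q = f"Who is {p}'s mother's boss?"
    a = f"{p}'s mother is {mother[p]}. {mother[p]}'s boss is {boss[mother[p]]}."
    return (q, a)
```

A model trained only on the latent format must either compose the relations or memorize each question's answer independently, which is the distinction the scaling analysis exploits.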

📝 Abstract
Prior work has found that transformers have an inconsistent ability to learn to answer latent two-hop questions -- questions of the form "Who is Bob's mother's boss?" We study why this is the case by examining how transformers' capacity to learn datasets of two-hop questions and answers (two-hop QA) scales with their size, motivated by prior work on transformer knowledge capacity for simple factual memorization. We find that capacity scaling and generalization both support the hypothesis that latent two-hop QA requires transformers to learn each fact twice, while two-hop QA with chain of thought does not. We also show that with appropriate dataset parameters, it is possible to "trap" very small models in a regime where they memorize answers to two-hop questions independently, even though they would perform better if they could learn to answer them with function composition. Our findings show that measurement of capacity scaling can complement existing interpretability methods, though there are challenges in using it for this purpose.
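
The "each fact learned twice" hypothesis admits a back-of-envelope reading. The entity count and the bits-per-fact accounting below are illustrative assumptions (the paper's actual capacity estimates come from fitted scaling laws, not this arithmetic), but they show why latent two-hop QA demands roughly double the storage of the chain-of-thought version, and why outright answer memorization can be cheaper still for small models:

```python
# Rough capacity accounting for the "each fact learned twice" hypothesis.
# Numbers are illustrative assumptions, not the paper's measurements.
import math

n_people = 10_000
bits_per_fact = math.log2(n_people)       # bits to pick one person out of n_people
n_facts = 2 * n_people                    # one "mother" and one "boss" fact each

cot_bits = n_facts * bits_per_fact        # chain of thought: each fact stored once
latent_bits = 2 * n_facts * bits_per_fact # latent two-hop: each fact stored twice

# Memorizing each two-hop answer independently needs only one association per
# question -- cheaper than composition, which is the "trap" for small models.
memo_bits = n_people * bits_per_fact
```

Under this accounting, a model with too little capacity for `latent_bits` but enough for `memo_bits` is pushed toward independent answer memorization, matching the trap regime the abstract describes.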
Problem

Research questions and friction points this paper is trying to address.

Transformers' inconsistent two-hop reasoning ability.
Examining capacity scaling for two-hop question answering.
Challenges in using capacity scaling as an interpretability tool.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Latent two-hop QA requires learning each fact twice; chain of thought does not
Dataset parameters can trap small models in answer memorization
Capacity scaling measurement complements interpretability methods