pith. sign in

arxiv: 2605.27596 · v1 · pith:SVFTMSLQnew · submitted 2026-05-26 · 💻 cs.CL

Can Hallucinations Be Useful? Solving Multi-Hop Questions With SLMs By Chaining System-I/II Reasoning

Pith reviewed 2026-06-29 18:25 UTC · model grok-4.3

classification 💻 cs.CL
keywords small language modelsmulti-hop question answeringhallucinationssystem-I system-II reasoninganswer-first reasoningretrieval augmented generationzero-shot answering
0
0 comments X

The pith

Small language models solve multi-hop questions better by answering first with a quick guess then retrieving evidence to refine it.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes inverting the usual strategy for small language models on complex questions. Instead of thinking step by step before looking up information, the model first produces a rapid zero-shot answer that may contain hallucinations, then uses that answer to retrieve supporting evidence from a knowledge source, and finally applies deeper reasoning to that evidence. This combination of quick initial response and evidence-guided refinement outperforms prior think-first methods on multi-step question-answering benchmarks. The approach rests on the observation that SLMs are often confident in their early answers and that those answers can usefully steer retrieval even when imperfect.

Core claim

An answer-first-reason-later framework lets the model generate an initial hypothesis through System-I zero-shot answering, retrieve evidence keyed to that hypothesis, and then conduct System-II style deeper reasoning over the retrieved evidence; this chaining produces higher accuracy on multi-step QA tasks than the conventional think-first followed by iterative retrieval pipeline.

What carries the argument

The answer-first-reason-later framework that chains a System-I quick answer to System-II evidence-based reasoning by using the initial hypothesis as the retrieval query.

If this is right

  • Hallucinations in the initial answer can be turned into a retrieval signal rather than treated only as error.
  • SLMs can reach competitive multi-hop performance without first producing an explicit reasoning chain.
  • The same model can switch between fast hypothesis generation and slower evidence refinement in a single pipeline.
  • Hardware-efficient models become viable for tasks previously thought to require larger models or heavier retrieval loops.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The method may generalize to other retrieval-augmented tasks where an approximate first output can bootstrap search.
  • Measuring the model's in its initial answer could serve as a trigger for whether to invoke the full retrieval-and-reason step.
  • The inversion suggests that early errors need not cascade if retrieval is conditioned directly on the flawed hypothesis.

Load-bearing premise

Small language models generate initial answers that are sufficiently accurate or confidently structured to serve as effective seeds for retrieving useful evidence.

What would settle it

Running the proposed method and a think-first baseline on the same multi-hop QA benchmarks and finding that the answer-first approach yields lower accuracy would falsify the performance claim.

Figures

Figures reproduced from arXiv: 2605.27596 by Saptarshi Sengupta, Suhang Wang.

Figure 1
Figure 1. Figure 1: Current methods alternate between reasoning [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Method Overview. Step 1 (gray provides System-I answer). Steps 2-4 (green) involve System-II operations. 3.2 System-II: Reasoning With Evidence System-II requires a more deliberate and well￾rationed thought process to answer a question. As such, the bulk of the reasoning budget is spent here. This phase consists of three steps: (i) Rea￾soning Triple Generation: Given the initial answer, the SLM explains, u… view at source ↗
Figure 3
Figure 3. Figure 3: Success/failure (left/right) cases of our framework. Further examples contrasting with others in App. [PITH_FULL_IMAGE:figures/full_fig_p008_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Step-by-step reasoning for Fig [PITH_FULL_IMAGE:figures/full_fig_p013_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: System-I reasoning for Fig [PITH_FULL_IMAGE:figures/full_fig_p013_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: System-I answer prompt. F Evaluation Details All frameworks are evaluated using the Hugging￾Face implementation (https://github.com/hug gingface/evaluate) of the metrics of the original work by Rajpurkar et al. (2018). G Comparing RAG Methods In Figures 9, 10, and 11, we contrast vanilla RAG and GraphAnchor with our method to show where each method struggles and where we excel. Firstly, standard RAG pulls … view at source ↗
Figure 8
Figure 8. Figure 8: System-II answer prompt. of GraphRAG approaches, where failure to under￾stand relationships can lead to incorrect answers. Finally, in our method, we observe a better cov￾erage of information coming from two retrievals, based on the two generated triples. Using this, the SLM realizes that while Colorado initially played in the Big 12 Conference, they later joined Pac-12. This overrules the initial guess (B… view at source ↗
Figure 7
Figure 7. Figure 7: Triple generation prompt. Prompt For System-II Answering You are a question answering assistant. You are given a question, an initial guess, supporting evidence for that guess (as knowledge graph triples) and retrieved context related to the question. Think about everything step-by-step, by considering all of the information and determining where the flaws are. Provide a clear, structured explanation of yo… view at source ↗
Figure 9
Figure 9. Figure 9: Demonstrating Standard RAG on an example question. [PITH_FULL_IMAGE:figures/full_fig_p016_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: Demonstrating GraphAnchor on an example question. [PITH_FULL_IMAGE:figures/full_fig_p017_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: Demonstrating our method on an example question. [PITH_FULL_IMAGE:figures/full_fig_p018_11.png] view at source ↗
read the original abstract

Recently, there has been increased interest in Small Language Models (SLMs), which are fast, show good performance, and have lower hardware demands than large language models (LLMs). However, SLMs hallucinate more frequently than LLMs, impacting their ability to solve complex multi-step reasoning problems as early mistakes cascade to the final response. To address this, existing works think-first followed by iterative retrieval to reduce hallucination. We argue that the think-first strategy is not always necessary as we find that: (i) SLMs are often accurately confident in their initial answer and, (ii) hallucinations can actually be beneficial for honing in on the true answer. As such, we position our work as an inversion of this strategy, i.e., answer first-reason later. We propose a cognitively-inspired framework where the model is first allowed to quickly answer the question (System-I (zero-shot)) and then resorts to deeper thinking (System-II) based on evidence retrieved from a knowledge source using the initial hypothesis. By combining System-I and System-II style thinking, we show that our method can outperform prior work that takes the traditional think-first route on various multi-step question-answering benchmarks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 1 minor

Summary. The manuscript proposes an inversion of the standard think-first pipeline for multi-hop QA with small language models (SLMs). It claims that SLMs are often accurately confident in zero-shot initial answers and that hallucinations can usefully guide retrieval; the method therefore generates an answer-first hypothesis (System-I), retrieves evidence conditioned on that hypothesis, and performs deeper reasoning (System-II) to produce the final answer. The central empirical claim is that this answer-first-reason-later approach outperforms prior think-first baselines on various multi-step question-answering benchmarks.

Significance. If the reported outperformance is robust, the work offers a practically useful alternative to chain-of-thought and iterative retrieval pipelines for resource-limited SLMs. It supplies an empirical demonstration that an initial (possibly hallucinated) hypothesis can improve downstream retrieval and correction, which could reduce reliance on expensive multi-step prompting. No machine-checked proofs or parameter-free derivations are present; the contribution rests on the benchmark comparisons.

minor comments (1)
  1. The abstract states the high-level motivation and outperformance claim but supplies no benchmark names, numeric deltas, ablation controls, or error analysis. If these details appear only in later sections, they should be summarized in the abstract for immediate assessment of the central claim.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their summary and assessment of our work. The referee accurately captures the proposed inversion of the standard pipeline and the central empirical claim. We note that the report lists no specific major comments, so we have no point-by-point responses to provide. We remain available to address any additional questions or clarifications the referee may have.

Circularity Check

0 steps flagged

No significant circularity; empirical claim independent of inputs

full rationale

The paper advances an empirical inversion of think-first pipelines for SLMs on multi-hop QA, asserting that an initial zero-shot answer (System-I) followed by retrieval-based correction (System-II) yields better benchmark results than prior think-first baselines. No equations, fitted parameters, or self-defined quantities appear. The central claim is framed as an experimental finding resting on comparisons to external prior work, with no load-bearing step that reduces by construction to a self-citation chain, ansatz, or renamed input. The provided abstract and description contain no self-referential derivations that would trigger any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The work is entirely empirical and draws on cognitive-science terminology without introducing mathematical free parameters, new physical entities, or formal axioms beyond standard NLP assumptions.

axioms (2)
  • domain assumption SLMs hallucinate more frequently than LLMs, impacting multi-step reasoning
    Background premise stated in the opening of the abstract.
  • ad hoc to paper The think-first strategy is not always necessary
    Central argumentative premise used to justify the answer-first inversion.

pith-pipeline@v0.9.1-grok · 5747 in / 1235 out tokens · 39199 ms · 2026-06-29T18:25:25.824135+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

47 extracted references · 6 canonical work pages · 3 internal anchors

  1. [1]

    InFirst Conference on Language Modeling

    V-STar: Training verifiers for self-taught rea- soners. InFirst Conference on Language Modeling. Lei Huang, Weijiang Yu, Weitao Ma, Weihong Zhong, Zhangyin Feng, Haotian Wang, Qianglong Chen, Weihua Peng, Xiaocheng Feng, Bing Qin, and Ting Liu. 2025. A survey on hallucination in large lan- guage models: Principles, taxonomy, challenges, and open questions...

  2. [2]

    Guiding reasoning in small language models with llm assistance

    Prunerag: Confidence-guided query decompo- sition trees for efficient retrieval-augmented gener- ation. InProceedings of the ACM Web Conference 2026, WWW ’26, page 1923–1934, New York, NY , USA. Association for Computing Machinery. Daniel Kahneman. 2011. Thinking, fast and slow.Far- rar, Straus and Giroux. Aditya K. Kamath, Ramya Prabhu, Jayashree Mohan, ...

  3. [3]

    Beibin Li, Yi Zhang, Sébastien Bubeck, Jeevan Pathuri, and Ishai Menache

    Curran Associates, Inc. Beibin Li, Yi Zhang, Sébastien Bubeck, Jeevan Pathuri, and Ishai Menache. 2024a. Small language models for application interactions: A case study.CoRR, abs/2405.20347. Junyi Li, Jie Chen, Ruiyang Ren, Xiaoxue Cheng, Xin Zhao, Jian-Yun Nie, and Ji-Rong Wen. 2024b. The dawn after the dark: An empirical study on factuality hallucinati...

  4. [4]

    InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 4554–4570, Bangkok, Thailand

    Factual confidence of LLMs: on reliability and robustness of current estimators. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 4554–4570, Bangkok, Thailand. Association for Computational Linguistics. Humza Naveed, Asad Ullah Khan, Shi Qiu, Muhammad Saqib, Saeed Anwar, Muhammad Usma...

  5. [5]

    InFindings of the Association for Computational Lin- guistics: EMNLP 2025, pages 13028–13045, Suzhou, China

    CoAT: Chain-of-associated-thoughts frame- work for enhancing large language models reasoning. InFindings of the Association for Computational Lin- guistics: EMNLP 2025, pages 13028–13045, Suzhou, China. Association for Computational Linguistics. 10 Boci Peng, Yun Zhu, Yongchao Liu, Xiaohe Bo, Haizhou Shi, Chuntao Hong, Yan Zhang, and Siliang Tang. 2025. G...

  6. [6]

    InProceedings of the 2025 Confer- ence on Empirical Methods in Natural Language Processing, pages 32600–32650, Suzhou, China

    ThinkSLM: Towards reasoning in small lan- guage models. InProceedings of the 2025 Confer- ence on Empirical Methods in Natural Language Processing, pages 32600–32650, Suzhou, China. As- sociation for Computational Linguistics. Chris Alexiuk Tanay Varshney, Annie Surla. 2025. An Easy Introduction to LLM Reasoning, AI Agents, and Test Time Scaling | NVIDIA ...

  7. [7]

    Kimi K2: Open Agentic Intelligence

    Kimi k2: Open agentic intelligence.arXiv preprint arXiv:2507.20534. Katherine Tian, Eric Mitchell, Huaxiu Yao, Christo- pher D Manning, and Chelsea Finn. 2024. Fine- tuning language models for factuality. InThe Twelfth International Conference on Learning Representa- tions. Harsh Trivedi, Niranjan Balasubramanian, Tushar Khot, and Ashish Sabharwal. 2022. ...

  8. [8]

    InProceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V .2, KDD ’25, page 6173–6183, New York, NY , USA

    A survey on small language models in the era of large language models: Architecture, capabili- ties, and trustworthiness. InProceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V .2, KDD ’25, page 6173–6183, New York, NY , USA. Association for Computing Machinery. Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei ...

  9. [9]

    Qwen3 Technical Report

    Qwen3 technical report.arXiv preprint arXiv:2505.09388. Haiyang Yang, Ziv Carmon, Dan Ariely, and Michael I Norton. 2019. The feeling of not knowing it all. Journal of Consumer Psychology, 29(3):455–462. Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William Cohen, Ruslan Salakhutdinov, and Christo- pher D. Manning. 2018. HotpotQA: A dataset for div...

  10. [10]

    A Survey on Test-Time Scaling in Large Language Models: What, How, Where, and How Well?

    Tree of thoughts: Deliberate problem solving with large language models. InAdvances in Neural Information Processing Systems, volume 36, pages 11809–11822. Curran Associates, Inc. Chuanyue Yu, Kuo Zhao, Yuhan Li, Heng Chang, Mingjian Feng, Xiangzhe Jiang, Yufei Sun, Jia Li, Yuzhi Zhang, Jianxin Li, and 1 others. 2026. Graphrag-r1: Graph retrieval-augmente...

  11. [11]

    **Identify the Assassin’s Creed Games**: List all the major Assassin’s Creed games released so far

  12. [12]

    **Research Each Game’s Setting**: For each game on your list, find out where the game takes place

  13. [13]

    **Identify the Game Set in London**: From the research done in step 2, identify which game is set in the city of London

  14. [14]

    Assassin’s Creed IV: Black Flag

    **Provide the Answer**: Once you have iden- tified the correct game, provide the answer to the question. Following these steps will lead you to the correct answer. Figure 4: Step-by-step reasoning for Fig. 1 question. Answer First The major Assassins Creed game set in the city of London is "Assassin’s Creed IV: Black Flag" with a significant side mission ...

  15. [15]

    The idea is simple: given a question, retrieve the top-k documents from a knowledge base that are relevant to it

    Standard RAG(Lewis et al., 2020): This is the original RAG method that is used as a fixed baseline in any related study. The idea is simple: given a question, retrieve the top-k documents from a knowledge base that are relevant to it. These documents are included in the prompt for the model (LLM/SLM) as grounding context for the question. While this showe...

  16. [16]

    The idea behind doing this was to create a knowl- edge graph that broadly covers as much infor- mation as possible

    HyperGraphRAG(Luo et al., 2025b): In- spired by GraphRAG (Peng et al., 2025), this method constructs ahypergraph, i.e., a graph where edges connect multiple vertices. The idea behind doing this was to create a knowl- edge graph that broadly covers as much infor- mation as possible. However, in our experi- ments, we noticed that this method is incred- ibly...

  17. [17]

    GraphAnchor(Liu et al., 2026): Another variant of GraphRAG. Here, the framework uses anevolvingknowledge-graph, i.e., at each step, a small, focused knowledge-graph, based on the retrieved documents, is devel- oped, which keeps track of the entities and relationships in them. This graph is used to augment the retrieved documents and provide a relational u...

  18. [18]

    PruneRAG(Jiao et al., 2026): Grounded in tree-based reasoning(Yao et al., 2023), i.e., exploring multiple search spaces to determine an answer, PruneRAG performs three opera- tions: (i) query decomposition and tree con- 13 struction, based on a model’s reasoning us- ing retrieved context; (ii) backtracking on the branches to aggregate evidence, (iii) toke...

  19. [19]

    opti- mal

    RT-RAG(Reasoning Tree Guided RAG) (Shi et al., 2026): Similar to PruneRAG, RT-RAG also performs tree-style reasoning. It gener- ates multiple candidate trees, selects the “opti- mal” one based on tree statistics such as depth and number of nodes, and, finally, gathers evi- dence based on the optimal tree to provide an answer

  20. [20]

    cognitive outline

    PAGER(Li et al., 2026): This method first creates a “cognitive outline” (called “page”) of what pieces of information a model needs to answer a question. The page has “slots” which are placeholders for potentially re- quired data. These slots are progressively filled with iterative retrieval. Finally, once the entire page has been populated, the model pro...

  21. [21]

    Search-o1(Li et al., 2025a): Inspired by Agentic RAG (Singh et al., 2025), i.e., let- ting an LLM-agent dictate what to retrieve instead of providing it beforehand with top-k documents. Search-o1 differs from the others in two ways: (i) performingwebsearches to retrieve background data, instead of using a pre-defined knowledge index, and (ii) includ- ing ...

  22. [22]

    Provide ONLY the specific answer

  23. [23]

    The answer is

    Do not include introductory phrases (e.g., "The answer is..."), explanations, or context

  24. [24]

    The answer must be wrapped in <answer> tags inside an <output> block

  25. [25]

    Output ONLY the <output> block. ### Example: <input> Question: What is the capital of France? </input> <output> <answer>Paris</answer> </output> ### Task: Process the following input and provide the answer within an <output> block. Figure 6: System-I answer prompt. F Evaluation Details All frameworks are evaluated using the Hugging- Face implementation (h...

  26. [26]

    Format: Each triple must be enclosed in <triple> tags using the structure: <triple>Subject | predicate_link | Object</triple>

  27. [27]

    X’s lead singer

    Logical Depth: Do not skip steps. If the question involves a specific role or relationship (e.g., "X’s lead singer" or "Y’s director"), you MUST identify that specific individual as a separate node before linking them to the final answer

  28. [28]

    Chain of Reasoning: The sequence must form a step-by-step path where the Object of one triple leads to the Subject of the next

  29. [29]

    Predicate Style: Use concise, lowercase, snake_case for predicates

  30. [30]

    Do not include introductory text or explanations

    Strict Output: Provide ONLY the <output> block. Do not include introductory text or explanations. ### Examples: <input> Question: Where was the lead singer of the band Queen born? Answer: Stone Town, Zanzibar </input> <output> <triple>Queen | has_lead_singer | Freddie Mercury</triple> <triple>Freddie Mercury | born_in | Stone Town, Zanzibar</triple> </out...

  31. [35]

    · · · To answer this question, let’s break down the information provided: 1.Justin Bannan played college football at Colorado

    The Carroll Fighting Saint football program represents Carroll College of Helena, Montana in college football ... · · · To answer this question, let’s break down the information provided: 1.Justin Bannan played college football at Colorado

  32. [36]

    From the given information, we can infer that the current team representing Colorado is part of the Big 12 Conference, as Colorado is a member of the Big 12 Conference

    We need to find out which conference the current team representing Colorado is a member of. From the given information, we can infer that the current team representing Colorado is part of the Big 12 Conference, as Colorado is a member of the Big 12 Conference. Therefore, the answer is the Big 12 Conference. <final_answer> Big 12 Conference </final_answer>...

  33. [38]

    Justin Harper (born February 24, 1985 in Catawba, North Carolina) is an American college football coach and former professional Canadian football wide receiver

  34. [39]

    Timothy Joseph "Tim" Landis (born July 13, 1964) is an American football coach who is currently quarterbacks coach and special teams coordinator at Lycoming College

  35. [41]

    · · · <think> Based on the question, we need to identify the current conference of the team for which Justin Bannan played college football

    The Carroll Fighting Saint football program represents Carroll College of Helena, Montana in college football ... · · · <think> Based on the question, we need to identify the current conference of the team for which Justin Bannan played college football. From the given documents, we can find the following information: -Justin Bannan played college footbal...

  36. [42]

    He was drafted by the Buffalo Bills in the fifth round of the 2002 NFL Draft

    Justin Lewis Bannan (born April 18, 1979) is a former American football defensive tackle. He was drafted by the Buffalo Bills in the fifth round of the 2002 NFL Draft. He played college football at Colorado

  37. [43]

    Benjamin "Bennie" Oosterbaan (February 4, 1906 – October 25, 1990)

  38. [44]

    Timothy Joseph "Tim" Landis

  39. [45]

    Justin Harper (born February 24, 1985 in Catawba, North Carolina) is

  40. [46]

    Judson Albert "Jud" Timm (August 28, 1906 – December 23, 1994)

  41. [47]

    The Trenton Lightning were an indoor

  42. [48]

    The 2013 Texas Longhorns football team

  43. [49]

    It was also the final season as members the Big 12 Conference in the North Division for Colorado, before joining the Pac-12 Conference for the 2011 season

    Colorado Buffaloes ... It was also the final season as members the Big 12 Conference in the North Division for Colorado, before joining the Pac-12 Conference for the 2011 season

  44. [50]

    Roy Randolph Roundtree

  45. [51]

    · · · Let’s break down the information step-by-step:

    The 2015 Wisconsin Badgers football ... · · · Let’s break down the information step-by-step:

  46. [52]

    From the provided information, we know that Justin Bannan played college football for the Colorado Buffaloes

    **Identify the Team**: The question asks about the team for which Justin Bannan played college football. From the provided information, we know that Justin Bannan played college football for the Colorado Buffaloes

  47. [53]

    From the provided information, we know that the Colorado Buffaloes were members of the Big 12 Conference until the 2011 season when they joined the Pac-12 Conference

    **Determine the Current Conference**: The question asks about the current conference of the team for which Justin Bannan played college football. From the provided information, we know that the Colorado Buffaloes were members of the Big 12 Conference until the 2011 season when they joined the Pac-12 Conference. Therefore, the current conference of the tea...