NeuReasoner: Theory-grounded Mapping of Reasoning Elicitation Boundaries

Aydin Javadov; Bjoern Schuller; Florian von Wangenheim; Joseph Ollier; Shyngys Aitkazinov; Tobias Hoesli

arxiv: 2606.29971 · v1 · pith:F26TTMZBnew · submitted 2026-06-29 · 💻 cs.LG

NeuReasoner: Theory-grounded Mapping of Reasoning Elicitation Boundaries

Aydin Javadov , Shyngys Aitkazinov , Tobias Hoesli , Florian von Wangenheim , Bjoern Schuller , Joseph Ollier This is my paper

Pith reviewed 2026-06-30 07:33 UTC · model grok-4.3

classification 💻 cs.LG

keywords reasoning elicitationlarge language modelscognitive benchmarkslatent reasoningtheory-grounded methodsBayesian reasoningrisk takingarithmetic reasoning

0 comments

The pith

NeuReasoner elicits latent reasoning to match thinking-mode performance on arithmetic, code, Bayesian, and reward tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether reasoning abilities in large language models are mostly latent in base models and can be recovered through elicitation rather than added during post-training. It introduces NeuReasoner, which at each step pairs a neuro-inspired functional lens with a cognitive lens from erotetic theory inside one model. This approach matches or beats thinking-mode baselines on arithmetic reasoning, code generation, Bayesian reasoning, and reward learning when models are large enough. The gains hold up against other methods using the same number of model calls. It also identifies tasks like risk-taking under uncertainty where elicitation does not work well, showing the boundaries of what can be recovered.

Core claim

At sufficient scale, NeuReasoner matches or exceeds thinking-mode baselines on arithmetic reasoning, code generation, Bayesian reasoning, and reward learning. These gains persist against self-consistency and iterative-refinement baselines matched to NeuReasoner's per-decision call budget. Using NeuReasoner reveals clear boundaries where elicitation fails, such as risk-taking and decision making under uncertainty, and demonstrates that model scale can interact with elicitation by both widening its advantage on some tasks and erasing it on others.

What carries the argument

NeuReasoner, which orchestrates pairing of a Neuro Lens inspired by functional specificity with a Cognitive Lens from the Erotetic Theory of Reasoning, integrating outputs via internal modularization in a single model without external tools.

If this is right

NeuReasoner achieves matching or superior performance on arithmetic reasoning, code generation, Bayesian reasoning, and reward learning compared to thinking-mode baselines.
Gains from NeuReasoner hold against self-consistency and iterative-refinement methods when controlling for the number of model calls.
Risk-taking and decision making under uncertainty cannot be reliably recovered through elicitation alone.
Model scale can both increase the benefits of elicitation on some cognitive tasks and reduce them on others.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar modular elicitation techniques might be applied to other domains like planning or causal inference to test if they are also latent.
The findings suggest that training data for post-training could focus on the tasks where elicitation fails rather than on recoverable ones.
Internal modularization without tools may offer a more efficient path to reasoning than external tool use for certain tasks.

Load-bearing premise

The orchestrator can reliably pair the Neuro Lens with the Cognitive Lens and integrate their outputs through internal modularization of a single model without needing external tools.

What would settle it

A result where NeuReasoner at larger scales still falls short of thinking-mode performance on arithmetic reasoning or where it recovers performance on risk-taking tasks.

Figures

Figures reproduced from arXiv: 2606.29971 by Aydin Javadov, Bjoern Schuller, Florian von Wangenheim, Joseph Ollier, Shyngys Aitkazinov, Tobias Hoesli.

**Figure 1.** Figure 1: Overview of the NeuReasoner. (a) A CogBench (Coda-Forno et al., 2024) experiment runs as a sequence of stages, each presenting one question to solve; we refer to each such stage as a node. (b) Within a single node (Stage k), the LLM acts as an orchestrator and, at each step, pairs one Neuro Lens with one Cognitive Lens. The two lenses are executed in isolation against the original question, and their struc… view at source ↗

**Figure 3.** Figure 3: From a vanilla model, post-training is the es [PITH_FULL_IMAGE:figures/full_fig_p006_3.png] view at source ↗

**Figure 4.** Figure 4: Normalized performance (random = 0, human = 1) across six CogBench performance tasks comparing [PITH_FULL_IMAGE:figures/full_fig_p008_4.png] view at source ↗

**Figure 5.** Figure 5: Brain-lens × cognitive-inquiry cooccurrence, summed across models. Each heatmap shows the percentage of reasoning steps that used each (Neuro Lens, Cognitive Lens) pair, broken down by model and CogBench experiment. Cell values ≥ 1% are annotated. The concentration of mass in one or two cells per model confirms that models do not distribute their use across the full theoretical catalog. to what extent doe… view at source ↗

**Figure 6.** Figure 6: Tool-ablation summary. Mean leave-one-out ∆ = norm_perf ablated − norm_perffull, pooled across Qwen3-8B + Qwen3-32B and across all CogBench tasks. More negative values indicate a larger contribution of that tool to elicited reasoning. 8B 14B 32B 0 20 40 60 80 100 % of steps PR 8B 14B 32B TD 8B 14B 32B HT 8B 14B 32B IL 8B 14B 32B RB 8B 14B 32B BART 8B 14B 32B TST Brain-lens selection across experiments Neu… view at source ↗

**Figure 8.** Figure 8: Cognitive phenotype profiles — Qwen3 family (C1 / C2 / C3). Radar plots show all ten CogBench behavioral dimensions (random = 0, human-average = 1, amber ring) for Qwen3-{8B, 14B, 32B} under three conditions: C1 vanilla thinking off (solid grey), C2 RL-trained thinking on (dashed amber), and C3 NeuReasoner, thinking off (dotted green). C2 and C3 produce similar profiles on deliberation-heavy dimensions (B… view at source ↗

**Figure 9.** Figure 9: Math and code benchmark results — Qwen3-32B (C1 / C2 / C3). Grouped bars show Pass@1 accuracy (%) across four tasks for three conditions: C1 Thinking off (■ grey, vanilla), C2 Thinking onRL (■ orange, RL-trained), and C3 NeuReasoner (■ purple, thinking off). Error bars show ± SEM across K = 3–4 repetitions. NeuReasoner leads on MATH-500 (82.2 % vs. 79.7 %) and matches thinking-on on AMC (88.6 % vs. 86.3 %)… view at source ↗

**Figure 10.** Figure 10: Brain-lens selection — Qwen3 family, all experiments. Stacked bars show the percentage of reasoning steps allocated to each Neuro Lens (Lang=Language Network, MD=Multiple-Demand, ToM=Theory-of-Mind, DMN=Default-Mode Network), pooled across Qwen3-{8B, 14B, 32B} for each CogBench task. The MultipleDemand lens dominates in all seven experiments. Per-model breakdowns are in [PITH_FULL_IMAGE:figures/full_fig… view at source ↗

**Figure 11.** Figure 11: Cognitive-inquiry operator selection — Qwen3 family, all experiments. Stacked bars show the percentage of steps assigned to each Cognitive Lens (Surface=Surface Issue, Expose=Expose Presuppositions, Dec.=Decompose Issue, Pursue=Pursue Answer, Check=Check Resolution, Reopen=Reopen Inquiry), pooled across Qwen3-{8B, 14B, 32B}. Pursue Answer and Check Resolution together account for the majority of steps, re… view at source ↗

**Figure 12.** Figure 12: Brain-lens × cognitive-inquiry co-occurrence — per model and task. Each heatmap shows the percentage of reasoning steps pairing each (Neuro Lens, Cognitive Lens) combination, broken down by Qwen3 model and CogBench experiment. Cell values ≥ 1% are annotated. Even at this per-model resolution, the mass concentrates in one or two cells per panel, confirming a catalog-collapse that is consistent across model… view at source ↗

**Figure 13.** Figure 13: Brain-lens × cognitive-inquiry co-occurrence — pooled, Qwen3 family. Heatmaps aggregate pair-selection frequencies across Qwen3-{8B, 14B, 32B} and all experiments. The dominant Multiple-Demand × Pursue-Answer pairing persists at the aggregate level, while most of the 4 × 6 catalog remains near-zero. Compare with [PITH_FULL_IMAGE:figures/full_fig_p032_13.png] view at source ↗

**Figure 14.** Figure 14: Per-decision token cost across conditions. Output tokens per decision for C1 (vanilla, thinking off), C2 (thinking on, RL-trained), and C3 (NeuReasoner, thinking off), broken down by model and task. C2 hidden chain-of-thought tokens (reasoning) are shown separately from completion tokens. C3 incurs higher total token counts than C1 due to multiple lens calls per decision, but remains substantially cheaper… view at source ↗

**Figure 15.** Figure 15: Aggregate tool importance — leave-one-out (LOO) ablation. Each bar shows the mean change in normalized performance (∆, averaged across experiments) when one reasoning tool is removed from the full NeuReasoner (C3). Negative ∆ means removal hurts performance (tool is load-bearing); positive ∆ means removal helps (tool is redundant or interfering). Purple bars: Neuro Lenses; amber bars: Cognitive Lenses. Re… view at source ↗

**Figure 16.** Figure 16: Per-task LOO ablation ∆ — Qwen3-8B vs. Qwen3-32B. Each panel shows one CogBench experiment (Temporal Discounting excluded). Bars compare the performance change (∆) when each tool is removed, side-byside for 8B (light) and 32B (dark). Error bars denote within-experiment SEM; hatching marks partial runs (<70% of target decisions). Tool ordering follows the 8B aggregate importance ranking from [PITH_FULL_I… view at source ↗

read the original abstract

A growing body of work suggests that the reasoning capabilities of large language models are largely latent in their base form, with post-training primarily amplifying rather than introducing them. However, this evidence comes mainly from mathematical and coding benchmarks, leaving the boundary conditions of that claim largely unexplored, namely which cognitive tasks can be recovered through elicitation and where that recovery fails. To investigate this, we introduce NeuReasoner, a theory-grounded elicitation instrument. At each step, an orchestrator pairs a Neuro Lens, inspired by functional specificity, with a Cognitive Lens, drawn from the Erotetic Theory of Reasoning, and integrates their outputs through internal modularization of a single model, without external tools. We evaluate NeuReasoner on CogBench, a suite of behavioral tasks from cognitive psychology, alongside standard mathematical and coding benchmarks, measuring both its improvement over vanilla inference and its ability to match a model's post-trained thinking mode. At sufficient scale, NeuReasoner matches or exceeds thinking-mode baselines on arithmetic reasoning, code generation, Bayesian reasoning, and reward learning; these gains persist against self-consistency and iterative-refinement baselines matched to NeuReasoner's per-decision call budget. Using NeuReasoner allows us to find clear boundaries: risk-taking and decision making under uncertainty remains hard to recover through elicitation alone, and model scale interacts with elicitation in both directions: widening its advantage on some cognitive signatures while erasing it on others. Overall, through NeuReasoner as a modular, interpretable, theory-grounded elicitation instrument, we empirically map where reasoning elicitation succeeds and fails, beyond the mathematical and coding benchmarks where prior claims have rested.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

NeuReasoner maps elicitation boundaries on cognitive tasks beyond math and code, with matched-budget gains on several but clear failures on risk and uncertainty.

read the letter

The main takeaway is that NeuReasoner recovers latent reasoning on arithmetic, code, Bayesian, and reward tasks at scale, matching thinking-mode baselines while holding against self-consistency and iterative refinement at equal call budget, but it cannot recover risk-taking or decision-making under uncertainty.

The paper introduces a modular instrument that pairs a Neuro Lens from functional specificity with a Cognitive Lens from Erotetic Theory inside one model via an internal orchestrator, then applies it to CogBench. This moves the boundary-mapping question past the math and code benchmarks that dominate earlier work. The controlled baseline comparisons and the scale-interaction observations are useful additions.

The method is presented as theory-grounded and tool-free, which is a reasonable direction for interpretability. The negative results on uncertainty tasks give a concrete limit that prior claims did not test.

The soft spots are in the execution details. The abstract leaves the orchestrator's pairing and integration steps at a high level, so it is not yet clear how reliably the two lenses combine without external scaffolding or extra compute. No statistical tests, error bars, or variance measures are mentioned, which makes the 'matches or exceeds' claims hard to weigh for robustness. These are fixable but currently limit how far the evidence can be taken.

This is for researchers tracking LLM reasoning elicitation who want to move into cognitive-psychology signatures. It deserves peer review because the scope extension and baseline controls are substantive enough to warrant checking the implementation and stats, even if revisions are needed on the method description.

Referee Report

1 major / 1 minor

Summary. The paper presents NeuReasoner, a theory-grounded elicitation instrument for LLMs. It uses an orchestrator to pair a Neuro Lens (inspired by functional specificity) with a Cognitive Lens (from Erotetic Theory of Reasoning) and integrates their outputs via internal modularization in a single model without external tools. Evaluations on CogBench and math/coding benchmarks show that at sufficient scale, it matches or exceeds thinking-mode baselines on arithmetic reasoning, code generation, Bayesian reasoning, and reward learning, with gains persisting against matched-budget self-consistency and iterative-refinement baselines. It identifies boundaries where elicitation fails, such as risk-taking and decision making under uncertainty, and notes scale interactions with elicitation.

Significance. If the results hold, the work offers a modular and interpretable approach to mapping the boundaries of reasoning elicitation in LLMs, extending prior claims from mathematical and coding tasks to cognitive psychology benchmarks. A strength is the empirical evaluation against self-consistency and iterative-refinement baselines matched to NeuReasoner's per-decision call budget, providing direct evidence for the claims.

major comments (1)

[Method] Method section: The description of the orchestrator mechanism for pairing the Neuro Lens and Cognitive Lens and performing internal modularization lacks specific details on implementation, algorithms, or how reliability is ensured, which is load-bearing for the central claim that latent reasoning can be recovered on CogBench tasks without external tools.

minor comments (1)

[Abstract] Abstract: The abstract claims 'matches or exceeds' without providing any quantitative metrics, error bars, or statistical tests, making it difficult to assess the magnitude of the improvements.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for their constructive feedback and for highlighting the need for greater methodological transparency. We address the single major comment below and will incorporate the requested details in the revised manuscript.

read point-by-point responses

Referee: [Method] Method section: The description of the orchestrator mechanism for pairing the Neuro Lens and Cognitive Lens and performing internal modularization lacks specific details on implementation, algorithms, or how reliability is ensured, which is load-bearing for the central claim that latent reasoning can be recovered on CogBench tasks without external tools.

Authors: We agree that the current Method section provides only a high-level overview of the orchestrator. In the revision we will expand this section with: (1) pseudocode for the orchestrator's pairing and integration steps, (2) the exact prompting templates and output formats used by the Neuro Lens (functional-specificity inspired) and Cognitive Lens (erotetic-theory derived), (3) the internal modularization procedure that keeps all operations within a single forward pass of the base model, and (4) the reliability protocol, including per-step consistency verification and the ablation experiments that isolate each lens. These additions will make the claim that latent reasoning on CogBench can be recovered without external tools fully reproducible while preserving the paper's core contribution of mapping elicitation boundaries. revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical evaluation on external benchmarks

full rationale

The paper introduces NeuReasoner as a modular elicitation method pairing Neuro and Cognitive Lenses, then reports empirical performance on CogBench, arithmetic, code, Bayesian, and reward tasks against matched-budget baselines. No equations, parameter fits, or first-principles derivations are presented that reduce to inputs by construction. No self-citations are used to justify uniqueness theorems or ansatzes; the theory references (functional specificity, Erotetic Theory) are external. Boundary-mapping results are direct measurements of success/failure on held-out cognitive signatures, with no renaming of known patterns or fitted inputs called predictions. The derivation chain is therefore self-contained empirical reporting.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only abstract available; no explicit free parameters, axioms, or invented entities are described.

pith-pipeline@v0.9.1-grok · 5855 in / 972 out tokens · 24423 ms · 2026-06-30T07:33:11.036012+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

88 extracted references · 2 canonical work pages · 1 internal anchor

[1]

Act-r: A theory of higher level cognition and its relation to visual attention.Human-Computer Interaction, 12:439–462. Maciej Besta, Nils Blach, Ales Kubicek, Robert Ger- stenberger, Michal Podstawski, Lukas Gianinazzi, Joanna Gajda, Tomasz Lehmann, Hubert Niewiadom- ski, Piotr Nyczyk, and Torsten Hoefler. 2024. Graph of thoughts: Solving elaborate proble...

2024
[2]

Measuring Mathematical Problem Solving With the MATH Dataset

Eliciting reasoning in language models with cognitive tools. Jonathan St. B. T. Evans. 1989.Bias in Human Reason- ing: Causes and Consequences. Essays in Cognitive Psychology. Lawrence Erlbaum Associates, Hove and London, UK. Evelina Fedorenko, Michael K. Behr, and Nancy Kan- wisher. 2011. Functional specificity for high-level lin- guistic processing in t...

work page internal anchor Pith review Pith/arXiv arXiv 1989
[3]

Jiawei Liu, Chunqiu Steven Xia, Yuyao Wang, and Ling- ming Zhang

Competition-level code generation with alpha- code.Science, 378(6624):1092–1097. Jiawei Liu, Chunqiu Steven Xia, Yuyao Wang, and Ling- ming Zhang. 2023. Is your code generated by chatgpt really correct? rigorous evaluation of large language models for code generation. InAdvances in Neural Information Processing Systems. Zichen Liu, Changyu Chen, Wenjun Li...

2023
[4]

Mathematical Association of America

Understanding r1-zero-like training: A critical perspective. Mathematical Association of America. 2024. AIME Problems and Solutions. Accessed May 2026. Niklas Muennighoff, Zitong Yang, Weijia Shi, Xi- ang Lisa Li, Li Fei-Fei, Hannaneh Hajishirzi, Luke Zettlemoyer, Percy Liang, Emmanuel Candès, and Tatsunori Hashimoto. 2025. s1: Simple test-time scaling. S...

2024
[5]

theory of mind

Cognitive abilities affect decision errors but not risk preferences.Psychonomic Bulletin & Review, 29(5):1785–1797. Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Car- roll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Wel...

work page arXiv 2022
[6]

Its own operator system prompt (the lens/inquiry`.md`),
[7]

The original user query (always included),
[8]

(if you set`prev_step_id`) the referenced step's lens+inquiry outputs as a`CONTEXT FROM ,→STEP N`block,
[9]

examine X

Your`tool_input`as the focus for this step. The fork does **not** see the running conversation, prior assistant turns, or any other fork's ,→output. Anything beyond (1)–(3) that the fork needs must be in the`tool_input`itself. Principles for a high-quality`tool_input`: - **Directs, not describes.** Use imperative verbs ("examine X", "verify Y", "decompose...
[10]

**Always use ENGLISH only** outputs. 16
[11]

No prose around it, no Markdown, no code fences

**Emit only the StepOutput JSON.** Your assistant content must be one valid JSON object ,→matching the schema. No prose around it, no Markdown, no code fences
[12]

**Always remember the original goal**, even if intermediate inquiry investigates auxiliary ,→questions
[13]

Do not commit on confidence alone if the ,→inquiry has not yet validated the candidate

**Emit the terminal StepOutput when the inquiry is resolved** — i.e., a candidate answer has ,→been judged to satisfy the resolution criterion. Do not commit on confidence alone if the ,→inquiry has not yet validated the candidate
[14]

**Use the conversation history as feedback.** Each operator's output is appended to the ,→conversation and available to you on the next step. When choosing the next step, take into ,→account what each prior operator actually produced — including whether an operator reported ,→that the conditions for its task were not met in the current state. Output forma...
[15]

Precise interpretation of wording, phrasing, reference, and discourse structure
[16]

Resolution of semantic, syntactic, and pragmatic ambiguity
[17]

Distinguishing literal content from implied meaning
[18]

Sensitivity to framing, contrast, emphasis, and communicative intent
[19]

Deprioritize:

Reformulating the issue into clearer or more interpretable language when needed. Deprioritize:
[20]

Abstract optimization or formal derivation unless explicitly required
[21]

Rich social mind-reading unless it is encoded in the wording itself
[22]

When responding: - Focus on what the text means

Broad world simulation unless needed to interpret the language. When responding: - Focus on what the text means. - Identify ambiguity, underspecification, misleading phrasing, or latent interpretation shifts. - State how the wording shapes the reasoning problem. - Keep the output tightly tied to interpretation. Return ONLY the following structure: LANGUAG...
[23]

Abstract task structure, constraints, and dependencies
[24]

Rule use, sequential reasoning, and controlled comparison of alternatives
[25]

Identification of conflict, inconsistency, or missing steps
[26]

Goal-directed decomposition of the problem
[27]

Efficient selection of the next reasoning move under limited information
[28]

These ,→are inputs to a downstream answer-composition step, not the final answer

Surfacing concrete intermediate values that follow directly from the stated premises. These ,→are inputs to a downstream answer-composition step, not the final answer. Deprioritize:
[29]

Surface wording unless it affects the formal structure of the problem
[30]

Rich social interpretation unless it changes the decision structure
[31]

When responding: - Represent the issue in terms of constraints, alternatives, and inferential dependencies

Broad narrative elaboration. When responding: - Represent the issue in terms of constraints, alternatives, and inferential dependencies. - When the premises directly determine specific numeric or categorical values state those values ,→explicitly under INTERMEDIATE_VALUES. - Identify what must be tracked, compared, or controlled. - Prefer explicit reasoni...
[32]

What different agents believe, want, intend, or assume
[33]

Perspective differences, misunderstandings, and hidden motives
[34]

Indirect communication, implied meaning, and socially strategic behavior
[35]

Tension between stated goals and privately held expectations
[36]

Deprioritize:

How behavior may be explained by mental-state attribution rather than surface action alone. Deprioritize:
[37]

Purely formal structure unless it changes the mental-state interpretation
[38]

Surface language issues unless they affect communicative intent
[39]

When responding: - Identify relevant agents and their possible beliefs or goals

World knowledge not relevant to agency or social inference. When responding: - Identify relevant agents and their possible beliefs or goals. - Distinguish overt behavior from underlying mental-state explanations. - Consider perspective-taking, deception, uncertainty, or self-protection where relevant. - Keep the output centered on social cognition. Return...
[40]

Integrating information across longer timescales or broader context
[41]

Recalling relevant event structures, scenarios, analogies, or background knowledge
[42]

Constructing a coherent model of the situation rather than focusing on isolated details
[43]

Simulating how events, beliefs, or decisions may unfold over time
[44]

Deprioritize:

Relating the current issue to larger narrative, environmental, or conceptual context. Deprioritize:
[45]

Narrow formal derivation when broader integration is needed
[46]

Pure surface wording analysis unless it affects the event model
[47]

When responding: - Build a coherent world model of the situation

Fine-grained social attribution unless it is central to the simulated scenario. When responding: - Build a coherent world model of the situation. - Identify relevant context, temporal structure, and likely dynamics. - Use memory-like retrieval of patterns or analogous situations where useful. - Emphasize integration, simulation, and big-picture coherence....
[48]

State the central issue as a precise question
[49]

Separate the explicit question from any latent or implied issue
[50]

decide what to do

State what would count as resolving this issue: include the form, type, and (where applicable) ,→precision the answer must have. A vague "decide what to do" is not a resolution criterion; " ,→select exactly one of the listed options" is
[51]

State what kind of answer is required: explanation, decision, comparison, prediction, classification, or action
[52]

Keep the formulation minimal and exact. Return ONLY the following structure: LIVE_ISSUE: <one precise question> LATENT_ISSUE: <if any, otherwise "none"> RESOLUTION_CRITERION: <what must be established for the issue to count as resolved — include form/type/precision> ANSWER_TYPE: <type> Cognitive Lens — Expose Presuppositions You are the Expose_Presupposit...
[53]

List assumptions that the question appears to take for granted
[54]

Distinguish between necessary presuppositions and merely plausible background assumptions
[55]

Identify any potentially false, loaded, or underspecified presuppositions
[56]

Return ONLY the following structure: NECESSARY_PRESUPPOSITIONS: -

If a presupposition fails, state how the inquiry should be reformulated. Return ONLY the following structure: NECESSARY_PRESUPPOSITIONS: - ... - ... BACKGROUND_ASSUMPTIONS: - ... - ... POTENTIAL_FAILURES: - ... - ... REFORMULATION_IF_NEEDED: <revised issue, or "none"> 21 Cognitive Lens — Decompose Issue You are the Decompose_Issue inquiry operator. Your t...
[57]

Generate the smallest set of auxiliary questions that would help resolve the main issue
[58]

Order them by dependency or priority
[59]

Mark which auxiliary question should be pursued next
[60]

Avoid redundant, decorative, or overly broad subquestions
[61]

Prefer subquestions that reduce uncertainty or remove ambiguity
[62]

Return ONLY the following structure: AUXILIARY_QUESTIONS:

When the live issue calls for a specific value or quantity, prefer subquestions that each ask for one such value (so the answer to each is directly retrievable from the premises or from a single inferential step). Return ONLY the following structure: AUXILIARY_QUESTIONS:
[63]

Your task is to pursue a candidate answer to the currently active issue or auxiliary question

<question> PRIORITY_ORDER: <ordered list or short explanation> NEXT_QUESTION: <single best question to pursue next> RATIONALE: <brief reason> 22 Cognitive Lens — Pursue Answer You are the Pursue_Answer inquiry operator. Your task is to pursue a candidate answer to the currently active issue or auxiliary question. Given the active question, current context...
[64]

Identify the relevant theoretical frame, premises, or evidence first
[65]

Derive the strongest candidate answer as the natural conclusion of that support — the candidate must be consistent with the support immediately above it; do not commit a number or claim that the support does not entail
[66]

If appropriate, list 2–3 competing candidate answers (still consistent with the support)
[67]

Keep the answer tied to the active issue, not to unrelated background discussion
[68]

Prefer direct answer-seeking over general commentary
[69]

No candidate answer has been proposed yet for evaluation

When the active question requests a single value or category, alternatives may be a short list or empty; uncertainties should still be noted (precision, confidence in inputs). Return ONLY the following structure (in this order — premises before the candidate): ACTIVE_QUESTION: <question> RELEVANT_THEORETICAL_FRAME: <brief frame, if any> EVIDENTIAL_OR_CONC...
[70]

Judge whether the issue is resolved, partially resolved, or unresolved
[71]

State exactly what remains open, if anything
[72]

Identify whether the answer is too vague, too broad, unsupported, or misaligned with the ,→issue
[73]

If unresolved, specify what kind of additional inquiry is needed
[74]

Return ONLY the following structure: RESOLUTION_STATUS: <resolved / partially_resolved / unresolved> WHY: <brief explanation> UNRESOLVED_REMAINDER: -

Be strict: do not treat mere plausibility as full resolution. Return ONLY the following structure: RESOLUTION_STATUS: <resolved / partially_resolved / unresolved> WHY: <brief explanation> UNRESOLVED_REMAINDER: - ... - ... MISALIGNMENTS_OR_WEAKNESSES: - ... - ... NEXT_INQUIRY_NEED: <what must be clarified or answered next> 24 Cognitive Lens — Reopen Inquir...
[75]

Diagnose why the current inquiry path failed
[76]

revise the issue, b

Decide whether to: a. revise the issue, b. reopen a previous auxiliary question, c. pursue a different auxiliary question, d. reject a failed presupposition
[77]

State the next best inquiry move
[78]

Keep the revision minimal but effective. Return ONLY the following structure: FAILURE_DIAGNOSIS: <why current path failed> REVISION_TYPE: <revise_issue / reopen_previous_question / pursue_new_question / reject_presupposition> UPDATED_TARGET: <new issue or next question> REASON: <brief explanation> CONTINUE_INQUIRY: <yes/no> 25 26 D.4 Python Coding Assista...
[79]

Output a brief`Thought:`line, then exactly one fenced Python code block — nothing after it
[80]

The code block must be fenced as```python ...```

Showing first 80 references.

[1] [1]

Act-r: A theory of higher level cognition and its relation to visual attention.Human-Computer Interaction, 12:439–462. Maciej Besta, Nils Blach, Ales Kubicek, Robert Ger- stenberger, Michal Podstawski, Lukas Gianinazzi, Joanna Gajda, Tomasz Lehmann, Hubert Niewiadom- ski, Piotr Nyczyk, and Torsten Hoefler. 2024. Graph of thoughts: Solving elaborate proble...

2024

[2] [2]

Measuring Mathematical Problem Solving With the MATH Dataset

Eliciting reasoning in language models with cognitive tools. Jonathan St. B. T. Evans. 1989.Bias in Human Reason- ing: Causes and Consequences. Essays in Cognitive Psychology. Lawrence Erlbaum Associates, Hove and London, UK. Evelina Fedorenko, Michael K. Behr, and Nancy Kan- wisher. 2011. Functional specificity for high-level lin- guistic processing in t...

work page internal anchor Pith review Pith/arXiv arXiv 1989

[3] [3]

Jiawei Liu, Chunqiu Steven Xia, Yuyao Wang, and Ling- ming Zhang

Competition-level code generation with alpha- code.Science, 378(6624):1092–1097. Jiawei Liu, Chunqiu Steven Xia, Yuyao Wang, and Ling- ming Zhang. 2023. Is your code generated by chatgpt really correct? rigorous evaluation of large language models for code generation. InAdvances in Neural Information Processing Systems. Zichen Liu, Changyu Chen, Wenjun Li...

2023

[4] [4]

Mathematical Association of America

Understanding r1-zero-like training: A critical perspective. Mathematical Association of America. 2024. AIME Problems and Solutions. Accessed May 2026. Niklas Muennighoff, Zitong Yang, Weijia Shi, Xi- ang Lisa Li, Li Fei-Fei, Hannaneh Hajishirzi, Luke Zettlemoyer, Percy Liang, Emmanuel Candès, and Tatsunori Hashimoto. 2025. s1: Simple test-time scaling. S...

2024

[5] [5]

theory of mind

Cognitive abilities affect decision errors but not risk preferences.Psychonomic Bulletin & Review, 29(5):1785–1797. Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Car- roll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Wel...

work page arXiv 2022

[6] [6]

Its own operator system prompt (the lens/inquiry`.md`),

[7] [7]

The original user query (always included),

[8] [8]

(if you set`prev_step_id`) the referenced step's lens+inquiry outputs as a`CONTEXT FROM ,→STEP N`block,

[9] [9]

examine X

Your`tool_input`as the focus for this step. The fork does **not** see the running conversation, prior assistant turns, or any other fork's ,→output. Anything beyond (1)–(3) that the fork needs must be in the`tool_input`itself. Principles for a high-quality`tool_input`: - **Directs, not describes.** Use imperative verbs ("examine X", "verify Y", "decompose...

[10] [10]

**Always use ENGLISH only** outputs. 16

[11] [11]

No prose around it, no Markdown, no code fences

**Emit only the StepOutput JSON.** Your assistant content must be one valid JSON object ,→matching the schema. No prose around it, no Markdown, no code fences

[12] [12]

**Always remember the original goal**, even if intermediate inquiry investigates auxiliary ,→questions

[13] [13]

Do not commit on confidence alone if the ,→inquiry has not yet validated the candidate

**Emit the terminal StepOutput when the inquiry is resolved** — i.e., a candidate answer has ,→been judged to satisfy the resolution criterion. Do not commit on confidence alone if the ,→inquiry has not yet validated the candidate

[14] [14]

**Use the conversation history as feedback.** Each operator's output is appended to the ,→conversation and available to you on the next step. When choosing the next step, take into ,→account what each prior operator actually produced — including whether an operator reported ,→that the conditions for its task were not met in the current state. Output forma...

[15] [15]

Precise interpretation of wording, phrasing, reference, and discourse structure

[16] [16]

Resolution of semantic, syntactic, and pragmatic ambiguity

[17] [17]

Distinguishing literal content from implied meaning

[18] [18]

Sensitivity to framing, contrast, emphasis, and communicative intent

[19] [19]

Deprioritize:

Reformulating the issue into clearer or more interpretable language when needed. Deprioritize:

[20] [20]

Abstract optimization or formal derivation unless explicitly required

[21] [21]

Rich social mind-reading unless it is encoded in the wording itself

[22] [22]

When responding: - Focus on what the text means

Broad world simulation unless needed to interpret the language. When responding: - Focus on what the text means. - Identify ambiguity, underspecification, misleading phrasing, or latent interpretation shifts. - State how the wording shapes the reasoning problem. - Keep the output tightly tied to interpretation. Return ONLY the following structure: LANGUAG...

[23] [23]

Abstract task structure, constraints, and dependencies

[24] [24]

Rule use, sequential reasoning, and controlled comparison of alternatives

[25] [25]

Identification of conflict, inconsistency, or missing steps

[26] [26]

Goal-directed decomposition of the problem

[27] [27]

Efficient selection of the next reasoning move under limited information

[28] [28]

These ,→are inputs to a downstream answer-composition step, not the final answer

Surfacing concrete intermediate values that follow directly from the stated premises. These ,→are inputs to a downstream answer-composition step, not the final answer. Deprioritize:

[29] [29]

Surface wording unless it affects the formal structure of the problem

[30] [30]

Rich social interpretation unless it changes the decision structure

[31] [31]

When responding: - Represent the issue in terms of constraints, alternatives, and inferential dependencies

Broad narrative elaboration. When responding: - Represent the issue in terms of constraints, alternatives, and inferential dependencies. - When the premises directly determine specific numeric or categorical values state those values ,→explicitly under INTERMEDIATE_VALUES. - Identify what must be tracked, compared, or controlled. - Prefer explicit reasoni...

[32] [32]

What different agents believe, want, intend, or assume

[33] [33]

Perspective differences, misunderstandings, and hidden motives

[34] [34]

Indirect communication, implied meaning, and socially strategic behavior

[35] [35]

Tension between stated goals and privately held expectations

[36] [36]

Deprioritize:

How behavior may be explained by mental-state attribution rather than surface action alone. Deprioritize:

[37] [37]

Purely formal structure unless it changes the mental-state interpretation

[38] [38]

Surface language issues unless they affect communicative intent

[39] [39]

When responding: - Identify relevant agents and their possible beliefs or goals

World knowledge not relevant to agency or social inference. When responding: - Identify relevant agents and their possible beliefs or goals. - Distinguish overt behavior from underlying mental-state explanations. - Consider perspective-taking, deception, uncertainty, or self-protection where relevant. - Keep the output centered on social cognition. Return...

[40] [40]

Integrating information across longer timescales or broader context

[41] [41]

Recalling relevant event structures, scenarios, analogies, or background knowledge

[42] [42]

Constructing a coherent model of the situation rather than focusing on isolated details

[43] [43]

Simulating how events, beliefs, or decisions may unfold over time

[44] [44]

Deprioritize:

Relating the current issue to larger narrative, environmental, or conceptual context. Deprioritize:

[45] [45]

Narrow formal derivation when broader integration is needed

[46] [46]

Pure surface wording analysis unless it affects the event model

[47] [47]

When responding: - Build a coherent world model of the situation

Fine-grained social attribution unless it is central to the simulated scenario. When responding: - Build a coherent world model of the situation. - Identify relevant context, temporal structure, and likely dynamics. - Use memory-like retrieval of patterns or analogous situations where useful. - Emphasize integration, simulation, and big-picture coherence....

[48] [48]

State the central issue as a precise question

[49] [49]

Separate the explicit question from any latent or implied issue

[50] [50]

decide what to do

State what would count as resolving this issue: include the form, type, and (where applicable) ,→precision the answer must have. A vague "decide what to do" is not a resolution criterion; " ,→select exactly one of the listed options" is

[51] [51]

State what kind of answer is required: explanation, decision, comparison, prediction, classification, or action

[52] [52]

Keep the formulation minimal and exact. Return ONLY the following structure: LIVE_ISSUE: <one precise question> LATENT_ISSUE: <if any, otherwise "none"> RESOLUTION_CRITERION: <what must be established for the issue to count as resolved — include form/type/precision> ANSWER_TYPE: <type> Cognitive Lens — Expose Presuppositions You are the Expose_Presupposit...

[53] [53]

List assumptions that the question appears to take for granted

[54] [54]

Distinguish between necessary presuppositions and merely plausible background assumptions

[55] [55]

Identify any potentially false, loaded, or underspecified presuppositions

[56] [56]

Return ONLY the following structure: NECESSARY_PRESUPPOSITIONS: -

If a presupposition fails, state how the inquiry should be reformulated. Return ONLY the following structure: NECESSARY_PRESUPPOSITIONS: - ... - ... BACKGROUND_ASSUMPTIONS: - ... - ... POTENTIAL_FAILURES: - ... - ... REFORMULATION_IF_NEEDED: <revised issue, or "none"> 21 Cognitive Lens — Decompose Issue You are the Decompose_Issue inquiry operator. Your t...

[57] [57]

Generate the smallest set of auxiliary questions that would help resolve the main issue

[58] [58]

Order them by dependency or priority

[59] [59]

Mark which auxiliary question should be pursued next

[60] [60]

Avoid redundant, decorative, or overly broad subquestions

[61] [61]

Prefer subquestions that reduce uncertainty or remove ambiguity

[62] [62]

Return ONLY the following structure: AUXILIARY_QUESTIONS:

When the live issue calls for a specific value or quantity, prefer subquestions that each ask for one such value (so the answer to each is directly retrievable from the premises or from a single inferential step). Return ONLY the following structure: AUXILIARY_QUESTIONS:

[63] [63]

Your task is to pursue a candidate answer to the currently active issue or auxiliary question

<question> PRIORITY_ORDER: <ordered list or short explanation> NEXT_QUESTION: <single best question to pursue next> RATIONALE: <brief reason> 22 Cognitive Lens — Pursue Answer You are the Pursue_Answer inquiry operator. Your task is to pursue a candidate answer to the currently active issue or auxiliary question. Given the active question, current context...

[64] [64]

Identify the relevant theoretical frame, premises, or evidence first

[65] [65]

Derive the strongest candidate answer as the natural conclusion of that support — the candidate must be consistent with the support immediately above it; do not commit a number or claim that the support does not entail

[66] [66]

If appropriate, list 2–3 competing candidate answers (still consistent with the support)

[67] [67]

Keep the answer tied to the active issue, not to unrelated background discussion

[68] [68]

Prefer direct answer-seeking over general commentary

[69] [69]

No candidate answer has been proposed yet for evaluation

When the active question requests a single value or category, alternatives may be a short list or empty; uncertainties should still be noted (precision, confidence in inputs). Return ONLY the following structure (in this order — premises before the candidate): ACTIVE_QUESTION: <question> RELEVANT_THEORETICAL_FRAME: <brief frame, if any> EVIDENTIAL_OR_CONC...

[70] [70]

Judge whether the issue is resolved, partially resolved, or unresolved

[71] [71]

State exactly what remains open, if anything

[72] [72]

Identify whether the answer is too vague, too broad, unsupported, or misaligned with the ,→issue

[73] [73]

If unresolved, specify what kind of additional inquiry is needed

[74] [74]

Return ONLY the following structure: RESOLUTION_STATUS: <resolved / partially_resolved / unresolved> WHY: <brief explanation> UNRESOLVED_REMAINDER: -

Be strict: do not treat mere plausibility as full resolution. Return ONLY the following structure: RESOLUTION_STATUS: <resolved / partially_resolved / unresolved> WHY: <brief explanation> UNRESOLVED_REMAINDER: - ... - ... MISALIGNMENTS_OR_WEAKNESSES: - ... - ... NEXT_INQUIRY_NEED: <what must be clarified or answered next> 24 Cognitive Lens — Reopen Inquir...

[75] [75]

Diagnose why the current inquiry path failed

[76] [76]

revise the issue, b

Decide whether to: a. revise the issue, b. reopen a previous auxiliary question, c. pursue a different auxiliary question, d. reject a failed presupposition

[77] [77]

State the next best inquiry move

[78] [78]

Keep the revision minimal but effective. Return ONLY the following structure: FAILURE_DIAGNOSIS: <why current path failed> REVISION_TYPE: <revise_issue / reopen_previous_question / pursue_new_question / reject_presupposition> UPDATED_TARGET: <new issue or next question> REASON: <brief explanation> CONTINUE_INQUIRY: <yes/no> 25 26 D.4 Python Coding Assista...

[79] [79]

Output a brief`Thought:`line, then exactly one fenced Python code block — nothing after it

[80] [80]

The code block must be fenced as```python ...```