Uncovering the Representation Geometry of Minimal Cores in Overcomplete Reasoning Traces

Dinesh Manocha; Sanjoy Chowdhury

arxiv: 2605.14358 · v1 · pith:CCAUQ6O4new · submitted 2026-05-14 · 💻 cs.AI · cs.LG

Pith reviewed 2026-06-30 21:01 UTC · model grok-4.3

keywords: minimal core · overcomplete reasoning traces · chain-of-thought · language models · reasoning compression · predictive necessity · trace geometry

The pith

Minimal cores isolate the necessary steps in language-model reasoning traces: nearly half of the steps can be removed while the original answer is preserved in most cases.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper defines minimal cores as the smallest subsets of steps in a generated chain-of-thought trace that still preserve either the final answer or the full predictive distribution. Across six benchmarks spanning arithmetic, competition mathematics, expert scientific reasoning, and commonsense multi-hop QA, it reports that greedy extraction removes 46 percent of steps on average while keeping the original answer in 86 percent of cases. Necessity is concentrated: the top three steps carry 65 percent of the measured necessity mass on average. Compared with full traces, the resulting minimal cores separate correct from incorrect traces 11 points better, have 34 percent lower estimated intrinsic dimensionality, and retain the answer in 85 percent of cases when transferred across model families. The work matters to readers because it supplies both a practical compression method and theoretical certificates for overcompleteness and sparse support.

Core claim

We define the minimal core as the smallest subset of steps that preserves either the final answer or predictive distribution. Across six deliberative reasoning benchmarks spanning arithmetic, competition mathematics, expert scientific reasoning, and commonsense multi-hop QA, we find substantial overcompleteness: on average, 46% of steps are removable under greedy minimal-core extraction while preserving the original answer in 86% of cases. We also find that predictive support is concentrated: the top three steps account for 65% of measured necessity mass on average. Beyond compression, minimal cores expose a cleaner geometry of reasoning: compared with full traces, they improve correct-incorrect trace separation by 11 points, reduce estimated intrinsic dimensionality by 34%, and transfer across model families with 85% off-diagonal answer retention.

What carries the argument

The minimal core, defined as the smallest subset of reasoning steps that preserves the final answer or predictive distribution under greedy elimination.
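Greedy elimination can be sketched as repeated single-step deletion. The sketch below is a minimal illustration, not the paper's implementation: `is_sufficient` is a hypothetical oracle standing in for "re-query the model with only these steps and check that the answer (or predictive distribution) is preserved", and the toy trace and oracle are invented for the example.

```python
from typing import Callable, List, Sequence


def greedy_minimal_core(
    steps: Sequence[str],
    is_sufficient: Callable[[List[str]], bool],
) -> List[int]:
    """Return indices of a locally irreducible subset of `steps`.

    `is_sufficient` is a hypothetical oracle (an assumption of this sketch):
    it should return True when the given steps still preserve the answer.
    """
    kept = list(range(len(steps)))
    changed = True
    while changed:  # repeat passes until no single deletion succeeds
        changed = False
        for idx in list(kept):
            candidate = [i for i in kept if i != idx]
            if is_sufficient([steps[i] for i in candidate]):
                kept = candidate  # the step was removable, so drop it
                changed = True
    return kept


# Toy trace and toy oracle, both invented for illustration.
trace = [
    "Restate the problem.",
    "3 * 4 = 12",
    "Double-check: 4 * 3 is also 12.",
    "12 + 5 = 17",
    "So the answer is 17.",
]
required = {"3 * 4 = 12", "12 + 5 = 17"}


def toy_oracle(subset: List[str]) -> bool:
    return required.issubset(subset)


core = greedy_minimal_core(trace, toy_oracle)
print([trace[i] for i in core])
```

Because the loop stops only after a full pass with no successful deletion, the result is locally irreducible with respect to the oracle. It is not guaranteed to be the globally smallest sufficient subset.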

If this is right

Greedy extraction removes 46 percent of steps on average while retaining the original answer in 86 percent of cases.
Necessity mass is concentrated: the top three steps account for 65 percent of it on average.
Minimal cores improve correct-incorrect trace separation by 11 points over full traces.
Estimated intrinsic dimensionality drops 34 percent in minimal cores.
Minimal cores transfer across model families with 85 percent off-diagonal answer retention.
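The two headline quantities in this list can be made concrete with a small sketch. The definitions here are assumptions, since the page does not give the paper's exact formulas: the removable fraction is taken as one minus core size over trace length, and necessity concentration as the share of total necessity mass carried by the top-k steps. The scores in the example are invented.

```python
from typing import Sequence


def removable_fraction(num_steps: int, core_size: int) -> float:
    """Fraction of steps deleted by extraction (assumed definition)."""
    if num_steps <= 0:
        raise ValueError("trace must contain at least one step")
    return 1.0 - core_size / num_steps


def top_k_mass(necessity: Sequence[float], k: int = 3) -> float:
    """Share of total necessity mass carried by the k largest scores."""
    total = sum(necessity)
    if total <= 0:
        return 0.0
    return sum(sorted(necessity, reverse=True)[:k]) / total


# Invented per-step necessity scores for a six-step trace.
scores = [1, 8, 1, 6, 2, 2]
print(removable_fraction(num_steps=10, core_size=5))  # half the steps removed
print(top_k_mass(scores, k=3))                        # share held by the top three steps
```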

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If minimal cores capture effective support, then generation-time pruning of non-core steps could reduce token usage with little accuracy loss.
The observed concentration of necessity suggests step-level importance scores could be computed directly from extraction runs for interpretability.
The geometric improvements imply minimal cores may align more closely with the model's internal decision geometry than full traces do.
The existence and irreducibility certificates could be turned into practical audits that flag overcomplete traces in deployed systems.
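Two of these extensions lend themselves to a short sketch: leave-one-out step importance and a local-irreducibility audit. Both functions below are hypothetical illustrations. `answer_prob` and `is_sufficient` are assumed oracles that would wrap model calls in practice, and the toy values are invented.

```python
from typing import Callable, List, Sequence


def leave_one_out_necessity(
    steps: Sequence[str],
    answer_prob: Callable[[List[str]], float],
) -> List[float]:
    """Drop in answer probability when each step is removed on its own."""
    base = answer_prob(list(steps))
    scores = []
    for i in range(len(steps)):
        ablated = [s for j, s in enumerate(steps) if j != i]
        scores.append(max(0.0, base - answer_prob(ablated)))
    return scores


def is_locally_irreducible(
    core: Sequence[str],
    is_sufficient: Callable[[List[str]], bool],
) -> bool:
    """True if `core` is sufficient and no single step can be deleted."""
    if not is_sufficient(list(core)):
        return False
    return all(
        not is_sufficient([s for j, s in enumerate(core) if j != i])
        for i in range(len(core))
    )


# Toy oracles, invented for illustration only.
def toy_prob(subset: List[str]) -> float:
    return 0.125 + 0.5 * ("A" in subset) + 0.25 * ("B" in subset)


def toy_sufficient(subset: List[str]) -> bool:
    return {"A", "B"}.issubset(subset)


necessity = leave_one_out_necessity(["A", "filler", "B"], toy_prob)
print(necessity)
print(is_locally_irreducible(["A", "B"], toy_sufficient))
print(is_locally_irreducible(["A", "filler", "B"], toy_sufficient))
```

An audit of this kind would flag the second trace as overcomplete, since the filler step can be deleted without breaking sufficiency.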

Load-bearing premise

Removing steps while preserving the final answer or predictive distribution means those steps were unnecessary for the model's internal computation.

What would settle it

Running the same model on held-out problems using only the extracted minimal-core steps and observing whether the answer distribution matches the full-trace distribution.
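One way to operationalize "matches" in this test is a distance between sampled answer distributions. The sketch below uses total variation distance as an assumed choice of metric; the page does not say which comparison the paper would use, and the sampled answers are invented.

```python
from collections import Counter
from typing import Dict, Iterable


def empirical_distribution(answers: Iterable[str]) -> Dict[str, float]:
    """Normalize sampled answers into a probability distribution."""
    counts = Counter(answers)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("need at least one sampled answer")
    return {answer: c / total for answer, c in counts.items()}


def total_variation(p: Dict[str, float], q: Dict[str, float]) -> float:
    """Total variation distance: 0 means identical, 1 means disjoint."""
    support = set(p) | set(q)
    return 0.5 * sum(abs(p.get(a, 0.0) - q.get(a, 0.0)) for a in support)


# Invented answer samples from the full trace and from the minimal core.
full_samples = ["17", "17", "17", "18"]
core_samples = ["17", "17", "18", "19"]

tv = total_variation(
    empirical_distribution(full_samples),
    empirical_distribution(core_samples),
)
print(tv)
```

A value near zero on held-out problems would support the premise. A large value would indicate that the deleted steps still shaped the prediction.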

Figures

Figures reproduced from arXiv: 2605.14358 by Dinesh Manocha, Sanjoy Chowdhury.

**Figure 1.** LLM reasoning traces are substantially overcomplete. Generated traces contain a sparse predictive core surrounded by removable reasoning mass. Minimal-core extraction separates answer-supporting steps from redundant elaboration and exposes a compact, transferable representation geometry of reasoning.

**Figure 2.** Compact diagnostics for overcomplete reasoning traces. The panels show: (a) greedy deletion degrades more gracefully than necessity-guided or random pruning beyond the matched budgets in Tab. 2; (b) redundancy mass concentrates at moderate-to-high values for both math and non-math tasks; (c) necessity mass is concentrated in a few steps, above a uniform baseline; (d) minimal-core length grows sublinearly w…

**Figure 3.** Expanded diagnostics for overcomplete reasoning traces. Shaded bands show cross-model …

**Figure 4.** Transfer analysis. Minimal cores transfer strongly off-diagonal and preserve most of …

**Figure 5.** Extraction-method tradeoffs. Greedy deletion lies at the high-fidelity end of the frontier, …

**Figure 6.** Robustness diagnostics across sufficiency criteria, prompting styles, difficulty levels, and …

**Figure 7.** Additional representation diagnostics. Minimal cores improve correctness predictiveness …

**Figure 8.** Qualitative examples of overcomplete reasoning traces, part I. Green steps are retained in the extracted minimal core; red steps are removed by sufficiency-preserving deletion. Removed steps often include verification, caveats, alternative paths, or explanatory scaffolding, while minimal cores retain the decisive operations or factual links needed for the prediction.

**Figure 9.** Qualitative examples of overcomplete reasoning traces, part II. Minimal cores retain the decisive factual links needed for the answer, while removable steps often contain setup, caveats, or explanatory context that does not change the prediction.

**Figure 10.** Qualitative examples of overcomplete reasoning traces, part III. Even short correct traces can contain removable setup or optional checking. The extracted minimal core keeps the operations needed to preserve the answer while deleting behaviorally redundant steps.

**Figure 11.** Additional necessity diagnostics. Panel (a) shows cumulative top-…

Original abstract

Language models often generate long chain-of-thought traces, but it remains unclear how much of this reasoning is necessary for preserving the final prediction. We study this through the lens of overcomplete reasoning traces: generated traces that contain more intermediate steps than are needed to support the model's answer. We define the minimal core as the smallest subset of steps that preserves either the final answer or predictive distribution, and introduce metrics for compression ratio, redundancy mass, step necessity, and necessity concentration. Across six deliberative reasoning benchmarks spanning arithmetic, competition mathematics, expert scientific reasoning, and commonsense multi-hop QA, we find substantial overcompleteness: on average, 46% of steps are removable under greedy minimal-core extraction while preserving the original answer in 86% of cases. We also find that predictive support is concentrated: the top three steps account for 65% of measured necessity mass on average. Beyond compression, minimal cores expose a cleaner geometry of reasoning: compared with full traces, they improve correct-incorrect trace separation by 11 points, reduce estimated intrinsic dimensionality by 34%, and transfer across model families with 85% off-diagonal answer retention. Theoretically, we establish existence of minimal sufficient subsets, local irreducibility guarantees for greedy elimination, and certificates of overcompleteness and sparse necessity. Together, these results suggest that full reasoning traces are often verbose and overcomplete, while minimal cores isolate the effective support underlying language-model predictions.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper supplies new metrics for measuring overcompleteness in reasoning traces and reports consistent numbers across benchmarks, but the geometry claims rest only on output preservation.

The main takeaway is that this work defines a minimal core as the smallest subset of CoT steps that keeps the final answer or distribution unchanged under greedy removal, then measures how much of typical traces can be stripped away. On six benchmarks they report 46% of steps removable on average while preserving the answer in 86% of cases, with necessity concentrated in the top three steps at 65% of the mass.

What is actually new are the concrete quantities: compression ratio, redundancy mass, step necessity, and necessity concentration, plus the existence proofs and local irreducibility guarantees for the extraction procedure. The cross-model transfer result at 85% off-diagonal retention is a straightforward empirical check that stands on its own. These give people working on trace analysis something specific to compute and compare.
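The off-diagonal retention figure can be read as an average over a transfer matrix. The sketch below assumes a layout that the page does not spell out: entry `[i][j]` is the answer-retention rate when a core extracted with model `i` is evaluated by model `j`, so the diagonal is same-model retention. The matrix values are invented.

```python
from typing import List, Sequence


def off_diagonal_mean(matrix: Sequence[Sequence[float]]) -> float:
    """Mean of all entries with source model != target model."""
    n = len(matrix)
    if n < 2 or any(len(row) != n for row in matrix):
        raise ValueError("need a square matrix with at least two models")
    values: List[float] = [
        matrix[i][j] for i in range(n) for j in range(n) if i != j
    ]
    return sum(values) / len(values)


# Invented retention rates for three hypothetical model families.
retention = [
    [1.00, 0.75, 0.75],
    [0.50, 1.00, 1.00],
    [0.75, 0.75, 1.00],
]
print(off_diagonal_mean(retention))
```

Excluding the diagonal matters because same-model retention is high by construction and would inflate a plain matrix average.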

The softer part is the geometry story. The paper states that minimal cores improve correct-incorrect separation by 11 points and cut estimated intrinsic dimensionality by 34%, but both are computed from the output behavior after removal. Nothing in the abstract shows direct internal measurements such as hidden-state interventions or attention ablations, so the claim that these cores expose the "effective support" or a "cleaner geometry of reasoning" stays one step removed from the model's actual computation. The stress-test concern lands here.
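To make "estimated intrinsic dimensionality" concrete, here is a minimal sketch of one common estimator, the participation ratio of the covariance spectrum. This is an assumed stand-in: the page does not say which estimator the paper uses or which embeddings it is applied to. The identity used is that the participation ratio equals the squared trace of the covariance divided by its squared Frobenius norm, so no eigendecomposition is needed.

```python
from typing import List, Sequence


def participation_ratio(points: Sequence[Sequence[float]]) -> float:
    """(sum of eigenvalues)^2 / (sum of squared eigenvalues) of the covariance."""
    n = len(points)
    if n < 2:
        raise ValueError("need at least two points")
    d = len(points[0])
    means = [sum(p[j] for p in points) / n for j in range(d)]
    centered: List[List[float]] = [
        [p[j] - means[j] for j in range(d)] for p in points
    ]
    cov = [
        [sum(r[a] * r[b] for r in centered) / (n - 1) for b in range(d)]
        for a in range(d)
    ]
    trace = sum(cov[a][a] for a in range(d))
    # For a symmetric matrix, the squared Frobenius norm equals
    # the sum of squared eigenvalues.
    frob_sq = sum(cov[a][b] ** 2 for a in range(d) for b in range(d))
    return trace ** 2 / frob_sq if frob_sq > 0 else 0.0


# Invented embeddings: points on a line in 3D, and a symmetric cross in 2D.
line = [(t, 2 * t, 3 * t) for t in range(5)]
cross = [(1, 0), (-1, 0), (0, 1), (0, -1)]
print(participation_ratio(line))   # about 1: variance lies along one direction
print(participation_ratio(cross))  # about 2: variance spread over two directions
```

Whatever estimator is used, it measures the geometry of whichever representation the points come from, which is why the choice of representation bears on the interpretation discussed above.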

This is for interpretability researchers who want to quantify redundancy in deliberative traces. A reader can extract the metrics and test them on their own data even if the internal interpretation needs tightening. It deserves peer review because the empirical patterns and the theoretical certificates are new enough to warrant referee scrutiny, provided the authors add causal checks on the geometry claims.

Referee Report

1 major / 2 minor

Summary. The manuscript claims that language model reasoning traces are often overcomplete, defining minimal cores as the smallest subsets of steps preserving the final answer or predictive distribution under greedy extraction. Across six benchmarks it reports 46% average step removability while preserving answers in 86% of cases, with necessity concentrated (top-3 steps account for 65% of necessity mass), plus geometric gains (11-point improvement in correct-incorrect separation, 34% reduction in intrinsic dimensionality) and 85% cross-model transfer. Theoretical results establish existence of minimal sufficient subsets and local irreducibility of the extraction procedure.

Significance. If the interpretation linking output preservation to internal effective support were substantiated, the work would supply concrete compression metrics and geometric diagnostics for reasoning traces, with implications for interpretability and efficiency. The scale of the empirical survey and the existence/irreducibility theorems would constitute a useful contribution to the study of verbose CoT traces.

major comments (1)

[Abstract] Abstract: The claim that minimal cores 'isolate the effective support underlying language-model predictions' and expose a 'cleaner geometry of reasoning' is not supported by evidence that removed steps are irrelevant to the model's internal forward pass. All metrics (necessity mass, necessity concentration, separation, dimensionality) are computed solely from output invariance under greedy elimination; the manuscript describes no causal interventions on hidden states, attention ablation, or logit-lens probes on intermediate tokens that would confirm internal routing through the retained steps. This gap is load-bearing for the representation-geometry claims in the title and abstract.

minor comments (2)

The abstract states results across six benchmarks without reporting per-benchmark breakdowns, sample sizes, or confidence intervals on the headline percentages (46%, 86%, 65%, 11 points, 34%, 85%), making it difficult to assess robustness or variability.
Notation for the necessity-mass and necessity-concentration metrics is introduced in the abstract but not connected to the precise definition of the greedy extraction procedure or to the theoretical certificates mentioned later.
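The confidence intervals requested in the first minor comment could be produced with a percentile bootstrap over per-trace outcomes. This is a generic sketch, not the paper's analysis: the function name, defaults, and the synthetic 0/1 outcomes are assumptions for illustration.

```python
import random
from typing import Sequence, Tuple


def bootstrap_ci(
    values: Sequence[float],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile bootstrap interval for the mean of `values`."""
    if not values:
        raise ValueError("need at least one observation")
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(values)
    means = sorted(
        sum(rng.choices(values, k=n)) / n for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi


# Synthetic outcomes: 1 means the answer was retained, 0 means it changed.
retained = [1] * 86 + [0] * 14
lo, hi = bootstrap_ci(retained)
print(f"mean={sum(retained) / len(retained):.2f}, 95% CI=({lo:.2f}, {hi:.2f})")
```

Reporting such an interval per benchmark, alongside the sample size, would let readers judge how much the headline averages vary.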

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback. The major comment correctly identifies that our claims about isolating effective support and representation geometry rest on output-based metrics rather than internal causal evidence. We address this below and will revise the manuscript accordingly.

Referee: [Abstract] Abstract: The claim that minimal cores 'isolate the effective support underlying language-model predictions' and expose a 'cleaner geometry of reasoning' is not supported by evidence that removed steps are irrelevant to the model's internal forward pass. All metrics (necessity mass, necessity concentration, separation, dimensionality) are computed solely from output invariance under greedy elimination; the manuscript describes no causal interventions on hidden states, attention ablation, or logit-lens probes on intermediate tokens that would confirm internal routing through the retained steps. This gap is load-bearing for the representation-geometry claims in the title and abstract.

Authors: We agree that the manuscript provides no causal interventions, hidden-state ablations, or logit-lens analyses to establish that removed steps are irrelevant to the internal forward pass. All reported metrics (necessity mass, concentration, separation, and dimensionality reduction) are defined strictly in terms of output invariance under greedy step elimination. The interpretation that minimal cores isolate 'effective support underlying language-model predictions' is therefore an inference from behavioral necessity rather than direct internal evidence. To correct this, we will revise the abstract, title, and introduction to frame the contribution in terms of output-preserving minimal cores and their observable geometric properties in the space of reasoning traces. We will also add an explicit limitations paragraph stating that the work does not claim to identify internal routing or causal support within the model. These changes will be made in the next revision.

Revision: yes

Circularity Check

0 steps flagged

No circularity: operational definitions and empirical measurements are independent of target claims

The paper defines the minimal core explicitly as the smallest subset preserving final answer or predictive distribution under greedy extraction, then computes compression ratio, necessity mass, and geometric metrics directly from that procedure and model outputs. Theoretical results establish existence and local irreducibility via standard set-theoretic arguments without self-referential equations or fitted parameters. No load-bearing step reduces to a self-citation, ansatz smuggled via prior work, or renaming of known results; all quantities are measured from the defined extraction process itself. The interpretation linking output preservation to internal necessity is interpretive rather than a derivation that collapses by construction.
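The set-theoretic argument mentioned here can be sketched in a few lines. The notation below is illustrative and assumed, not taken from the paper: $T$ is the set of steps in a trace and $\mathrm{suff}(S)$ indicates that the subset $S$ preserves the answer.

```latex
% Sketch under stated assumptions; notation is illustrative, not the paper's.
Let $T = \{s_1, \dots, s_n\}$ be the steps of a trace, and let
$\mathrm{suff}(S) \in \{0, 1\}$ indicate that $S \subseteq T$ preserves the answer.
\[
  \mathcal{S} = \{\, S \subseteq T : \mathrm{suff}(S) = 1 \,\}, \qquad
  T \in \mathcal{S} \;\Rightarrow\; \mathcal{S} \neq \emptyset .
\]
Because $|\mathcal{S}| \le 2^{n}$ is finite, a minimum-cardinality element exists:
\[
  S^{\star} \in \arg\min_{S \in \mathcal{S}} |S| .
\]
A greedy output $G$ that stops only when no single deletion succeeds satisfies
\[
  G \in \mathcal{S}, \qquad
  G \setminus \{s\} \notin \mathcal{S} \quad \text{for all } s \in G ,
\]
so $|G| \ge |S^{\star}|$, with equality not guaranteed.
```

The first step relies on the full trace being sufficient for its own answer. The last line is the sense in which greedy elimination gives a local, rather than global, irreducibility guarantee.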

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract supplies no information on free parameters, background axioms, or newly postulated entities; all claims are stated at the level of definitions and aggregate statistics.

Reference graph

Works this paper leans on

79 extracted references · 34 canonical work pages · 13 internal anchors

[1]

Chi, Quoc V

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed H. Chi, Quoc V . Le, and Denny Zhou. Chain-of-thought prompting elicits reasoning in large language models. InAdvances in Neural Information Processing Systems, 2022

2022
[2]

Large language models are zero-shot reasoners

Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. Large language models are zero-shot reasoners. InAdvances in Neural Information Processing Systems, 2022

2022
[3]

Least-to-most prompting enables complex reasoning in large language models

Denny Zhou, Nathanael Schärli, Le Hou, Jason Wei, Nathan Scales, Xuezhi Wang, Dale Schuurmans, Claire Cui, Olivier Bousquet, Quoc Le, and Ed Chi. Least-to-most prompting enables complex reasoning in large language models. InInternational Conference on Learning Representations, 2023

2023
[4]

Griffiths, Yuan Cao, and Karthik Narasimhan

Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Thomas L. Griffiths, Yuan Cao, and Karthik Narasimhan. Tree of thoughts: Deliberate problem solving with large language models. In Advances in Neural Information Processing Systems, 2023

2023
[5]

Le, Ed H

Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc V . Le, Ed H. Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models. InInternational Conference on Learning Representations, 2023

2023
[6]

Let's Verify Step by Step

Hunter Lightman, Vineet Kosaraju, Yura Burda, Harrison Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let’s verify step by step.arXiv preprint arXiv:2305.20050, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[7]

Between underthinking and overthinking: An empirical study of reasoning length and correctness in llms.arXiv preprint arXiv:2505.00127, 2025

Jinyan Su, Jennifer Healey, Preslav Nakov, and Claire Cardie. Between underthinking and overthinking: An empirical study of reasoning length and correctness in llms.arXiv preprint arXiv:2505.00127, 2025

work page arXiv 2025
[8]

Overclocking llm reasoning: Monitoring and controlling thinking path lengths in llms.arXiv preprint arXiv:2506.07240, 2025

Roy Eisenstadt, Itamar Zimerman, and Lior Wolf. Overclocking llm reasoning: Monitoring and controlling thinking path lengths in llms.arXiv preprint arXiv:2506.07240, 2025

work page arXiv 2025
[9]

Efficient reasoning models: A survey.arXiv preprint arXiv:2504.10903, 2025

Sicheng Feng, Gongfan Fang, Xinyin Ma, and Xinchao Wang. Efficient reasoning models: A survey.arXiv preprint arXiv:2504.10903, 2025

work page arXiv 2025
[10]

A survey of efficient reasoning for large reasoning models: Language, multimodality, and beyond.arXiv preprint arXiv:2503.21614, 2025

Xiaoye Qu, Yafu Li, Zhao-Chen Su, Weigao Sun, Jianhao Yan, Dongrui Liu, Ganqu Cui, Daizong Liu, Shuxian Liang, Junxian He, et al. A survey of efficient reasoning for large reasoning models: Language, multimodality, and beyond.arXiv preprint arXiv:2503.21614, 2025

work page arXiv 2025
[11]

Reasoning models can be effective without thinking.arXiv preprint arXiv:2504.09858, 2025

Wenjie Ma, Jingxuan He, Charlie Snell, Tyler Griggs, Sewon Min, and Matei Zaharia. Reasoning models can be effective without thinking.arXiv preprint arXiv:2504.09858, 2025

work page arXiv 2025
[12]

Don’t overthink it

Michael Hassid, Gabriel Synnaeve, Yossi Adi, and Roy Schwartz. Don’t overthink it. preferring shorter thinking chains for improved llm reasoning.arXiv preprint arXiv:2505.17813, 2025

work page arXiv 2025
[13]

Logicreward: Incentivizing llm reasoning via step-wise logical supervision.arXiv preprint arXiv:2512.18196, 2025

Jundong Xu, Hao Fei, Huichi Zhou, Xin Quan, Qijun Huang, Shengqiong Wu, William Yang Wang, Mong-Li Lee, and Wynne Hsu. Logicreward: Incentivizing llm reasoning via step-wise logical supervision.arXiv preprint arXiv:2512.18196, 2025

work page arXiv 2025
[14]

General- reasoner: Advancing llm reasoning across all domains.arXiv preprint arXiv:2505.14652, 2025

Xueguang Ma, Qian Liu, Dongfu Jiang, Ge Zhang, Zejun Ma, and Wenhu Chen. General- reasoner: Advancing llm reasoning across all domains.arXiv preprint arXiv:2505.14652, 2025

work page arXiv 2025
[15]

Code execution as grounded supervision for llm reasoning

Dongwon Jung, Wenxuan Zhou, and Muhao Chen. Code execution as grounded supervision for llm reasoning. InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 24822–24833, 2025

2025
[16]

Reason-to-recommend: Using interaction-of-thought reasoning to enhance llm recommendation.arXiv preprint arXiv:2506.05069, 2025

Keyu Zhao, Fengli Xu, and Yong Li. Reason-to-recommend: Using interaction-of-thought reasoning to enhance llm recommendation.arXiv preprint arXiv:2506.05069, 2025

work page arXiv 2025
[17]

Reasoning aware self-consistency: Leveraging reasoning paths for efficient llm sampling

Guangya Wan, Yuqi Wu, Jie Chen, and Sheng Li. Reasoning aware self-consistency: Leveraging reasoning paths for efficient llm sampling. InProceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 3613–3635, 2025

2025
[18]

A theoretical study on bridging internal probability and self-consistency for llm reasoning.arXiv preprint arXiv:2510.15444, 2025

Zhi Zhou, Yuhao Tan, Zenan Li, Yuan Yao, Lan-Zhe Guo, Yu-Feng Li, and Xiaoxing Ma. A theoretical study on bridging internal probability and self-consistency for llm reasoning.arXiv preprint arXiv:2510.15444, 2025. 11

work page arXiv 2025
[19]

Confidence improves self-consistency in llms

Amir Taubenfeld, Tom Sheffer, Eran Ofek, Amir Feder, Ariel Goldstein, Zorik Gekhman, and Gal Yona. Confidence improves self-consistency in llms. InFindings of the Association for Computational Linguistics: ACL 2025, pages 20090–20111, 2025

2025
[20]

Consistent paths lead to truth: Self-rewarding reinforcement learning for llm reasoning.arXiv preprint arXiv:2506.08745, 2025

Kongcheng Zhang, Qi Yao, Shunyu Liu, Yingjie Wang, Baisheng Lai, Jieping Ye, Mingli Song, and Dacheng Tao. Consistent paths lead to truth: Self-rewarding reinforcement learning for llm reasoning.arXiv preprint arXiv:2506.08745, 2025

work page arXiv 2025
[21]

Eric Zelikman, Yuhuai Wu, Jesse Mu, and Noah D. Goodman. Star: Bootstrapping reasoning with reasoning. InAdvances in Neural Information Processing Systems, 2022

2022
[22]

Think twice: Enhancing llm reasoning by scaling multi-round test-time thinking.arXiv preprint arXiv:2503.19855, 2025

Xiaoyu Tian, Sitong Zhao, Haotian Wang, Shuaiting Chen, Yunjie Ji, Yiping Peng, Han Zhao, and Xiangang Li. Think twice: Enhancing llm reasoning by scaling multi-round test-time thinking.arXiv preprint arXiv:2503.19855, 2025

work page arXiv 2025
[23]

Atom of thoughts for markov llm test-time scaling.arXiv preprint arXiv:2502.12018, 2025

Fengwei Teng, Quan Shi, Zhaoyang Yu, Jiayi Zhang, Yuyu Luo, Chenglin Wu, and Zhijiang Guo. Atom of thoughts for markov llm test-time scaling.arXiv preprint arXiv:2502.12018, 2025

work page arXiv 2025
[24]

Towards thinking-optimal scaling of test-time compute for llm reasoning.arXiv preprint arXiv:2502.18080, 2025

Wenkai Yang, Shuming Ma, Yankai Lin, and Furu Wei. Towards thinking-optimal scaling of test-time compute for llm reasoning.arXiv preprint arXiv:2502.18080, 2025

work page arXiv 2025
[25]

Scaling llm test-time compute optimally can be more effective than scaling parameters for reasoning

Charlie Victor Snell, Jaehoon Lee, Kelvin Xu, and Aviral Kumar. Scaling llm test-time compute optimally can be more effective than scaling parameters for reasoning. InThe Thirteenth International Conference on Learning Representations, 2025

2025
[26]

LIMO: Less is More for Reasoning

Tian Ye, Zicheng Xu, Yuanzhi Li, and Zeyuan Allen-Zhu. Limo: Less is more for reasoning. arXiv preprint arXiv:2502.03387, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[27]

s1: Simple test-time scaling

Niklas Muennighoff, Zitong Yang, Weijia Shi, Xiang Lisa Li, Li Fei-Fei, Hannaneh Hajishirzi, Luke Zettlemoyer, Percy Liang, Emmanuel Candès, and Tatsunori Hashimoto. s1: Simple test-time scaling.arXiv preprint arXiv:2501.19393, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[28]

Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters

Charlie Snell, Jaehoon Lee, Kelvin Xu, and Aviral Kumar. Scaling llm test-time compute opti- mally can be more effective than scaling model parameters.arXiv preprint arXiv:2408.03314, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[29]

Trims: Real-time tracking of minimal sufficient length for efficient reasoning via reinforcement learning.arXiv preprint arXiv:2603.17449, 2026

Yubo Wang, Yuhui Li, Zhiwei Zhang, and Wenhu Chen. Trims: Real-time tracking of minimal sufficient length for efficient reasoning via reinforcement learning.arXiv preprint arXiv:2603.17449, 2026

work page arXiv 2026
[30]

Surgical trimming: Minimal sufficient chain of thought with razorreward-rl.arXiv preprint, 2025

Xinyu Chen, Zihan Liu, Kai Zhang, and Yizhong Wang. Surgical trimming: Minimal sufficient chain of thought with razorreward-rl.arXiv preprint, 2025

2025
[31]

Similarity of neural network representations revisited

Simon Kornblith, Mohammad Norouzi, Honglak Lee, and Geoffrey Hinton. Similarity of neural network representations revisited. InInternational Conference on Machine Learning, pages 3519–3529, 2019

2019
[32]

John Hewitt and Christopher D. Manning. A structural probe for finding syntax in word representations. InProceedings of NAACL-HLT, pages 4129–4138, 2019

2019
[33]

Probing classifiers: Promises, shortcomings, and advances.Computational Linguistics, 48(1):207–219, 2022

Yonatan Belinkov. Probing classifiers: Promises, shortcomings, and advances.Computational Linguistics, 48(1):207–219, 2022

2022
[34]

Intrinsic dimensionality explains the effectiveness of language model fine-tuning

Armen Aghajanyan, Sonal Gupta, and Luke Zettlemoyer. Intrinsic dimensionality explains the effectiveness of language model fine-tuning. InProceedings of ACL-IJCNLP, pages 7319–7328, 2021

2021
[35]

Inference- time intervention: Eliciting truthful answers from a language model

Kenneth Li, Oam Patel, Fernanda Viégas, Hanspeter Pfister, and Martin Wattenberg. Inference- time intervention: Eliciting truthful answers from a language model. InAdvances in Neural Information Processing Systems, 2023

2023
[36]

The Geometry of Truth: Emergent Linear Structure in Large Language Model Representations of True/False Datasets

Samuel Marks and Max Tegmark. The geometry of truth: Emergent linear structure in large language model representations of true/false datasets.arXiv preprint arXiv:2310.06824, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[37]

Training Verifiers to Solve Math Word Problems

Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems. InarXiv preprint arXiv:2110.14168, 2021. 12

work page internal anchor Pith review Pith/arXiv arXiv 2021
[38]

Measuring mathematical problem solving with the MATH dataset

Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the MATH dataset. InThirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2021

2021
[39]

David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R. Bowman. GPQA: A graduate-level google-proof q&a benchmark.arXiv preprint arXiv:2311.12022, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[40]

Did aristotle use a laptop? a question answering benchmark with implicit reasoning strategies

Mor Geva, Daniel Khashabi, Elad Segal, Tushar Khot, Dan Roth, and Jonathan Berant. Did aristotle use a laptop? a question answering benchmark with implicit reasoning strategies. In Transactions of the Association for Computational Linguistics, volume 9, pages 346–361, 2021

2021
[41]

Understanding intermediate layers using linear classifier probes

Guillaume Alain and Yoshua Bengio. Understanding intermediate layers using linear classifier probes.arXiv preprint arXiv:1610.01644, 2016

work page internal anchor Pith review Pith/arXiv arXiv 2016
[42]

Rousseeuw

Peter J. Rousseeuw. Silhouettes: A graphical aid to the interpretation and validation of cluster analysis.Journal of Computational and Applied Mathematics, 20:53–65, 1987

1987
[43]

A cluster separation measure.IEEE transactions on pattern analysis and machine intelligence, (2):224–227, 1979

David L Davies and Donald W Bouldin. A cluster separation measure.IEEE transactions on pattern analysis and machine intelligence, (2):224–227, 1979

1979
[44]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

DeepSeek-AI. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning, 2025. URLhttps://arxiv.org/abs/2501.12948

work page internal anchor Pith review Pith/arXiv arXiv 2025
[45]

Qwen3 Technical Report

Qwen Team. Qwen3 technical report, 2025. URLhttps://arxiv.org/abs/2505.09388

work page internal anchor Pith review Pith/arXiv arXiv 2025
[46]

OpenAI GPT-5 System Card

Aaditya Singh, Adam Fry, Adam Perelman, Adam Tart, Adi Ganesh, Ahmed El-Kishky, Aidan McLaughlin, Aiden Low, AJ Ostrow, Akhila Ananthram, et al. Openai gpt-5 system card.arXiv preprint arXiv:2601.03267, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[47]

Ques- tion decomposition improves the faithfulness of model-generated reasoning.arXiv preprint arXiv:2307.11768, 2023

Ansh Radhakrishnan, Karina Nguyen, Anna Chen, Carol Chen, Carson Denison, Danny Her- nandez, Esin Durmus, Evan Hubinger, Jackson Kernion, Kamil ˙e Lukoši¯ut˙e, et al. Ques- tion decomposition improves the faithfulness of model-generated reasoning.arXiv preprint arXiv:2307.11768, 2023

work page arXiv 2023
[48]

Peng Yixing, Quan Wang, Licheng Zhang, Yi Liu, and Zhendong Mao. Chain-of-question: A progressive question decomposition approach for complex knowledge base question answering. In Findings of the Association for Computational Linguistics: ACL 2024, pages 4763–4776, 2024
[49]

Xinzhe Li. Chain-in-tree: Back to sequential reasoning in llm tree search. arXiv preprint arXiv:2509.25835, 2025
[50]

Ruochen Zhao, Xingxuan Li, Shafiq Joty, Chengwei Qin, and Lidong Bing. Verify-and-edit: A knowledge-enhanced chain-of-thought framework. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 5823–5840, 2023
[51]

Zhan Ling, Yunhao Fang, Xuanlin Li, Zhiao Huang, Mingu Lee, Roland Memisevic, and Hao Su. Deductive verification of chain-of-thought reasoning. Advances in Neural Information Processing Systems, 36:36407–36433, 2023
[52]

Zeju Li, Jianyuan Zhong, Ziyang Zheng, Xiangyu Wen, Zhijian Xu, Yingying Cheng, Fan Zhang, and Qiang Xu. Making slow thinking faster: Compressing llm chain-of-thought via step entropy. In The Fourteenth International Conference on Learning Representations, 2026
[53]

Melody Zixuan Li, Kumar Krishna Agrawal, Arna Ghosh, Komal Kumar Teru, Adam Santoro, Guillaume Lajoie, and Blake A Richards. Tracing the representation geometry of language models from pretraining to post-training. arXiv preprint arXiv:2509.23024, 2025
[54]

Lihao Sun, Hang Dong, Bo Qiao, Qingwei Lin, Dongmei Zhang, and Saravan Rajmohan. Llm reasoning as trajectories: Step-specific representation geometry and correctness signals. arXiv preprint arXiv:2604.05655, 2026
[55]

Ian Tenney, Dipanjan Das, and Ellie Pavlick. BERT rediscovers the classical NLP pipeline. In Proceedings of ACL, pages 4593–4601, 2019
[56]

Kawin Ethayarajh. How contextual are contextualized word representations? comparing the geometry of BERT, ELMo, and GPT-2 embeddings. In Proceedings of EMNLP-IJCNLP, pages 55–65, 2019
[57]

Shaolei Zhang, Tian Yu, and Yang Feng. Truthx: Alleviating hallucinations by editing large language models in truthful space. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 8908–8949, 2024
[58]

Nitish Joshi, Javier Rando, Abulhair Saparov, Najoung Kim, and He He. Personas as a way to model truthfulness in language models. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 6346–6359, 2024
[59]

Kevin Liu, Stephen Casper, Dylan Hadfield-Menell, and Jacob Andreas. Cognitive dissonance: Why do language model outputs disagree with internal representations of truthfulness? In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 4791–4797, 2023
[60]

Thomas Savage, John Wang, Robert Gallo, Abdessalem Boukil, Vishwesh Patel, Seyed Amir Ahmad Safavi-Naini, Ali Soroush, and Jonathan H Chen. Large language model uncertainty proxies: discrimination and calibration for medical diagnosis and treatment. Journal of the American Medical Informatics Association, 32(1):139–149, 2025
[61]

Allen Z Ren, Anushri Dixit, Alexandra Bodrova, Sumeet Singh, Stephen Tu, Noah Brown, Peng Xu, Leila Takayama, Fei Xia, Jake Varley, et al. Robots that ask for help: Uncertainty alignment for large language model planners. arXiv preprint arXiv:2307.01928, 2023
[62]

Xiaoou Liu, Tiejin Chen, Longchao Da, Chacha Chen, Zhen Lin, and Hua Wei. Uncertainty quantification and confidence calibration in large language models: A survey. In Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V.2, pages 6107–6117, 2025
[63]

Collin Burns, Haotian Ye, Dan Klein, and Jacob Steinhardt. Discovering latent knowledge in language models without supervision. In International Conference on Learning Representations, 2023
[64]

Evan Hernandez, Arnab Sen Sharma, Tal Haklay, Kevin Meng, Martin Wattenberg, Jacob Andreas, Yonatan Belinkov, and David Bau. Linearity of relation decoding in transformer language models. arXiv preprint arXiv:2308.09124, 2023
[65]

Tao Lei, Regina Barzilay, and Tommi Jaakkola. Rationalizing neural predictions. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 107–117, 2016
[66]

Sarthak Jain and Byron C. Wallace. Attention is not explanation. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 3543–3556, 2019
[67]

Sarah Wiegreffe and Yuval Pinter. Attention is not not explanation. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, pages 11–20, 2019
[68]

Jay DeYoung, Sarthak Jain, Nazneen Fatema Rajani, Eric Lehman, Caiming Xiong, Richard Socher, and Byron C. Wallace. ERASER: A benchmark to evaluate rationalized NLP models. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4443–4458, 2020
[69]

Alon Jacovi and Yoav Goldberg. Towards faithfully interpretable NLP systems: How should we define and evaluate faithfulness? In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4198–4205, 2020
[70]

Lawrence Chan, Adrià Garriga-Alonso, Nicholas Goldowsky-Dill, Ryan Greenblatt, Ekaterina Nitishinskaya, Ansh Radhakrishnan, Buck Shlegeris, and Nate Thomas. Causal scrubbing, a method for rigorously testing interpretability hypotheses. In NeurIPS ML Safety Workshop, 2022
[71]

Sanjoy Chowdhury, Karren D Yang, Xudong Liu, Fartash Faghri, Pavan Kumar Anasosalu Vasu, Oncel Tuzel, Dinesh Manocha, Chun-Liang Li, and Raviteja Vemulapalli. Amuse: Audio-visual benchmark and alignment framework for agentic multi-speaker understanding. arXiv preprint arXiv:2512.16250, 2025
[72]

Sanjoy Chowdhury, Subrata Biswas, Sayan Nag, Tushar Nagarajan, Calvin Murdock, Ishwarya Ananthabhotla, Yijun Qian, Vamsi Krishna Ithapu, Dinesh Manocha, and Ruohan Gao. Egoadapt: Adaptive multisensory distillation and policy learning for efficient egocentric perception. In Proceedings of the IEEE/CVF International Conference on Computer Vision, page...

2025
[73]

Sanjoy Chowdhury, Mohamed Elmoghany, Yohan Abeysinghe, Junjie Fei, Sayan Nag, Salman Khan, Mohamed Elhoseiny, and Dinesh Manocha. Magnet: A multi-agent framework for finding audio-visual needles by reasoning over multi-video haystacks. Advances in Neural Information Processing Systems, 38:49255–49291, 2026
[74]

Sanjoy Chowdhury, Hanan Gani, Nishit Anand, Sayan Nag, Ruohan Gao, Mohamed Elhoseiny, Salman Khan, and Dinesh Manocha. Aurelia: Test-time reasoning distillation in audio-visual llms. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 22899–22910, 2025
[75]

Sanjoy Chowdhury, Sayan Nag, Subhrajyoti Dasgupta, Yaoting Wang, Mohamed Elhoseiny, Ruohan Gao, and Dinesh Manocha. Avtrustbench: Assessing and enhancing reliability and robustness in audio-visual llms. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1590–1601, 2025
[76]

Sanjoy Chowdhury, Sayan Nag, Subhrajyoti Dasgupta, Jun Chen, Mohamed Elhoseiny, Ruohan Gao, and Dinesh Manocha. Meerkat: Audio-visual large language model for grounding in space and time. In European Conference on Computer Vision, pages 52–70. Springer, 2024
[77]

Sanjoy Chowdhury, Sayan Nag, KJ Joseph, Balaji Vasan Srinivasan, and Dinesh Manocha. Melfusion: Synthesizing music from image and language cues using diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 26826–26835, 2024
[78]

Sanjoy Chowdhury, Sayan Nag, and Dinesh Manocha. Apollo: Unified adapter and prompt learning for vision language models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 10173–10187, 2023
[79]

Siminfar Samakoush Galougah, Rishie Raj, Sanjoy Chowdhury, Sayan Nag, and Ramani Duraiswami. Aura: A fine-grained benchmark and decomposed metric for audio-visual reasoning. arXiv preprint arXiv:2508.07470, 2025

[42] [42]

Rousseeuw

Peter J. Rousseeuw. Silhouettes: A graphical aid to the interpretation and validation of cluster analysis.Journal of Computational and Applied Mathematics, 20:53–65, 1987

1987

[43] [43]

A cluster separation measure.IEEE transactions on pattern analysis and machine intelligence, (2):224–227, 1979

David L Davies and Donald W Bouldin. A cluster separation measure.IEEE transactions on pattern analysis and machine intelligence, (2):224–227, 1979

1979

[44] [44]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

DeepSeek-AI. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning, 2025. URLhttps://arxiv.org/abs/2501.12948

work page internal anchor Pith review Pith/arXiv arXiv 2025

[45] [45]

Qwen3 Technical Report

Qwen Team. Qwen3 technical report, 2025. URLhttps://arxiv.org/abs/2505.09388

work page internal anchor Pith review Pith/arXiv arXiv 2025

[46] [46]

OpenAI GPT-5 System Card

Aaditya Singh, Adam Fry, Adam Perelman, Adam Tart, Adi Ganesh, Ahmed El-Kishky, Aidan McLaughlin, Aiden Low, AJ Ostrow, Akhila Ananthram, et al. Openai gpt-5 system card.arXiv preprint arXiv:2601.03267, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[47] [47]

Ques- tion decomposition improves the faithfulness of model-generated reasoning.arXiv preprint arXiv:2307.11768, 2023

Ansh Radhakrishnan, Karina Nguyen, Anna Chen, Carol Chen, Carson Denison, Danny Her- nandez, Esin Durmus, Evan Hubinger, Jackson Kernion, Kamil ˙e Lukoši¯ut˙e, et al. Ques- tion decomposition improves the faithfulness of model-generated reasoning.arXiv preprint arXiv:2307.11768, 2023

work page arXiv 2023

[48] [48]

Chain-of-question: A progressive question decomposition approach for complex knowledge base question answering

Peng Yixing, Quan Wang, Licheng Zhang, Yi Liu, and Zhendong Mao. Chain-of-question: A progressive question decomposition approach for complex knowledge base question answering. InFindings of the Association for Computational Linguistics: ACL 2024, pages 4763–4776, 2024

2024

[49] [49]

Chain-in-Tree: Back to Sequential Reasoning in LLM Tree Search

Xinzhe Li. Chain-in-tree: Back to sequential reasoning in llm tree search.arXiv preprint arXiv:2509.25835, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[50] [50]

Verify-and-edit: A knowledge-enhanced chain-of-thought framework

Ruochen Zhao, Xingxuan Li, Shafiq Joty, Chengwei Qin, and Lidong Bing. Verify-and-edit: A knowledge-enhanced chain-of-thought framework. InProceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 5823–5840, 2023

2023

[51] [51]

Deductive verification of chain-of-thought reasoning.Advances in Neural Information Processing Systems, 36:36407–36433, 2023

Zhan Ling, Yunhao Fang, Xuanlin Li, Zhiao Huang, Mingu Lee, Roland Memisevic, and Hao Su. Deductive verification of chain-of-thought reasoning.Advances in Neural Information Processing Systems, 36:36407–36433, 2023

2023

[52] [52]

Making slow thinking faster: Compressing llm chain-of-thought via step entropy

Zeju Li, Jianyuan Zhong, Ziyang Zheng, Xiangyu Wen, Zhijian Xu, Yingying Cheng, Fan Zhang, and Qiang Xu. Making slow thinking faster: Compressing llm chain-of-thought via step entropy. InThe Fourteenth International Conference on Learning Representations, 2026

2026

[53] [53]

Tracing the representation geometry of language models from pretraining to post-training.arXiv preprint arXiv:2509.23024, 2025

Melody Zixuan Li, Kumar Krishna Agrawal, Arna Ghosh, Komal Kumar Teru, Adam Santoro, Guillaume Lajoie, and Blake A Richards. Tracing the representation geometry of language models from pretraining to post-training.arXiv preprint arXiv:2509.23024, 2025

work page arXiv 2025

[54] [54]

LLM Reasoning as Trajectories: Step-Specific Representation Geometry and Correctness Signals

Lihao Sun, Hang Dong, Bo Qiao, Qingwei Lin, Dongmei Zhang, and Saravan Rajmohan. Llm reasoning as trajectories: Step-specific representation geometry and correctness signals.arXiv preprint arXiv:2604.05655, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[55] [55]

BERT rediscovers the classical NLP pipeline

Ian Tenney, Dipanjan Das, and Ellie Pavlick. BERT rediscovers the classical NLP pipeline. In Proceedings of ACL, pages 4593–4601, 2019

2019

[56] [56]

How contextual are contextualized word representations? comparing the geometry of BERT, ELMo, and GPT-2 embeddings

Kawin Ethayarajh. How contextual are contextualized word representations? comparing the geometry of BERT, ELMo, and GPT-2 embeddings. InProceedings of EMNLP-IJCNLP, pages 55–65, 2019. 13

2019

[57] [57]

Truthx: Alleviating hallucinations by editing large language models in truthful space

Shaolei Zhang, Tian Yu, and Yang Feng. Truthx: Alleviating hallucinations by editing large language models in truthful space. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 8908–8949, 2024

2024

[58] [58]

Personas as a way to model truthfulness in language models

Nitish Joshi, Javier Rando, Abulhair Saparov, Najoung Kim, and He He. Personas as a way to model truthfulness in language models. InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 6346–6359, 2024

2024

[59] [59]

Kevin Liu, Stephen Casper, Dylan Hadfield-Menell, and Jacob Andreas. Cognitive dissonance: Why do language model outputs disagree with internal representations of truthfulness? In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 4791–4797, 2023

2023

[60] [60]

Thomas Savage, John Wang, Robert Gallo, Abdessalem Boukil, Vishwesh Patel, Seyed Amir Ah- mad Safavi-Naini, Ali Soroush, and Jonathan H Chen. Large language model uncertainty proxies: discrimination and calibration for medical diagnosis and treatment.Journal of the American Medical Informatics Association, 32(1):139–149, 2025

2025

[61] [61]

Robots that ask for help: Uncertainty alignment for large language model planners.arXiv preprint arXiv:2307.01928, 2023

Allen Z Ren, Anushri Dixit, Alexandra Bodrova, Sumeet Singh, Stephen Tu, Noah Brown, Peng Xu, Leila Takayama, Fei Xia, Jake Varley, et al. Robots that ask for help: Uncertainty alignment for large language model planners.arXiv preprint arXiv:2307.01928, 2023

work page arXiv 2023

[62] [62]

Uncertainty quantification and confidence calibration in large language models: A survey

Xiaoou Liu, Tiejin Chen, Longchao Da, Chacha Chen, Zhen Lin, and Hua Wei. Uncertainty quantification and confidence calibration in large language models: A survey. InProceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V . 2, pages 6107–6117, 2025

2025

[63] [63]

Discovering latent knowledge in language models without supervision.International Conference on Learning Representations, 2023

Collin Burns, Haotian Ye, Dan Klein, and Jacob Steinhardt. Discovering latent knowledge in language models without supervision.International Conference on Learning Representations, 2023

2023

[64] [64]

Linearity of relation decoding in transformer language models.arXiv preprint arXiv:2308.09124, 2023

Evan Hernandez, Arnab Sen Sharma, Tal Haklay, Kevin Meng, Martin Wattenberg, Jacob Andreas, Yonatan Belinkov, and David Bau. Linearity of relation decoding in transformer language models.arXiv preprint arXiv:2308.09124, 2023

work page arXiv 2023

[65] [65]

Rationalizing neural predictions

Tao Lei, Regina Barzilay, and Tommi Jaakkola. Rationalizing neural predictions. InProceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 107–117, 2016

2016

[66] [66]

Sarthak Jain and Byron C. Wallace. Attention is not explanation. InProceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 3543–3556, 2019

2019

[67] Sarah Wiegreffe and Yuval Pinter. Attention is not not explanation. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, pages 11–20, 2019.

[68] Jay DeYoung, Sarthak Jain, Nazneen Fatema Rajani, Eric Lehman, Caiming Xiong, Richard Socher, and Byron C. Wallace. ERASER: A benchmark to evaluate rationalized NLP models. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4443–4458, 2020.

[69] Alon Jacovi and Yoav Goldberg. Towards faithfully interpretable NLP systems: How should we define and evaluate faithfulness? In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4198–4205, 2020.

[70] Lawrence Chan, Adrià Garriga-Alonso, Nicholas Goldowsky-Dill, Ryan Greenblatt, Ekaterina Nitishinskaya, Ansh Radhakrishnan, Buck Shlegeris, and Nate Thomas. Causal scrubbing, a method for rigorously testing interpretability hypotheses. In NeurIPS ML Safety Workshop, 2022.

[71] Sanjoy Chowdhury, Karren D Yang, Xudong Liu, Fartash Faghri, Pavan Kumar Anasosalu Vasu, Oncel Tuzel, Dinesh Manocha, Chun-Liang Li, and Raviteja Vemulapalli. Amuse: Audio-visual benchmark and alignment framework for agentic multi-speaker understanding. arXiv preprint arXiv:2512.16250, 2025.

[72] Sanjoy Chowdhury, Subrata Biswas, Sayan Nag, Tushar Nagarajan, Calvin Murdock, Ishwarya Ananthabhotla, Yijun Qian, Vamsi Krishna Ithapu, Dinesh Manocha, and Ruohan Gao. Egoadapt: Adaptive multisensory distillation and policy learning for efficient egocentric perception. In Proceedings of the IEEE/CVF International Conference on Computer Vision, page..., 2025.

[73] Sanjoy Chowdhury, Mohamed Elmoghany, Yohan Abeysinghe, Junjie Fei, Sayan Nag, Salman Khan, Mohamed Elhoseiny, and Dinesh Manocha. Magnet: A multi-agent framework for finding audio-visual needles by reasoning over multi-video haystacks. Advances in Neural Information Processing Systems, 38:49255–49291, 2026.

[74] Sanjoy Chowdhury, Hanan Gani, Nishit Anand, Sayan Nag, Ruohan Gao, Mohamed Elhoseiny, Salman Khan, and Dinesh Manocha. Aurelia: Test-time reasoning distillation in audio-visual llms. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 22899–22910, 2025.

[75] Sanjoy Chowdhury, Sayan Nag, Subhrajyoti Dasgupta, Yaoting Wang, Mohamed Elhoseiny, Ruohan Gao, and Dinesh Manocha. Avtrustbench: Assessing and enhancing reliability and robustness in audio-visual llms. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1590–1601, 2025.

[76] Sanjoy Chowdhury, Sayan Nag, Subhrajyoti Dasgupta, Jun Chen, Mohamed Elhoseiny, Ruohan Gao, and Dinesh Manocha. Meerkat: Audio-visual large language model for grounding in space and time. In European Conference on Computer Vision, pages 52–70. Springer, 2024.

[77] Sanjoy Chowdhury, Sayan Nag, KJ Joseph, Balaji Vasan Srinivasan, and Dinesh Manocha. Melfusion: Synthesizing music from image and language cues using diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 26826–26835, 2024.

[78] Sanjoy Chowdhury, Sayan Nag, and Dinesh Manocha. Apollo: Unified adapter and prompt learning for vision language models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 10173–10187, 2023.

[79] Siminfar Samakoush Galougah, Rishie Raj, Sanjoy Chowdhury, Sayan Nag, and Ramani Duraiswami. Aura: A fine-grained benchmark and decomposed metric for audio-visual reasoning. arXiv preprint arXiv:2508.07470, 2025.

Appendices

A Implementation Details

B Proof Details

B.1 Existence of Minimal Cores ...