Latent Phase-Shift Rollback: Inference-Time Error Correction via Residual Stream Monitoring and KV-Cache Steering
Pith reviewed 2026-05-10 05:04 UTC · model grok-4.3
The pith
Large language models can correct otherwise-unrecoverable reasoning errors at inference by monitoring residual-stream phase shifts and steering the KV cache.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Latent Phase-Shift Rollback monitors the residual stream at a critical layer for abrupt directional reversals, detected through a dual gate on cosine similarity and entropy; on detection it rolls back the KV-cache and injects a pre-computed steering vector to correct errors during generation. This yields 44.0% accuracy on MATH-500 with an 8B model, versus 28.8% for standard autoregressive decoding.
What carries the argument
Latent Phase-Shift Rollback, which uses a dual gate on cosine similarity and entropy at layer l_crit to detect phase shifts in the residual stream, followed by KV-cache rollback and steering-vector injection.
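To make that machinery concrete, here is a minimal PyTorch sketch of how such a dual gate and the rollback-plus-injection step could look. Everything in it is an assumption for illustration: the function names, both thresholds, the rollback depth rollback_k, and the steering scale alpha; the paper's exact construction is not reproduced here.

```python
import torch
import torch.nn.functional as F

def next_token_entropy(logits: torch.Tensor) -> torch.Tensor:
    # Shannon entropy (nats) of the next-token distribution.
    logp = F.log_softmax(logits, dim=-1)
    return -(logp.exp() * logp).sum(dim=-1)

def phase_shift_gate(h_prev, h_curr, logits,
                     cos_thresh=0.0, ent_thresh=3.0):
    # Dual gate at layer l_crit: fire only when the residual-stream step
    # direction reverses (cosine similarity below threshold) AND the model
    # is uncertain (entropy above threshold). Threshold values here are
    # placeholders, not the paper's.
    cos = F.cosine_similarity(h_prev, h_curr, dim=-1)
    return (cos < cos_thresh) & (next_token_entropy(logits) > ent_thresh)

def lpsr_step(kv_cache, h_window, logits, steering_vec,
              rollback_k=4, alpha=1.0):
    # One LPSR decision point. kv_cache is a per-position list; h_window
    # holds recent residual-stream vectors at l_crit. If the gate fires,
    # drop the last rollback_k cache entries and return an offset
    # (alpha * steering_vec) to add to the residual stream at l_crit when
    # generation resumes; otherwise leave everything untouched.
    if len(h_window) < 2:
        return kv_cache, None
    if phase_shift_gate(h_window[-2], h_window[-1], logits):
        return kv_cache[:-rollback_k], alpha * steering_vec
    return kv_cache, None
```

In a real decoder the returned offset would be applied to the residual stream at l_crit, for example via a forward hook, before generation resumes; no extra forward passes are needed, which matches the paper's efficiency claim.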
If this is right
- Math accuracy improves by 15.2 percentage points on MATH-500 for 8B models.
- Prompted self-correction, the most natural inference-time baseline, actually scores below standard decoding and is outperformed by 24.2 points.
- Token cost is 5.4 times lower than Best-of-16 sampling while achieving higher accuracy.
- The 8B model exceeds the accuracy of a standard 70B model despite having 8.75 times fewer parameters.
- The optimal layer for error detection differs from the optimal layer for correction accuracy.
Where Pith is reading between the lines
- The method implies that error signals appear in the internal activations before they affect the final output tokens.
- It could be extended to other domains like code generation or logical reasoning by identifying suitable critical layers.
- Adaptive selection of the monitoring layer during generation might further improve results.
- Combining this with other inference techniques could yield even larger gains at modest extra cost.
Load-bearing premise
That the detected abrupt directional reversals at the critical layer correspond to unrecoverable reasoning errors that the steering vector can reliably correct.
What would settle it
Comparing performance when steering is applied at randomly chosen steps versus only at detected phase shifts; if random application matches or exceeds the gated version, the specific detection mechanism would be shown unnecessary.
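A minimal harness for that experiment could look like the sketch below; decode is a hypothetical evaluation function (not from the paper) that generates one answer under a given trigger policy and reports correctness, trigger count, and step count, so the random control can be rate-matched to the gate.

```python
import random

def random_policy(rate):
    # Rate-matched control: trigger with the gated policy's observed
    # per-step probability, independent of the residual stream.
    return lambda step_state: random.random() < rate

def gated_vs_random(problems, decode, gate_policy, seed=0):
    # decode(problem, policy) -> (correct: bool, n_triggers: int, n_steps: int)
    # is a hypothetical harness, not an API from the paper.
    random.seed(seed)
    gated = [decode(p, gate_policy) for p in problems]
    rate = sum(t for _, t, _ in gated) / max(1, sum(s for _, _, s in gated))
    rand = [decode(p, random_policy(rate)) for p in problems]
    acc = lambda rs: sum(c for c, _, _ in rs) / len(rs)
    # If these two accuracies match, the dual gate adds nothing beyond
    # rollback-induced diversity.
    return acc(gated), acc(rand)
```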
Original abstract
Large language models frequently commit unrecoverable reasoning errors mid-generation: once a wrong step is taken, subsequent tokens compound the mistake rather than correct it. We introduce $\textbf{Latent Phase-Shift Rollback}$ (LPSR): at each generation step, we monitor the residual stream at a critical layer lcrit, detect abrupt directional reversals (phase shifts) via a cosine-similarity $+$ entropy dual gate, and respond by rolling back the KV-cache and injecting a pre-computed steering vector. No fine-tuning, gradient computation, or additional forward passes are required. LPSR achieves $\mathbf{44.0\%}$ on MATH-500 with an 8B model versus $28.8\%$ for standard AR ($+15.2$ pp; McNemar $\chi^2 = 66.96$, $p < 10^{-15}$). Critically, prompted self-correction, the most natural inference-time baseline, scores only $19.8\%$, below standard AR; LPSR exceeds it by $+24.2$ pp ($\chi^2 = 89.4$, $p \approx 0$). LPSR also outperforms Best-of-16 ($+7.8$ pp) at $5.4\times$ lower token cost, and surpasses a standard 70B model ($35.2\%$) with $8.75\times$ fewer parameters at ${\sim}3\times$ the token budget. A 32-layer sweep reveals a novel \textbf{detection-correction dissociation}: error-detection AUC peaks at layer~14 ($0.718$) but task accuracy peaks at layer~16 ($44.0\%$ vs.\ $29.2\%$), demonstrating that optimal monitoring depth differs for detection and correction.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes Latent Phase-Shift Rollback (LPSR), an inference-time technique for correcting unrecoverable reasoning errors in LLMs. It monitors the residual stream at a critical layer l_crit for abrupt directional reversals using a cosine-similarity and entropy-based dual gate. Upon detection, it rolls back the KV-cache and injects a pre-computed steering vector. The method requires no fine-tuning or additional forward passes. On the MATH-500 benchmark, LPSR achieves 44.0% accuracy with an 8B model, compared to 28.8% for standard autoregressive decoding (+15.2 pp), outperforming prompted self-correction (19.8%) and Best-of-16 sampling, while also surpassing a 70B model.
Significance. If the mechanistic interpretation holds, the results would be significant for inference-time scaling and error correction in LLMs. The approach offers substantial gains on mathematical reasoning tasks without training or extra forward passes, with reported efficiency advantages over sampling methods. The observed dissociation between optimal layers for detection (layer 14, AUC 0.718) and correction (layer 16, 44.0% accuracy) is a novel empirical finding. The inclusion of statistical tests (McNemar chi-squared with p-values) and a 32-layer sweep strengthens the empirical component.
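The dissociation rests on two per-layer curves: detection AUC against annotated error steps, and end-task accuracy with the intervention applied at each layer. A minimal sketch of the detection half, assuming per-step gate scores and externally annotated error labels as inputs (neither artifact accompanies this review):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def detection_auc_by_layer(gate_scores, error_labels):
    # gate_scores: (n_layers, n_steps) array, one scalar phase-shift score
    # per layer per generation step (e.g. negated cosine similarity).
    # error_labels: (n_steps,) binary array marking annotated error steps.
    # Both inputs are assumptions about what such an analysis would consume.
    aucs = [roc_auc_score(error_labels, scores) for scores in gate_scores]
    return aucs, int(np.argmax(aucs))  # paper: AUC peaks at layer 14 (0.718)
```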
major comments (2)
- [Abstract] The central performance claim that the cosine-similarity + entropy dual gate at l_crit specifically detects unrecoverable reasoning errors is not supported by direct evidence such as per-instance tracing, human annotation of triggered steps, or contrastive residual-stream analysis between error and non-error trajectories. Without this, the +15.2 pp lift could arise from rollback-induced diversity or untargeted vector injection rather than causal correction.
- [Results] The results section reports aggregate accuracy and a layer dissociation but provides no ablation showing that non-triggered generations remain unaffected or that the gate distinguishes error from non-error steps. This is load-bearing for the claim that LPSR performs targeted error correction rather than generic intervention.
minor comments (2)
- [Abstract] The notation for the critical layer is given as lcrit in the abstract text but would benefit from consistent mathematical formatting (e.g., l_{crit}) throughout.
- [Methods] Exact construction details for the pre-computed steering vector and the precise threshold values in the dual gate are referenced but would benefit from explicit equations or pseudocode to support reproducibility.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback highlighting the need for stronger causal evidence on the specificity of LPSR's error detection. We address each major comment below and commit to revisions that add the requested analyses without altering the core claims or results.
Point-by-point responses
- Referee: [Abstract] The central performance claim that the cosine-similarity + entropy dual gate at l_crit specifically detects unrecoverable reasoning errors is not supported by direct evidence such as per-instance tracing, human annotation of triggered steps, or contrastive residual-stream analysis between error and non-error trajectories. Without this, the +15.2 pp lift could arise from rollback-induced diversity or untargeted vector injection rather than causal correction.
Authors: We agree that per-instance tracing, human annotation of triggered steps, and explicit contrastive residual-stream analysis would provide stronger direct evidence for the gate's specificity to unrecoverable errors. The manuscript currently supports the claim indirectly via the statistically significant accuracy lift (McNemar χ² = 66.96, p < 10^{-15}), outperformance over prompted self-correction and Best-of-16, and the novel detection-correction layer dissociation (AUC peak at layer 14 vs. accuracy peak at layer 16). These elements are inconsistent with purely generic diversity or untargeted injection. In revision we will add a contrastive analysis of residual-stream trajectories on error vs. non-error steps and report intervention rates broken down by correctness. revision: yes
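For orientation, the McNemar statistic cited above is a short computation over paired per-problem correctness; whether the paper uses the continuity-corrected variant is not stated, so this sketch shows the plain form.

```python
def mcnemar_chi2(base_correct, lpsr_correct):
    # Paired comparison over the same problems. Only discordant pairs
    # matter: b = baseline right where LPSR is wrong, c = baseline wrong
    # where LPSR is right. Plain statistic: (b - c)^2 / (b + c); the
    # continuity-corrected form uses (|b - c| - 1)^2 / (b + c).
    pairs = list(zip(base_correct, lpsr_correct))
    b = sum(1 for x, y in pairs if x and not y)
    c = sum(1 for x, y in pairs if not x and y)
    return (b - c) ** 2 / (b + c)
```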
- Referee: [Results] The results section reports aggregate accuracy and a layer dissociation but provides no ablation showing that non-triggered generations remain unaffected or that the gate distinguishes error from non-error steps. This is load-bearing for the claim that LPSR performs targeted error correction rather than generic intervention.
Authors: The current results section indeed lacks an explicit ablation confirming that non-triggered generations are unaffected and does not directly compare gate activation statistics between error and non-error steps. We acknowledge this limits the strength of the targeted-correction interpretation. In the revised manuscript we will include (1) an ablation measuring accuracy on the subset of generations where the dual gate never triggers (to verify no unintended degradation) and (2) gate-trigger statistics and activation histograms conditioned on whether the final answer is correct or incorrect. These additions will directly address whether the intervention is selective. revision: yes
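The promised selectivity check reduces to a small computation over per-run logs; the tuple format below is an assumption about what those logs would contain.

```python
def trigger_rate_by_correctness(runs):
    # runs: iterable of (final_correct: bool, n_triggers: int, n_steps: int),
    # one tuple per generated solution. A selective gate should fire far
    # more often, per step, on runs that end incorrect than on runs that
    # end correct.
    totals = {True: [0, 0], False: [0, 0]}
    for correct, triggers, steps in runs:
        totals[correct][0] += triggers
        totals[correct][1] += steps
    return {("correct" if k else "incorrect"): t / max(1, s)
            for k, (t, s) in totals.items()}
```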
Circularity Check
No circularity: empirical inference-time method validated on external benchmarks
Full rationale
The paper introduces LPSR as a procedural inference-time technique: residual-stream monitoring with a cosine-similarity plus entropy gate, KV-cache rollback, and injection of a pre-computed steering vector. All performance claims (e.g., 44.0% on MATH-500) are obtained by direct benchmarking against independent baselines and datasets; no equation, derivation, or self-citation reduces the reported gains to fitted inputs or definitional tautologies. The 32-layer sweep and the dissociation finding are likewise empirical observations, not self-referential constructions.
Axiom & Free-Parameter Ledger
free parameters (2)
- critical layer l_crit
- cosine-similarity and entropy thresholds
invented entities (1)
- latent phase shift (no independent evidence)
Reference graph
Works this paper leans on
- [1] Anthropic. Softmax linear units. Transformer Circuits Thread, 2022. https://transformer-circuits.pub/2022/solu/index.html
- [2] Maciej Besta, Nils Blach, Ales Kubicek, Robert Gerstenberger, Michal Podstawski, Lukas Gianinazzi, Joanna Gajda, Tomasz Lehmann, Hubert Niewiadomski, Piotr Nyczyk, and Torsten Hoefler. Graph of thoughts: Solving elaborate problems with large language models. In AAAI Conference on Artificial Intelligence, 2024.
- [3] Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems. In Advances in Neural Information Processing Systems, 2021.
- [4] Nelson Elhage, Neel Nanda, Catherine Olsson, et al. A mathematical framework for transformer circuits. Transformer Circuits Thread, 2021.
- [5] Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025.
- [7] Jie Huang, Xinyun Chen, Swaroop Mishra, Huaixiu Steven Zheng, Adams Wei Yu, Xinying Song, and Denny Zhou. Large language models cannot self-correct reasoning yet. In International Conference on Learning Representations, 2024.
- [8] Michel Ledoux and Michel Talagrand. Probability in Banach Spaces: Isoperimetry and Processes. Springer, Berlin, Heidelberg, 1991.
- [9] Yaniv Leviathan, Matan Kalman, and Yossi Matias. Fast inference from transformers via speculative decoding. In International Conference on Machine Learning, pp. 19274-19286. PMLR, 2023.
- [10] Hunter Lightman, Vineet Kosaraju, Yura Burda, Harri Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let's verify step by step. In International Conference on Learning Representations, 2024.
- [11] Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, Shashank Gupta, et al. Self-refine: Iterative refinement with self-feedback. In Advances in Neural Information Processing Systems, 2023.
- [12] William Merrill and Ashish Sabharwal. The expressive power of transformers with chain of thought. In International Conference on Learning Representations, 2024.
- [14] Kiho Park, Yo Joong Choe, and Victor Veitch. The linear representation hypothesis and the geometry of large language models. In International Conference on Machine Learning, 2024.
- [16] Charlie Snell, Jaehoon Lee, Kelvin Xu, and Aviral Kumar. Scaling LLM test-time compute optimally can be more effective than scaling model parameters. In International Conference on Learning Representations, 2025.
- [17] Alex Turner, Lisa Thiergart, Gavin Leech, David Udell, Ulisse Mini, and Monte MacDiarmid. Activation addition: Steering language models without optimization. In Advances in Neural Information Processing Systems, volume 36, 2023.
- [18] Jonathan Uesato, Nate Kushman, Ramana Kumar, Francis Song, Noah Siegel, Lisa Wang, Antonia Creswell, Geoffrey Irving, and Irina Higgins. Solving math word problems with process- and outcome-based feedback. In Advances in Neural Information Processing Systems, 2022.
- [19] Karthik Valmeekam, Matthew Marquez, Sarath Sreedharan, and Subbarao Kambhampati. Large language models still can't plan (a benchmark for LLMs on planning and reasoning about change). In Advances in Neural Information Processing Systems Workshop on Foundation Models for Decision Making, 2023.
- [20] Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models. In International Conference on Learning Representations, 2023.
- [21] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc Le, and Denny Zhou. Chain-of-thought prompting elicits reasoning in large language models. In Advances in Neural Information Processing Systems, volume 35, pp. 24824-24837. Curran Associates, Inc., 2022.
- [22] Jiayi Yang, Hongye Jin, Ruixiang Tang, Xiaotian Han, Qiyuan Feng, Haoming Xu, Shaochen Lian, Zheng Jiang, Zhengmian Hu, Xia Hu, et al. KV cache compression, but what must we give in return? A comprehensive benchmark of long context capable approaches. arXiv preprint arXiv:2407.01527, 2024.
- [23] Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Thomas L. Griffiths, Yuan Cao, and Karthik Narasimhan. Tree of thoughts: Deliberate problem solving with large language models. In Advances in Neural Information Processing Systems, volume 36, 2023.
- [24] Andy Zou, Long Phan, Sarah Chen, James Campbell, Phillip Guo, Richard Ren, Alexander Pan, Xuwang Yin, Mantas Mazeika, Ann-Kathrin Dombrowski, et al. Representation engineering: A top-down approach to AI transparency, 2024.
- [41] Training large language models to reason in a continuous latent space. arXiv preprint arXiv:2412.06769, 2024.
- [42] The Llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024.
- [48] The illusion of thinking: Understanding the strengths and limitations of reasoning models via the lens of problem complexity. arXiv preprint arXiv:2506.06941, 2025.