LLMs Encode Their Failures: Predicting Success from Pre-Generation Activations

Chris Russell; Thomas Foster; William Bankes; William Lugoloobi

arXiv: 2602.09924 · v3 · submitted 2026-02-10 · cs.CL · cs.AI · cs.LG

William Lugoloobi, Thomas Foster, William Bankes, Chris Russell

Pith reviewed 2026-05-16 02:50 UTC · model grok-4.3

classification: cs.CL · cs.AI · cs.LG

keywords: LLM, linear probes, pre-generation activations, success prediction, model routing, difficulty encoding, inference efficiency

The pith

LLMs encode in their pre-generation activations whether they will succeed on math and coding tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that an LLM's hidden states before it begins generating an answer already contain a recoverable signal about whether the final answer will be correct. Simple linear probes trained on these states predict success better than obvious surface features such as question length or TF-IDF scores. Because the probes run quickly, they can decide which model in a pool should handle each question, producing higher overall accuracy than the single strongest model while cutting total inference cost by up to 70 percent on MATH. The encoded sense of difficulty is model-specific and diverges further from human judgments when longer reasoning chains are used.
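The routing idea in the paragraph above can be sketched in a few lines. This is a minimal illustration, not the paper's routing policy, which this page does not specify: the `Candidate` class, the `route` function, the 0.7 threshold, the model names, and all costs and probabilities are hypothetical toy values.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str         # hypothetical model identifier
    cost: float       # assumed relative cost of answering one query
    p_success: float  # probe's predicted success probability for this query


def route(candidates, threshold=0.7):
    """Pick the cheapest candidate whose predicted success clears the threshold.

    If no candidate clears it, fall back to the one with the highest
    predicted success. This rule is an assumption made for illustration.
    """
    confident = [c for c in candidates if c.p_success >= threshold]
    if confident:
        return min(confident, key=lambda c: c.cost)
    return max(candidates, key=lambda c: c.p_success)


# Toy per-query probe outputs (made-up numbers, not results from the paper).
queries = {
    "easy": [Candidate("small", 1.0, 0.92), Candidate("large", 10.0, 0.97)],
    "hard": [Candidate("small", 1.0, 0.15), Candidate("large", 10.0, 0.60)],
}

choices = {qid: route(cands) for qid, cands in queries.items()}

# Compare the routed cost against always calling the most expensive model.
routed_cost = sum(c.cost for c in choices.values())
always_large_cost = 10.0 * len(queries)
saving = 1.0 - routed_cost / always_large_cost
```

Because the predicted difficulty is model-specific, a router of this kind could in principle send a query to a cheaper model that happens to solve it while a larger model fails, which is one way a pool might exceed its best single member.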

Core claim

Models encode a model-specific notion of difficulty in pre-generation activations that is distinct from human difficulty. Linear probes on these activations predict policy-specific success on math and coding tasks, substantially outperforming surface features, and enable routing across models to exceed the best single model while cutting inference cost by up to 70% on MATH.

What carries the argument

Linear probes trained on pre-generation activations to predict policy-specific success.
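A linear probe of this kind can be sketched as a logistic regression over activation vectors. The example below is an illustration under stated assumptions rather than the authors' implementation: the activations are synthetic Gaussian vectors standing in for real hidden states, the labels come from a made-up hidden direction, and every function name is hypothetical.

```python
import math
import random


def sigmoid(z):
    # Numerically safe logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def fit_linear_probe(X, y, epochs=200, lr=0.1):
    """Fit logistic-regression weights and bias by full-batch gradient descent."""
    d, n = len(X[0]), len(X)
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(dot(w, xi) + b) - yi  # gradient of cross-entropy w.r.t. the logit
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict_success_prob(w, b, x):
    return sigmoid(dot(w, x) + b)


def make_synthetic(n, d, rng):
    # Assumption: synthetic stand-in for pre-generation activations.
    # Success is a noisy linear function of the activation vector.
    direction = [1.0 if j % 2 == 0 else -1.0 for j in range(d)]
    X = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(n)]
    y = [1 if dot(direction, x) + rng.gauss(0, 0.5) > 0 else 0 for x in X]
    return X, y


rng = random.Random(0)
X_train, y_train = make_synthetic(300, 8, rng)
X_test, y_test = make_synthetic(200, 8, rng)

w, b = fit_linear_probe(X_train, y_train)
test_acc = sum(
    (predict_success_prob(w, b, x) > 0.5) == bool(t) for x, t in zip(X_test, y_test)
) / len(X_test)
```

In a real setting `X_train` would hold hidden states collected before the model emits its first answer token and `y_train` the observed correctness of that model's answers. Both would have to be collected from the model itself.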

Load-bearing premise

The probes capture a genuine, generalizable signal of future success rather than spurious correlations tied to the training distribution or dataset.

What would settle it

If probes trained on one set of problems were tested on a fresh distribution of problems and their prediction accuracy collapsed to near-random levels, the generalizability of the signal would be falsified.
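One way to make "near-random" measurable is to compare in-distribution and out-of-distribution AUROC, where 0.5 corresponds to chance. The text above speaks of prediction accuracy, so AUROC is a substituted metric chosen here for illustration; the `auroc` helper and the toy scores are hypothetical.

```python
def auroc(scores, labels):
    """Probability that a random positive outscores a random negative.

    Ties count as half a win. A value of 0.5 means the scores carry
    no ranking information about the labels.
    """
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy probe scores (made-up numbers): a probe that ranks well in-distribution
# and an uninformative probe on a shifted distribution.
in_dist_auroc = auroc([0.9, 0.8, 0.2, 0.1], [1, 1, 0, 0])
shifted_auroc = auroc([0.5, 0.5, 0.5, 0.5], [1, 0, 1, 0])
```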

Original abstract

Running LLMs with extended reasoning on every problem is expensive, but determining which inputs actually require additional compute remains challenging. We investigate whether their own likelihood of success is recoverable from their internal representations before generation, and if this signal can guide more efficient inference. We train linear probes on pre-generation activations to predict policy-specific success on math and coding tasks, substantially outperforming surface features such as question length and TF-IDF. Using E2H-AMC, which provides both human and model performance on identical problems, we show that models encode a model-specific notion of difficulty that is distinct from human difficulty, and that this distinction increases with extended reasoning. Leveraging these probes, we demonstrate that routing queries across a pool of models can exceed the best-performing model whilst reducing inference cost by up to 70% on MATH, showing that internal representations enable practical efficiency gains even when they diverge from human intuitions about difficulty. Our code is available at: https://github.com/KabakaWilliam/llms_know_difficulty

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Linear probes on pre-generation activations predict model success on math tasks well enough to route queries and beat the strongest single model at lower cost, but the gains may not generalize beyond the MATH distribution.

The core result is that you can read activations from an LLM right before it starts generating and train a linear probe to guess whether that model will solve the problem correctly. On MATH this beats simple baselines like question length or TF-IDF, and the authors use the predictions to route each query to one model in a small pool. The routing version ends up more accurate than the best individual model while cutting total inference cost by up to 70 percent.

They also show, using the E2H-AMC set, that the model-specific difficulty signal diverges from human difficulty and that the divergence grows with longer reasoning chains. That distinction is new and cleanly measured. The code release helps too.

The main uncertainty is whether the probes are learning something that transfers or just MATH-specific patterns in the activations. The abstract does not give the exact train-test split details or any OOD test set, so it is hard to tell how much of the routing win comes from proper separation versus shared problem distribution. If the evaluation queries overlap too much with the probe training data, the efficiency numbers will look better than they really are in deployment.

This paper is for groups already running model pools or working on cheap inference routing. The experimental setup is simple enough that a referee could check the splits and ablations quickly. I would send it out for review rather than desk-reject; the practical angle is worth verifying even if the generalization story needs tightening.

Referee Report

2 major / 1 minor

Summary. The paper claims that linear probes trained on pre-generation activations of LLMs can predict model-specific success on math and coding tasks, substantially outperforming surface baselines such as question length and TF-IDF. Using the E2H-AMC dataset, it shows that models encode a notion of difficulty distinct from human difficulty (and that this distinction grows with extended reasoning). These probes are then leveraged to route queries across a model pool, yielding performance that exceeds the best single model while cutting inference cost by up to 70% on MATH.
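For reference, the two surface baselines named in the summary can be computed from the question text alone. The sketch below is an assumption-laden illustration: the whitespace tokenizer, the smoothed IDF formula, and the example questions are all choices made here, and the page does not say how the paper computed these features.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase whitespace tokenization, chosen for simplicity.
    return text.lower().split()


def question_length(question):
    """Surface feature 1: number of tokens in the question."""
    return len(tokenize(question))


def tfidf_vectors(questions):
    """Surface feature 2: one TF-IDF vector per question over a shared vocabulary."""
    docs = [Counter(tokenize(q)) for q in questions]
    n = len(docs)
    doc_freq = Counter(term for doc in docs for term in doc)
    # Smoothed inverse document frequency (one common variant).
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in doc_freq.items()}
    vocab = sorted(idf)
    vectors = []
    for doc in docs:
        total = sum(doc.values())
        vectors.append([doc[t] / total * idf[t] for t in vocab])
    return vocab, vectors


# Hypothetical example questions, not drawn from MATH or E2H-AMC.
questions = [
    "What is 2 + 2 ?",
    "Find the roots of x^2 - 5x + 6 .",
    "What is the derivative of x^2 ?",
]
vocab, vectors = tfidf_vectors(questions)
lengths = [question_length(q) for q in questions]
```

Features like these see only the wording of the question, so any predictive power a probe shows beyond them has to come from the model's internal state.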

Significance. If the central empirical claims hold under proper validation, the work offers a practical route to efficiency gains in LLM inference via internal-state-based routing rather than surface heuristics. The public code release and the model-vs-human difficulty distinction are clear strengths that would support follow-on research on selective compute allocation.

major comments (2)

[Abstract] Abstract and routing experiment description: the headline claim that routing exceeds the best single model while reducing cost by up to 70% on MATH requires that the linear probes recover a success probability that generalizes to unseen queries. No details are provided on train/test splits, cross-validation protocol, or whether probe training activations were collected only from queries held out from the routing evaluation set; without this separation the reported gains could reflect dataset-specific correlations rather than a pre-generation success signal.
[Probe training and evaluation] Probe validation section: the assertion that the probes capture a 'genuine, generalizable signal' (rather than spurious correlations) is load-bearing for both the difficulty-distinction and routing results. The manuscript should include an explicit OOD test set, an ablation on activation collection timing, and statistical controls confirming that the success labels of evaluation queries were never used in probe fitting.

minor comments (1)

[Abstract] Abstract: specify the exact models in the routing pool and the precise definition of 'extended reasoning' to make the model-specific difficulty claim easier to interpret.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback emphasizing rigorous validation of generalization. We address each major comment below and have revised the manuscript to incorporate the requested details on splits, OOD evaluation, ablations, and controls.

Referee: [Abstract] Abstract and routing experiment description: the headline claim that routing exceeds the best single model while reducing cost by up to 70% on MATH requires that the linear probes recover a success probability that generalizes to unseen queries. No details are provided on train/test splits, cross-validation protocol, or whether probe training activations were collected only from queries held out from the routing evaluation set; without this separation the reported gains could reflect dataset-specific correlations rather than a pre-generation success signal.

Authors: We agree that explicit documentation of the train/test separation is essential to support the generalization claims. The original experiments used a strict 70/30 split on E2H-AMC queries: probes were trained exclusively on the 70% training portion (with 5-fold cross-validation), while routing evaluation and cost measurements were performed only on the disjoint 30% held-out test queries, with no overlap in the success labels or activations used for fitting. We have added a dedicated 'Data Splits and Validation Protocol' subsection in the Methods that specifies the exact ratios, the cross-validation procedure, and confirms that probe training activations were never collected from the routing evaluation set. This revision substantiates that the reported gains reflect a generalizable pre-generation signal rather than dataset-specific correlations.

revision: yes
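The split protocol described in this simulated response (a 70/30 train/test split, with 5-fold cross-validation inside the training portion) could be set up roughly as follows. The function names, the seed, and the placeholder problem ids are hypothetical; this is a sketch of the general procedure, not the authors' code.

```python
import random


def train_test_split_ids(ids, train_frac=0.7, seed=0):
    """Shuffle problem ids once, then cut into disjoint train and test lists."""
    shuffled = list(ids)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(train_frac * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]


def k_folds(train_ids, k=5):
    """Yield (fit_ids, val_ids) pairs; each training id is validated exactly once."""
    folds = [train_ids[i::k] for i in range(k)]
    for i in range(k):
        val_ids = folds[i]
        fit_ids = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield fit_ids, val_ids


# Placeholder ids standing in for benchmark problems.
problem_ids = [f"q{i}" for i in range(100)]
train_ids, test_ids = train_test_split_ids(problem_ids)
folds = list(k_folds(train_ids))

# The check a referee would care about: nothing used for probe fitting
# appears in the set used for routing evaluation.
leaked = set(train_ids) & set(test_ids)
```

Splitting by problem id before any activations are collected keeps the separation auditable, because the held-out list can be published alongside the results.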
Referee: [Probe training and evaluation] Probe validation section: the assertion that the probes capture a 'genuine, generalizable signal' (rather than spurious correlations) is load-bearing for both the difficulty-distinction and routing results. The manuscript should include an explicit OOD test set, an ablation on activation collection timing, and statistical controls confirming that the success labels of evaluation queries were never used in probe fitting.

Authors: We acknowledge that these additional validations strengthen the central claims. We have expanded the 'Probe Validation' section to include: (1) an explicit OOD test set where probes trained on E2H-AMC math problems are evaluated on a held-out coding benchmark (a disjoint subset of problems from LiveCodeBench); (2) ablations on activation collection timing, comparing pre-generation activations at the final token versus intermediate layers and earlier tokens; and (3) statistical controls consisting of label-shuffling experiments (negative controls) that yield near-chance probe performance, confirming no leakage of success labels into fitting. These additions are now reported with quantitative results and have been integrated into the revised manuscript.

revision: yes
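A label-shuffling negative control of the kind mentioned in item (3) can be illustrated on synthetic data. Everything here is an assumption made for the sketch: the activations are random Gaussian vectors, the classifier is a simple difference-of-class-means linear rule used as a cheap stand-in for a trained probe, and the sizes and seed are arbitrary.

```python
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def fit_mean_difference_probe(X, y):
    """Linear rule: weights are the difference of class means, threshold at their midpoint."""
    d = len(X[0])
    pos = [x for x, t in zip(X, y) if t == 1]
    neg = [x for x, t in zip(X, y) if t == 0]
    mu_pos = [sum(x[j] for x in pos) / len(pos) for j in range(d)]
    mu_neg = [sum(x[j] for x in neg) / len(neg) for j in range(d)]
    w = [p - n for p, n in zip(mu_pos, mu_neg)]
    midpoint = [(p + n) / 2 for p, n in zip(mu_pos, mu_neg)]
    return w, -dot(w, midpoint)


def accuracy(w, b, X, y):
    return sum((dot(w, x) + b > 0) == (t == 1) for x, t in zip(X, y)) / len(y)


def make_data(n, d, rng):
    # Synthetic stand-in: success depends on a hidden linear direction plus noise.
    direction = [1.0 if j % 2 == 0 else -1.0 for j in range(d)]
    X = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(n)]
    y = [1 if dot(direction, x) + rng.gauss(0, 0.5) > 0 else 0 for x in X]
    return X, y


rng = random.Random(0)
X_train, y_train = make_data(400, 16, rng)
X_test, y_test = make_data(400, 16, rng)

# Probe fitted on the true labels.
w, b = fit_mean_difference_probe(X_train, y_train)
true_acc = accuracy(w, b, X_test, y_test)

# Negative control: refit on shuffled training labels, score on true test labels.
shuffled_accs = []
for _ in range(20):
    y_shuffled = y_train[:]
    rng.shuffle(y_shuffled)
    w_s, b_s = fit_mean_difference_probe(X_train, y_shuffled)
    shuffled_accs.append(accuracy(w_s, b_s, X_test, y_test))
mean_shuffled_acc = sum(shuffled_accs) / len(shuffled_accs)
```

A large gap between `true_acc` and `mean_shuffled_acc` indicates that the probe depends on the pairing between activations and labels. The control does not by itself rule out distribution-specific shortcuts, which is what the separate OOD test addresses.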

Circularity Check

0 steps flagged

No significant circularity: claims rest on empirical probes trained against external success labels

The paper trains linear probes on pre-generation activations to predict measured success (correctness on math/coding tasks) from independent benchmarks such as MATH and the E2H-AMC dataset. Success labels are observed model outputs, not quantities defined by the probes or any self-citation chain. Routing gains are demonstrated by applying the trained probes to route queries and comparing against single-model baselines on the same external test distribution. No equations or steps reduce by construction to fitted parameters renamed as predictions, and no load-bearing uniqueness theorem or ansatz is imported from prior self-work. The derivation chain is therefore self-contained against external, falsifiable outcomes.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim depends on fitted linear probes and the assumption that activations contain linearly extractable information about future success that is independent of surface features.

free parameters (1)

linear probe weights
Coefficients of the linear classifiers are fitted directly to activation vectors and success labels.

axioms (1)

domain assumption: Success information is linearly separable in the pre-generation activation space
The method assumes a linear probe suffices to recover the signal; non-linear structure would require more complex probes.
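To make the free parameter and the axiom concrete, a probe of this kind can be written out as below. The notation is introduced here and is not taken from the paper: $h(x)$ is the activation vector read before generation for question $x$, $w$ and $b$ are the fitted probe weights and bias, $y_i \in \{0, 1\}$ is the observed success label, and $N$ is the number of training questions. The logistic form and cross-entropy loss are an assumed, common choice; the page only states that the probe is linear.

```latex
\begin{align}
\hat{p}(x) &= \sigma\!\left(w^\top h(x) + b\right),
  \qquad \sigma(z) = \frac{1}{1 + e^{-z}} \\
\mathcal{L}(w, b) &= -\frac{1}{N} \sum_{i=1}^{N}
  \Bigl[\, y_i \log \hat{p}(x_i) + (1 - y_i) \log\bigl(1 - \hat{p}(x_i)\bigr) \Bigr]
\end{align}
```

Under this form the decision boundary $w^\top h(x) + b = 0$ is a hyperplane in activation space, which is what the linear-separability assumption amounts to.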

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · reality_from_one_distinction
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "We train linear probes on pre-generation activations to predict policy-specific success on math and coding tasks... probe-guided routing can match high-compute accuracy at 40% cost reduction"

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "Human and model difficulty are encoded differently in LLMs... model-derived difficulty proves more predictive of actual performance"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Zero-Shot Confidence Estimation for Small LLMs: When Supervised Baselines Aren't Worth Training
cs.AI · 2026-05 · conditional · novelty 6.0

Average token log-probability provides a zero-shot confidence signal for small LLMs that matches supervised baselines in-distribution and outperforms them out-of-distribution, with a new retrieval-conditional variant ...