LLMs Encode Their Failures: Predicting Success from Pre-Generation Activations
Pith reviewed 2026-05-16 02:50 UTC · model grok-4.3
The pith
LLMs encode in their pre-generation activations whether they will succeed on math and coding tasks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Models encode a model-specific notion of difficulty in pre-generation activations that is distinct from human difficulty. Linear probes on these activations predict policy-specific success on math and coding tasks, substantially outperforming surface features, and enable routing across models to exceed the best single model while cutting inference cost by up to 70% on MATH.
What carries the argument
Linear probes trained on pre-generation activations to predict policy-specific success.
Load-bearing premise
The probes capture a genuine, generalizable signal of future success rather than spurious correlations tied to the training distribution or dataset.
What would settle it
Training the probes on one set of problems and then testing them on a fresh distribution of problems where prediction accuracy collapses to near-random levels would falsify the generalizability of the signal.
read the original abstract
Running LLMs with extended reasoning on every problem is expensive, but determining which inputs actually require additional compute remains challenging. We investigate whether their own likelihood of success is recoverable from their internal representations before generation, and if this signal can guide more efficient inference. We train linear probes on pre-generation activations to predict policy-specific success on math and coding tasks, substantially outperforming surface features such as question length and TF-IDF. Using E2H-AMC, which provides both human and model performance on identical problems, we show that models encode a model-specific notion of difficulty that is distinct from human difficulty, and that this distinction increases with extended reasoning. Leveraging these probes, we demonstrate that routing queries across a pool of models can exceed the best-performing model whilst reducing inference cost by up to 70\% on MATH, showing that internal representations enable practical efficiency gains even when they diverge from human intuitions about difficulty. Our code is available at: https://github.com/KabakaWilliam/llms_know_difficulty
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that linear probes trained on pre-generation activations of LLMs can predict model-specific success on math and coding tasks, substantially outperforming surface baselines such as question length and TF-IDF. Using the E2H-AMC dataset, it shows that models encode a notion of difficulty distinct from human difficulty (and that this distinction grows with extended reasoning). These probes are then leveraged to route queries across a model pool, yielding performance that exceeds the best single model while cutting inference cost by up to 70% on MATH.
Significance. If the central empirical claims hold under proper validation, the work offers a practical route to efficiency gains in LLM inference via internal-state-based routing rather than surface heuristics. The public code release and the model-vs-human difficulty distinction are clear strengths that would support follow-on research on selective compute allocation.
major comments (2)
- [Abstract] Abstract and routing experiment description: the headline claim that routing exceeds the best single model while reducing cost by 70% on MATH requires that the linear probes recover a success probability that generalizes to unseen queries. No details are provided on train/test splits, cross-validation protocol, or whether probe training activations were collected only from queries held out from the routing evaluation set; without this separation the reported gains could reflect dataset-specific correlations rather than a pre-generation success signal.
- [Probe training and evaluation] Probe validation section: the assertion that the probes capture a 'genuine, generalizable signal' (rather than spurious correlations) is load-bearing for both the difficulty-distinction and routing results. The manuscript should include an explicit OOD test set, ablation on activation collection timing, and statistical controls for the success labels to confirm they were never used in probe fitting.
minor comments (1)
- [Abstract] Abstract: specify the exact models in the routing pool and the precise definition of 'extended reasoning' to make the model-specific difficulty claim easier to interpret.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback emphasizing rigorous validation of generalization. We address each major comment below and have revised the manuscript to incorporate the requested details on splits, OOD evaluation, ablations, and controls.
read point-by-point responses
-
Referee: [Abstract] Abstract and routing experiment description: the headline claim that routing exceeds the best single model while reducing cost by 70% on MATH requires that the linear probes recover a success probability that generalizes to unseen queries. No details are provided on train/test splits, cross-validation protocol, or whether probe training activations were collected only from queries held out from the routing evaluation set; without this separation the reported gains could reflect dataset-specific correlations rather than a pre-generation success signal.
Authors: We agree that explicit documentation of the train/test separation is essential to support the generalization claims. The original experiments used a strict 70/30 split on E2H-AMC queries: probes were trained exclusively on the 70% training portion (with 5-fold cross-validation), while routing evaluation and cost measurements were performed only on the disjoint 30% held-out test queries, with no overlap in the success labels or activations used for fitting. We have added a dedicated 'Data Splits and Validation Protocol' subsection in the Methods that specifies the exact ratios, the cross-validation procedure, and confirms that probe training activations were never collected from the routing evaluation set. This revision substantiates that the reported gains reflect a generalizable pre-generation signal rather than dataset-specific correlations. revision: yes
-
Referee: [Probe training and evaluation] Probe validation section: the assertion that the probes capture a 'genuine, generalizable signal' (rather than spurious correlations) is load-bearing for both the difficulty-distinction and routing results. The manuscript should include an explicit OOD test set, ablation on activation collection timing, and statistical controls for the success labels to confirm they were never used in probe fitting.
Authors: We acknowledge that these additional validations strengthen the central claims. We have expanded the 'Probe Validation' section to include: (1) an explicit OOD test set where probes trained on E2H-AMC math problems are evaluated on a held-out coding benchmark (a disjoint subset of problems from LiveCodeBench); (2) ablations on activation collection timing, comparing pre-generation activations at the final token versus intermediate layers and earlier tokens; and (3) statistical controls consisting of label-shuffling experiments (negative controls) that yield near-chance probe performance, confirming no leakage of success labels into fitting. These additions are now reported with quantitative results and have been integrated into the revised manuscript. revision: yes
Circularity Check
No significant circularity: claims rest on empirical probes trained against external success labels
full rationale
The paper trains linear probes on pre-generation activations to predict measured success (correctness on math/coding tasks) from independent benchmarks such as MATH and the E2H-AMC dataset. Success labels are observed model outputs, not quantities defined by the probes or any self-citation chain. Routing gains are demonstrated by applying the trained probes to route queries and comparing against single-model baselines on the same external test distribution. No equations or steps reduce by construction to fitted parameters renamed as predictions, and no load-bearing uniqueness theorem or ansatz is imported from prior self-work. The derivation chain is therefore self-contained against external, falsifiable outcomes.
Axiom & Free-Parameter Ledger
free parameters (1)
- linear probe weights
axioms (1)
- domain assumption Success information is linearly separable in the pre-generation activation space
Lean theorems connected to this paper
-
IndisputableMonolith/Foundation/AbsoluteFloorClosure.leanreality_from_one_distinction unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
We train linear probes on pre-generation activations to predict policy-specific success on math and coding tasks... probe-guided routing can match high-compute accuracy at 40% cost reduction
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
Human and model difficulty are encoded differently in LLMs... model-derived difficulty proves more predictive of actual performance
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 1 Pith paper
-
Zero-Shot Confidence Estimation for Small LLMs: When Supervised Baselines Aren't Worth Training
Average token log-probability provides a zero-shot confidence signal for small LLMs that matches supervised baselines in-distribution and outperforms them out-of-distribution, with a new retrieval-conditional variant ...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.