pith. machine review for the scientific record.

arxiv: 2604.06377 · v3 · submitted 2026-04-07 · 💻 cs.LG · cs.AI

Recognition: 2 theorem links

The Master Key Hypothesis: Unlocking Cross-Model Capability Transfer via Linear Subspace Alignment

Anjie Fang, Fardin Abdi, Mohit Bansal, Pin-Jie Lin, Rishab Balasubramanian, Rituraj Sharma, Tu Vu, Viktor Rozgic, Zheng Du


Pith reviewed 2026-05-10 18:32 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords capability transfer · latent subspace · linear alignment · training-free · chain-of-thought · mathematical reasoning · activation contrast · UNLOCK

The pith

The Master Key Hypothesis claims that capabilities are directions in a low-dimensional subspace transferable across models by linear alignment.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper advances the Master Key Hypothesis, claiming that post-trained capabilities like reasoning reside in specific directions within a model's latent space. These directions can be isolated by contrasting activations from models with and without the capability, then linearly aligned and transferred to other models of varying sizes. If accurate, this would mean behaviors can be elicited in target models at inference time without any retraining or labeled data. The authors demonstrate this with UNLOCK on chain-of-thought and mathematical reasoning tasks, achieving notable accuracy gains such as 12.1% on MATH when transferring chain-of-thought from a 14B to a 7B model. The approach amplifies existing latent capabilities rather than creating new ones from scratch.

Core claim

We propose the Master Key Hypothesis, which states that model capabilities correspond to directions in a low-dimensional latent subspace that induce specific behaviors and are transferable across models through linear alignment. Based on this hypothesis, we introduce UNLOCK, a training-free and label-free framework that extracts a capability direction by contrasting activations between capability-present and capability-absent Source variants, aligns it with a Target model through a low-rank linear transformation, and applies it at inference time to elicit the behavior. Experiments show that this leads to substantial gains in reasoning tasks across model scales, with success depending on capabilities learned during pre-training.
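The three-stage recipe in the claim above can be sketched in a few lines of NumPy. This is a hedged illustration, not the authors' implementation: all shapes, the least-squares fit for the alignment map, the rank value k, and the steering strength alpha are assumptions, and the activations here are synthetic stand-ins for real hidden states.

```python
import numpy as np

rng = np.random.default_rng(0)
d_src, d_tgt, n = 64, 48, 512          # hidden sizes and number of prompts (assumed)

# (1) Contrast: mean hidden-state difference between capability-present and
#     capability-absent Source variants gives a candidate "MasterKey" direction.
h_present = rng.normal(size=(n, d_src))
h_absent = rng.normal(size=(n, d_src))
master_key = (h_present - h_absent).mean(axis=0)         # shape (d_src,)

# (2) Alignment: fit a linear map from Source to Target activations on the
#     same prompts (least squares), then truncate it to rank k via SVD.
s_acts = rng.normal(size=(n, d_src))
t_acts = rng.normal(size=(n, d_tgt))
W, *_ = np.linalg.lstsq(s_acts, t_acts, rcond=None)      # (d_src, d_tgt)
U, S, Vt = np.linalg.svd(W, full_matrices=False)
k = 4                                                    # free parameter
W_k = (U[:, :k] * S[:k]) @ Vt[:k]                        # rank-k approximation

# (3) Intervention: project the key into Target space and add it to the
#     residual stream at inference time.
key_tgt = master_key @ W_k                               # shape (d_tgt,)
alpha = 1.0                                              # steering strength (assumed)
resid = rng.normal(size=(1, d_tgt))
steered = resid + alpha * key_tgt
```

In a real setting the arrays would come from cached hidden states of the Source and Target models on shared prompts, and the intervention would be applied at every layer of the Target's residual stream.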

What carries the argument

The capability direction extracted via activation contrasting and aligned through low-rank linear transformation to transfer behaviors across models.

Load-bearing premise

Contrasting activations between variants with and without a capability isolates a pure, transferable direction free from unrelated behaviors or model-specific artifacts.

What would settle it

A test where the extracted direction fails to improve target model performance on the intended task when applied, particularly if the contrast captures scale-specific or artifactual differences instead.

Figures

Figures reproduced from arXiv: 2604.06377 by Anjie Fang, Fardin Abdi, Mohit Bansal, Pin-Jie Lin, Rishab Balasubramanian, Rituraj Sharma, Tu Vu, Viktor Rozgic, Zheng Du.

Figure 1: Performance improvements from UNLOCK when transferring (a) Chain-of-Thought capabilities from Qwen1.5-14B onto Qwen1.5-7B, and (b) math reasoning capabilities from Qwen3-4B-Base onto Qwen3-14B-Base. Capability transfer substantially improves the base model without additional training, approaching the performance of the post-trained model.
Figure 2: Illustration of UNLOCK, which consists of three stages: (1) calculating the difference in hidden states in the Source space; (2) learning a linear transformation between the Source Locked and Target Locked models; and (3) projecting the MasterKey from Source to Target space and applying it as a test-time intervention to the residual stream at every layer.
Figure 3: Average length of generated outputs for each model and dataset. A consistent increase in generation length across all model-dataset pairs supports the view that the performance gains stem from Chain-of-Thought elicitation rather than surface-level output changes.
Figure 4: Statistics of the first generated word. The output distribution is significantly skewed toward a minimal set of starting traces post-steering.
Figure 5: Example prompts. Example CoT (green) and Direct (red) prompts used for all evaluations.
Figure 6: Increased generation length of the Unlocked model. UNLOCK leads to a clear increase in generation length over the base model with Direct prompting, matching the length of the instruction-tuned model with explicit CoT prompts.
Figure 7: Evidence for improved reasoning. Generation length of the Unlocked model significantly increases over the Locked model, with a corresponding improvement in downstream performance.
Figure 8: Spectral entropy of the covariance matrix. Increasing the number of examples leads to a corresponding increase in entropy, providing evidence that the MasterKey captures more information with additional examples.
Figure 9: Convergence of the linear transformation at low ranks. The normalized ℓ2 error of the linear mapping as a function of the number of samples n with rank k = 4 shows the diminishing impact of additional examples in rank-constrained settings.
Figure 10: Overfitting of the linear transformation at high ranks. The normalized ℓ2 error of the linear mapping as a function of rank k with n = 512 samples shows the transformation overfitting at high ranks.
Figure 11: Representation space with Φ = Avg. Performance of OLMo-2-7B + UNLOCK from 1B (top) and from 13B (bottom) with the mean aggregator.
Figure 12: Representation space with Φ = PCA. Performance of OLMo-2-7B + UNLOCK from 1B (top) and from 13B (bottom) with the principal-component aggregator.
Figure 13: Additional statistics of the first generated word. A clear shift in the distribution of the first generated word is observed after applying UNLOCK.
Figure 14: Length to answer (left) and number of repeating substrings (middle and right) for the Qwen3 and Ministral-3 families.
read the original abstract

We investigate whether post-trained capabilities can be transferred across models without retraining, with a focus on transfer across different model scales. We propose the Master Key Hypothesis, which states that model capabilities correspond to directions in a low-dimensional latent subspace that induce specific behaviors and are transferable across models through linear alignment. Based on this hypothesis, we introduce UNLOCK, a training-free and label-free framework that extracts a capability direction by contrasting activations between capability-present and capability-absent Source variants, aligns it with a Target model through a low-rank linear transformation, and applies it at inference time to elicit the behavior. Experiments on reasoning behaviors, including Chain-of-Thought (CoT) and mathematical reasoning, demonstrate substantial improvements across model scales without training. For example, transferring CoT reasoning from Qwen1.5-14B to Qwen1.5-7B yields an accuracy gain of 12.1% on MATH, and transferring a mathematical reasoning direction from Qwen3-4B-Base to Qwen3-14B-Base improves AGIEval Math accuracy from 61.1% to 71.3%, surpassing the 67.8% achieved by the 14B post-trained model. Our analysis shows that the success of transfer depends on the capabilities learned during pre-training, and that our intervention amplifies latent capabilities by sharpening the output distribution toward successful reasoning trajectories.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes the Master Key Hypothesis that model capabilities correspond to directions in a low-dimensional latent subspace, transferable across models via linear alignment. It introduces the training-free UNLOCK framework, which extracts a capability direction by contrasting activations between capability-present and capability-absent source variants, aligns it to a target model using a low-rank linear transformation, and applies the direction at inference to elicit the behavior. Experiments on reasoning tasks report gains such as 12.1% accuracy improvement on MATH when transferring CoT from Qwen1.5-14B to Qwen1.5-7B, and improvement on AGIEval Math from 61.1% to 71.3% when transferring from Qwen3-4B-Base to Qwen3-14B-Base (surpassing the post-trained 14B baseline of 67.8%). The analysis notes that transfer success depends on pre-training capabilities.

Significance. If the central claim holds, the work would demonstrate a practical, training-free approach to cross-scale capability transfer grounded in linear subspace alignment, with potential implications for efficient model adaptation and mechanistic interpretability. The empirical results on external benchmarks (MATH, AGIEval) and the observation that intervention amplifies latent pre-trained capabilities provide concrete evidence worth further investigation; the training-free and label-free design is a notable strength.

major comments (3)
  1. [Abstract / UNLOCK framework] Abstract and methods description of UNLOCK: the core claim that activation contrast between capability-present and capability-absent source variants isolates a pure, transferable capability direction (rather than a mixture of post-training differences, scale artifacts, or unrelated behaviors) is load-bearing for the Master Key Hypothesis but lacks supporting controls or ablations; the reported gains (e.g., +12.1% MATH, +10.2% AGIEval Math) could arise from incidental effects of the contrast without demonstrating subspace alignment.
  2. [Abstract] Abstract: the claim that the method 'surpasses the 67.8% achieved by the 14B post-trained model' and yields substantial improvements across scales is not accompanied by error bars, full baseline details, or exclusion criteria, undermining the ability to assess whether the linear alignment step is responsible for the effect rather than other factors.
  3. [Analysis] Analysis section: while the paper states that success depends on capabilities learned during pre-training, this dependence is not quantified with specific metrics, ablation studies, or comparisons that would test whether the extracted direction is independent of scale-specific representational differences.
minor comments (2)
  1. [Abstract / Experiments] The abstract and experiments section would benefit from explicit reporting of variance across runs and precise definitions of 'capability-present' vs. 'capability-absent' source variants for each transfer pair.
  2. [Methods] Notation for the low-rank linear transformation and activation contrast operation could be clarified with an equation or pseudocode to improve reproducibility.

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. We have carefully reviewed each major comment and provide point-by-point responses below. We outline specific revisions that will be incorporated to address the concerns while preserving the core contributions of the Master Key Hypothesis and UNLOCK framework.

read point-by-point responses
  1. Referee: [Abstract / UNLOCK framework] Abstract and methods description of UNLOCK: the core claim that activation contrast between capability-present and capability-absent source variants isolates a pure, transferable capability direction (rather than a mixture of post-training differences, scale artifacts, or unrelated behaviors) is load-bearing for the Master Key Hypothesis but lacks supporting controls or ablations; the reported gains (e.g., +12.1% MATH, +10.2% AGIEval Math) could arise from incidental effects of the contrast without demonstrating subspace alignment.

    Authors: We agree that additional controls are needed to isolate the contribution of the capability-specific contrast. The manuscript already shows that transfer succeeds only when the target possesses relevant pre-trained capabilities and fails otherwise, providing indirect support for specificity. In the revised version, we will add explicit ablations: (i) contrasts derived from unrelated behaviors, (ii) random activation differences of matched magnitude, and (iii) direct activation transfer without the low-rank alignment step. These will quantify that only the capability contrast produces the reported gains, thereby strengthening the evidence for linear subspace alignment. revision: yes

  2. Referee: [Abstract] Abstract: the claim that the method 'surpasses the 67.8% achieved by the 14B post-trained model' and yields substantial improvements across scales is not accompanied by error bars, full baseline details, or exclusion criteria, undermining the ability to assess whether the linear alignment step is responsible for the effect rather than other factors.

    Authors: We concur that error bars, expanded baselines, and clearer exclusion criteria would improve interpretability. In the revised manuscript, we will report standard deviations across multiple evaluation seeds for the primary results, add baselines including no-alignment activation addition, random-direction interventions, and scale-matched controls, and explicitly state the model-pair selection criteria (pre-training data overlap and capability presence). These additions will allow readers to isolate the role of the linear alignment step. revision: yes

  3. Referee: [Analysis] Analysis section: while the paper states that success depends on capabilities learned during pre-training, this dependence is not quantified with specific metrics, ablation studies, or comparisons that would test whether the extracted direction is independent of scale-specific representational differences.

    Authors: The current analysis demonstrates the dependence qualitatively through successful versus failed transfers on models with differing pre-training exposure. To quantify this, the revised analysis section will include cosine similarity between source and target capability directions as a metric of alignment quality, plus controlled ablations across model families with systematically varied pre-training corpora. These additions will provide measurable evidence that the extracted direction is not reducible to scale-specific artifacts. revision: yes
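The cosine-similarity metric proposed in the last response is easy to make concrete. The sketch below is a hedged illustration under synthetic data: it compares a native target-space direction against a transferred direction (the success case the rebuttal predicts) and against a random-direction control (the failure case the referee worries about). Nothing here is from the paper itself.

```python
import numpy as np

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine similarity between two direction vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

rng = np.random.default_rng(2)
d_tgt = 48  # assumed target hidden size

# Native target-space capability direction, and a transferred direction that
# is the same direction plus moderate noise (the authors' predicted outcome).
native = rng.normal(size=d_tgt)
transferred = native + 0.2 * rng.normal(size=d_tgt)

# Control: an unrelated random direction should score near zero, which is
# what would implicate scale-specific artifacts if the transfer also failed.
random_ctrl = rng.normal(size=d_tgt)

sim_transfer = cosine(native, transferred)
sim_control = cosine(native, random_ctrl)
```

High `sim_transfer` with near-zero `sim_control` would be the measurable signature that the extracted direction survives alignment rather than reducing to an artifact.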

Circularity Check

0 steps flagged

No significant circularity: empirical hypothesis and method with external validation

full rationale

The paper proposes the Master Key Hypothesis as a conceptual statement and defines the UNLOCK framework as a concrete, training-free procedure that extracts directions via activation contrasts between source variants, applies low-rank linear alignment, and evaluates on external benchmarks such as MATH and AGIEval. No equations are presented that derive the hypothesis from itself or rename fitted parameters as predictions. No self-citations are invoked as load-bearing uniqueness theorems or ansatzes. The central steps (contrast extraction and linear transfer) are operational definitions tested against held-out data rather than tautological reductions, making the derivation chain self-contained.

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 1 invented entity

The central claim rests on the domain assumption that capabilities are low-dimensional directions, with the low-rank transformation as a free parameter and the capability direction as an invented entity supported only by the transfer experiments.

free parameters (1)
  • rank of linear transformation
    The dimensionality for the low-rank approximation in alignment is a hyperparameter selected to enable transfer, likely tuned on performance.
axioms (2)
  • domain assumption Model capabilities correspond to directions in a low-dimensional latent subspace.
    This is the core Master Key Hypothesis stated directly in the abstract.
  • domain assumption A low-rank linear transformation can align these directions across models of different scales.
    Invoked to justify the UNLOCK alignment step for transfer.
invented entities (1)
  • capability direction (Master Key) · no independent evidence
    purpose: To represent and enable transfer of specific post-trained behaviors like CoT or math reasoning.
    Postulated in the hypothesis; independent evidence is limited to the reported transfer experiments with no external falsifiable predictions.
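The low-dimensionality axiom above is in principle checkable. A hedged sketch of one such probe, in the spirit of the paper's spectral-entropy analysis (Figure 8): compute the spectral entropy of the covariance of per-example contrast vectors. The data below is synthetic, planted with a known 3-dimensional structure, so the low entropy is by construction; on real activations the result is an empirical question.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, true_rank = 256, 32, 3  # examples, hidden size, planted subspace dim (assumed)

# Synthetic contrast vectors lying in a 3-dimensional subspace plus small noise.
basis = rng.normal(size=(true_rank, d))
deltas = rng.normal(size=(n, true_rank)) @ basis + 0.01 * rng.normal(size=(n, d))

# Spectral entropy of the covariance: low relative to log(d) means variance
# is concentrated on few directions, i.e. an approximately low-dimensional
# capability subspace; near log(d) means no such structure.
cov = np.cov(deltas, rowvar=False)
eig = np.clip(np.linalg.eigvalsh(cov), 1e-12, None)
p = eig / eig.sum()
spectral_entropy = -(p * np.log(p)).sum()
max_entropy = np.log(d)
```

An entropy well below `log(d)` on real contrast vectors would support the domain assumption; entropy near the maximum would undercut it.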

pith-pipeline@v0.9.0 · 5584 in / 1443 out tokens · 45840 ms · 2026-05-10T18:32:46.818427+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Reference graph

Works this paper leans on

135 extracted references · 67 canonical work pages · 25 internal anchors

  1. [1]

    W. U. Ahmad, S. Majumdar, A. Ficek, S. Narenthiran, M. Samadi, J. Huang, S. Jain, V. Noroozi, and B. Ginsburg. Opencodereasoning-ii: A simple test time scaling approach via self-critique,

  2. [2]

    URLhttps://arxiv.org/abs/2507.09075

  3. [3]

    S. N. Akter, S. Prabhumoye, E. Nyberg, M. Patwary, M. Shoeybi, Y. Choi, and B. Catanzaro. Front-loading reasoning: The synergy between pretraining and post-training data, 2025. URL https://arxiv.org/abs/2510.03264

  4. [4]

    J. E. Anton de la Fuente. Thought editing: Steering models by editing their chain of thought, 2026. URL https://www.lesswrong.com/posts/KXR5FNs4hHT5sMRti/ steering-models-by-editing-their-chain-of-thought

  5. [5]

    Arditi, O

    A. Arditi, O. Obeso, A. Syed, D. Paleka, N. Panickssery, W. Gurnee, and N. Nanda. Refusal in language models is mediated by a single direction, 2024. URLhttps://arxiv.org/abs/2406. 11717

  6. [6]

    Activationsteeringforchain-of-thoughtcompression,

    S.Azizi,E.B.Potraghloo,andM.Pedram. Activationsteeringforchain-of-thoughtcompression,

  7. [7]

    URLhttps://arxiv.org/abs/2507.04742

  8. [8]

    J. Bai, S. Bai, Y. Chu, Z. Cui, K. Dang, X. Deng, Y. Fan, W. Ge, Y. Han, F. Huang, et al. Qwen technical report.arXiv preprint arXiv:2309.16609, 2023. URL https://arxiv.org/abs/2309. 16609

  9. [9]

    Bello, A

    F. Bello, A. Das, F. Zeng, F. Yin, and L. Leqi. Linear representation transferability hypothesis: Leveragingsmallmodelstosteerlargemodels,2025. URL https://arxiv.org/abs/2506.00653

  10. [10]

    Berman, A

    N. Berman, A. Hallak, and A. Shocher. Who said neural networks aren’t linear?, 2025. URL https://arxiv.org/abs/2510.08570

  11. [11]

    Buzzega, R

    P. Buzzega, R. Salami, A. Porrello, and S. Calderara. Rethinking layer-wise model merging through chain of merges, 2025. URLhttps://arxiv.org/abs/2508.21421

  12. [12]

    arXiv preprint arXiv:2503.08727 , year=

    L.Caccia,A.Ansell,E.Ponti,I.Vulić,andA.Sordoni. Trainingplug-n-playknowledgemodules with deep context distillation, 2025. URLhttps://arxiv.org/abs/2503.08727

  13. [13]

    H. Chen, C. Vondrick, and C. Mao. Selfie: Self-interpretation of large language model embed- dings, 2024. URLhttps://arxiv.org/abs/2403.10949

  14. [15]

    URLhttps://arxiv.org/abs/2110.14168

  15. [16]

    Csordás, C

    R. Csordás, C. D. Manning, and C. Potts. Do language models use their depth efficiently? InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025. URL https://openreview.net/forum?id=Kz6eUL86XP

  16. [17]

    G. Cui, Y. Zhang, J. Chen, L. Yuan, Z. Wang, Y. Zuo, H. Li, Y. Fan, H. Chen, W. Chen, Z. Liu, H. Peng, L. Bai, W. Ouyang, Y. Cheng, B. Zhou, and N. Ding. The entropy mechanism of reinforcement learning for reasoning language models, 2025. URLhttps://arxiv.org/abs/ 2505.22617

  17. [18]

    Nudging: Inference-timealignmentofllmsviaguideddecoding,

    Y.Fei,Y.Razeghi,andS.Singh. Nudging: Inference-timealignmentofllmsviaguideddecoding,

  18. [19]

    URLhttps://arxiv.org/abs/2410.09300

  19. [20]

    Ghandeharioun, A

    A. Ghandeharioun, A. Caciularu, A. Pearce, L. Dixon, and M. Geva. Patchscopes: A unifying framework for inspecting hidden representations of language models, 2024. URLhttps:// arxiv.org/abs/2401.06102

  20. [21]

    Y. Gu, L. Dong, F. Wei, and M. Huang. Minillm: Knowledge distillation of large language models, 2025. URLhttps://arxiv.org/abs/2306.08543

  21. [22]

    2024 , journal =

    W. Gurnee and M. Tegmark. Language models represent space and time, 2024. URLhttps: //arxiv.org/abs/2310.02207. 16

  22. [23]

    C. He, R. Luo, Y. Bai, S. Hu, Z. L. Thai, J. Shen, J. Hu, X. Han, Y. Huang, Y. Zhang, J. Liu, L. Qi, Z. Liu, and M. Sun. Olympiadbench: A challenging benchmark for promoting agi with olympiad-level bilingual multimodal scientific problems, 2024. URLhttps://arxiv.org/abs/ 2402.14008

  23. [25]

    URLhttps://arxiv.org/abs/2103.03874

  24. [26]

    Thereasoning-memorizationinterplayinlanguage models is mediated by a single direction, 2025

    Y.Hong,D.Zhou,M.Cao,L.Yu,andZ.Jin. Thereasoning-memorizationinterplayinlanguage models is mediated by a single direction, 2025. URLhttps://arxiv.org/abs/2503.23084

  25. [27]

    Huang, P.-Z

    S.-C. Huang, P.-Z. Li, Y.-C. Hsu, K.-M. Chen, Y. T. Lin, S.-K. Hsiao, R. T.-H. Tsai, and H. yi Lee. Chat vector: A simple approach to equip llms with instruction following and model alignment in new languages, 2024. URLhttps://arxiv.org/abs/2310.04799

  26. [28]

    Huang, C

    Y. Huang, C. Huang, D. Feng, W. Lei, and J. Lv. Cross-model transferability among large language models on the platonic representations of concepts, 2025. URLhttps://arxiv.org/ abs/2501.02009

  27. [29]

    The platonic representation hypothesis.arXiv preprint arXiv:2405.07987, 2024

    M.Huh,B.Cheung,T.Wang,andP.Isola. Theplatonicrepresentationhypothesis.arXivpreprint arXiv:2405.07987, 2024. URLhttps://arxiv.org/abs/2405.07987

  28. [30]

    Reinforcement Learning via Self-Distillation

    J. Hübotter, F. Lübeck, L. Behric, A. Baumann, M. Bagatella, D. Marta, I. Hakimi, I. Shenfeld, T. K. Buening, C. Guestrin, and A. Krause. Reinforcement learning via self-distillation, 2026. URLhttps://arxiv.org/abs/2601.20802

  29. [31]

    Editing Models with Task Arithmetic

    G.Ilharco,M.T.Ribeiro,M.Wortsman,S.Gururangan,L.Schmidt,H.Hajishirzi,andA.Farhadi. Editing models with task arithmetic, 2023. URLhttps://arxiv.org/abs/2212.04089

  30. [32]

    arXiv preprint arXiv:2512.05117 (2025)

    P.Kaushik,S.Chaudhari,A.Vaidya,R.Chellappa,andA.Yuille. Theuniversalweightsubspace hypothesis, 2025. URLhttps://arxiv.org/abs/2512.05117

  31. [33]

    Konen, S

    K. Konen, S. Jentzsch, D. Diallo, P. Schütt, O. Bensch, R. E. Baff, D. Opitz, and T. Hecking. Style vectors for steering generative large language model, 2024. URLhttps://arxiv.org/abs/2402. 01618

  32. [34]

    Solving Quantitative Reasoning Problems with Language Models

    A. Lewkowycz, A. Andreassen, D. Dohan, E. Dyer, H. Michalewski, V. Ramasesh, A. Slone, C. Anil, I. Schlag, T. Gutman-Solo, Y. Wu, B. Neyshabur, G. Gur-Ari, and V. Misra. Solving quantitative reasoning problems with language models, 2022. URLhttps://arxiv.org/abs/ 2206.14858

  33. [35]

    Li, J.-C

    K. Li, J.-C. Pang, and Y. Yu. Rlvr training of llms does not improve thinking ability for general qa: Evaluation method and a simple solution, 2026. URLhttps://arxiv.org/abs/2603.20799

  34. [36]

    arXiv preprint arXiv:2401.08190

    M.Liao,W.Luo,C.Li,J.Wu,andK.Fan. Mario: Mathreasoningwithcodeinterpreteroutput–a reproducible pipeline.arXiv preprint arXiv:2401.08190, 2024

  35. [37]

    A. Liu, X. Han, Y. Wang, Y. Tsvetkov, Y. Choi, and N. A. Smith. Tuning language models by proxy, 2024. URLhttps://arxiv.org/abs/2401.08565

  36. [38]

    A. H. Liu, K. Khandelwal, S. Subramanian, V. Jouault, A. Rastogi, A. Sadé, A. Jeffares, A. Jiang, A. Cahill, A. Gavaudan, A. Sablayrolles, A. Héliou, A. You, A. Ehrenberg, A. Lo, A. Eliseev, A. Calvi, A. Sooriyarachchi, B. Bout, B. Rozière, B. D. Monicault, C. Lanfranchi, C. Barreau, C. Courtot, D. Grattarola, D. Dabert, D. de las Casas, E. Chane-Sane, F....

  37. [39]

    Midtrainingbridgespretrainingandposttrainingdistributions,

    E.Liu,G.Neubig,andC.Xiong. Midtrainingbridgespretrainingandposttrainingdistributions,

  38. [40]

    URLhttps://arxiv.org/abs/2510.14865

  39. [41]

    S. Liu, H. Ye, L. Xing, and J. Zou. In-context vectors: Making in context learning more effective andcontrollablethroughlatentspacesteering,2024. URL https://arxiv.org/abs/2311.06668

  40. [42]

    S.-Y. Liu, X. Dong, X. Lu, S. Diao, M. Liu, M.-H. Chen, H. Yin, Y.-C. F. Wang, K.-T. Cheng, Y. Choi, et al. Dler: Doing length penalty right-incentivizing more intelligence per token via reinforcement learning.arXiv preprint arXiv:2510.15110, 2025

  41. [43]

    Efficient Estimation of Word Representations in Vector Space

    T. Mikolov, K. Chen, G. Corrado, and J. Dean. Efficient estimation of word representations in vector space, 2013. URLhttps://arxiv.org/abs/1301.3781

  42. [44]

    InProceedings of the 62nd annual meet- ing of the association for computational linguistics (volume 1: Long papers), pages 15789–15809

    D. Nguyen, A. Prasad, E. Stengel-Eskin, and M. Bansal. Grains: Gradient-based attribution for inference-time steering of llms and vlms.arXiv preprint arXiv:2507.18043, 2025. URL https://arxiv.org/abs/2507.18043

  43. [45]

    T. OLMo, P. Walsh, L. Soldaini, D. Groeneveld, K. Lo, S. Arora, A. Bhagia, Y. Gu, S. Huang, M.Jordan,N.Lambert,D.Schwenk,O.Tafjord,T.Anderson,D.Atkinson,F.Brahman,C.Clark, P. Dasigi, N. Dziri, A. Ettinger, M. Guerquin, D. Heineman, H. Ivison, P. W. Koh, J. Liu, S. Malik, W.Merrill,L.J.V.Miranda,J.Morrison,T.Murray,C.Nam,J.Poznanski,V.Pyatkin,A.Rangapur, M...

  44. [46]

    Oozeer, D

    N. Oozeer, D. Nathawani, N. Prakash, M. Lan, A. Harrasse, and A. Abdullah. Activation space interventions can be transferred between large language models, 2025. URLhttps: //arxiv.org/abs/2503.04429

  45. [47]

    RAST: reasoning activation in llms via small-model transfer

    S. Ouyang, X. Zhu, Z. Xiao, M. Jiang, Y. Meng, and J. Han. Rast: Reasoning activation in llms via small-model transfer, 2025. URLhttps://arxiv.org/abs/2506.15710

  46. [48]

    Steering Llama 2 via Contrastive Activation Addition

    N. Panickssery, N. Gabrieli, J. Schulz, M. Tong, E. Hubinger, and A. M. Turner. Steering llama 2 via contrastive activation addition, 2024. URLhttps://arxiv.org/abs/2312.06681

  47. [49]

    K. Park, Y. J. Choe, and V. Veitch. The linear representation hypothesis and the geometry of large language models, 2024. URLhttps://arxiv.org/abs/2311.03658

  48. [50]

    Are NLP Models really able to Solve Simple Math Word Problems?

    A. Patel, S. Bhattamishra, and N. Goyal. Are NLP models really able to solve simple math word problems? In K. Toutanova, A. Rumshisky, L. Zettlemoyer, D. Hakkani-Tur, I. Beltagy, S. Bethard, R. Cotterell, T. Chakraborty, and Y. Zhou, editors,Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: ...

  49. [51]

    Y. Qin, Y. Lin, J. Yi, J. Zhang, X. Han, Z. Zhang, Y. Su, Z. Liu, P. Li, M. Sun, and J. Zhou. Knowledge inheritance for pre-trained language models, 2022. URLhttps://arxiv.org/abs/ 2105.13880

  50. [52]

    Qwen, :, A. Yang, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Li, D. Liu, F. Huang, H. Wei, H. Lin, J. Yang, J. Tu, J. Zhang, J. Yang, J. Yang, J. Zhou, J. Lin, K. Dang, K. Lu, K. Bao, K. Yang, L. Yu, M. Li, M. Xue, P. Zhang, Q. Zhu, R. Men, R. Lin, T. Li, T. Tang, T. Xia, X. Ren, X. Ren, Y. Fan, Y. Su, Y. Zhang, Y. Wan, Y. Liu, Z. Cui, Z. Zhang, and Z...

  51. [54]

    Analysing mathematical reasoning abilities of neural models

    D. Saxton, E. Grefenstette, F. Hill, and P. Kohli. Analysing mathematical reasoning abilities of neural models, 2019. URLhttps://arxiv.org/abs/1904.01557

  52. [55]

    Z. Shen, H. Yan, L. Zhang, Z. Hu, Y. Du, and Y. He. Codi: Compressing chain-of-thought into continuous space via self-distillation, 2025. URLhttps://arxiv.org/abs/2502.21074. 18

  53. [56]

    Self-distillationenablescontinuallearning,

    I.Shenfeld, M.Damani, J.Hübotter, andP.Agrawal. Self-distillationenablescontinuallearning,

  54. [57]

    URLhttps://arxiv.org/abs/2601.19897

  55. [58]

    Layer by layer: Uncovering hidden representations in language models.arXiv preprint arXiv:2502.02013, 2025

    O. Skean, M. R. Arefin, D. Zhao, N. Patel, J. Naghiyev, Y. LeCun, and R. Shwartz-Ziv. Layer by layer: Uncovering hidden representations in language models, 2025. URLhttps://arxiv.org/ abs/2502.02013

  56. [59]

    Stoehr, K

    N. Stoehr, K. Du, V. Snæbjarnarson, R. West, R. Cotterell, and A. Schein. Activation scaling for steering and interpreting language models, 2024. URLhttps://arxiv.org/abs/2410.04962

  57. [60]

    Improvinginstruction-following in language models through activation steering, 2025

    A.Stolfo,V.Balachandran,S.Yousefi,E.Horvitz,andB.Nushi. Improvinginstruction-following in language models through activation steering, 2025. URLhttps://arxiv.org/abs/2410. 12877

  58. [61]

    D. Tan, D. Chanin, A. Lynch, D. Kanoulas, B. Paige, A. Garriga-Alonso, and R. Kirk. Analyzing the generalization and reliability of steering vectors, 2025. URLhttps://arxiv.org/abs/2407. 12404

  59. [62]

    G. Team, M. Riviere, S. Pathak, P. G. Sessa, C. Hardin, S. Bhupatiraju, L. Hussenot, T. Mesnard, B. Shahriari, A. Ramé, et al. Gemma 2: Improving open language models at a practical size. arXiv preprint arXiv:2408.00118, 2024. URLhttps://arxiv.org/abs/2408.00118

  60. [63]

    G. Team, A. Kamath, J. Ferret, S. Pathak, N. Vieillard, R. Merhej, S. Perrin, T. Matejovicova, A. Ramé, M. Rivière, L. Rouillard, T. Mesnard, G. Cideron, J.-B. Grill, S. Ramos, E. Yvinec, M. Casbon, E. Pot, I. Penchev, G. Liu, F. Visin, K. Kenealy, L. Beyer, X. Zhai, A. Tsitsulin, R. Busa-Fekete, A. Feng, N. Sachdeva, B. Coleman, Y. Gao, B. Mustafa, ...

  61. [65]

    E. Todd, M. L. Li, A. S. Sharma, A. Mueller, B. C. Wallace, and D. Bau. Function vectors in large language models, 2024. URL https://arxiv.org/abs/2310.15213

  62. [66]

    A. M. Turner, L. Thiergart, G. Leech, D. Udell, J. J. Vazquez, U. Mini, and M. MacDiarmid. Steering language models with activation engineering, 2024. URL https://arxiv.org/abs/2308.10248

  63. [67]

    T. van der Weij, M. Poesio, and N. Schoots. Extending activation steering to broad skills and multiple behaviours, 2024. URL https://arxiv.org/abs/2403.05767

  64. [68]

    C. Venhoff, I. Arcuschin, P. Torr, A. Conmy, and N. Nanda. Base models know how to reason, thinking models learn when, 2025. URL https://arxiv.org/abs/2510.07364

  65. [69]

    C. Venhoff, I. Arcuschin, P. Torr, A. Conmy, and N. Nanda. Understanding reasoning in thinking language models via steering vectors, 2025. URL https://arxiv.org/abs/2506.18167

  66. [70]

    E. P. Walsh, L. Soldaini, D. Groeneveld, K. Lo, S. Arora, A. Bhagia, Y. Gu, S. Huang, M. Jordan, N. Lambert, D. Schwenk, O. Tafjord, T. Anderson, D. Atkinson, F. Brahman, C. Clark, P. Dasigi, N. Dziri, A. Ettinger, M. Guerquin, D. Heineman, H. Ivison, P. W. Koh, J. Liu, S. Malik, W. Merrill, L. J. V. Miranda, J. Morrison, T. Murray, C. Nam, J. Poznanski...

  67. [71]

    F. Wan, L. Zhong, Z. Yang, R. Chen, and X. Quan. Fusechat: Knowledge fusion of chat models, 2024. URL https://arxiv.org/abs/2408.07990

  69. [73]

    B. Wang, C. Lee, N. Lee, S.-C. Lin, W. Dai, Y. Chen, Y. Chen, Z. Yang, Z. Liu, M. Shoeybi, B. Catanzaro, and W. Ping. Nemotron-cascade: Scaling cascaded reinforcement learning for general-purpose reasoning models, 2025. URL https://arxiv.org/abs/2512.13607

  70. [74]

    J. Wang, Y. Chen, Z. Li, and C. Huang. Lightreasoner: Can small language models teach large language models reasoning?, 2025. URL https://arxiv.org/abs/2510.07962

  71. [75]

    S. Wang, L. Yu, C. Gao, C. Zheng, S. Liu, R. Lu, K. Dang, X. Chen, J. Yang, Z. Zhang, Y. Liu, A. Yang, A. Zhao, Y. Yue, S. Song, B. Yu, G. Huang, and J. Lin. Beyond the 80/20 rule: High-entropy minority tokens drive effective reinforcement learning for llm reasoning, 2025. URL https://arxiv.org/abs/2506.01939

  72. [76]

    Z. Wang, F. Zhou, X. Li, and P. Liu. Octothinker: Mid-training incentivizes reinforcement learning scaling, 2025. URL https://arxiv.org/abs/2506.20512

  73. [77]

    J. Wei, X. Wang, D. Schuurmans, M. Bosma, B. Ichter, F. Xia, E. Chi, Q. Le, and D. Zhou. Chain-of-thought prompting elicits reasoning in large language models, 2023. URL https://arxiv.org/abs/2201.11903

  74. [78]

    T. Wu, R. Yang, J. Li, P. Hu, Y.-C. Wu, N. Wong, and Y. Yang. Shadow-ft: Tuning instruct model via training on paired base model, 2025. URL https://arxiv.org/abs/2505.12716

  75. [79]

    A. Yang, A. Li, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Gao, C. Huang, C. Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025. URL https://arxiv.org/abs/2505.09388

  76. [80]

    A. Yang, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Li, D. Liu, F. Huang, H. Wei, et al. Qwen2.5 technical report. arXiv preprint arXiv:2412.15115, 2025. URL https://arxiv.org/abs/2412.15115

  77. [81]

    Y. Yue, Z. Chen, R. Lu, A. Zhao, Z. Wang, Y. Yue, S. Song, and G. Huang. Does reinforcement learning really incentivize reasoning capacity in llms beyond the base model?, 2025. URL https://arxiv.org/abs/2504.13837

  78. [82]

    M. Zbeeb, H. A. A. K. Hammoud, and B. Ghanem. Reasoning vectors: Transferring chain-of-thought capabilities via task arithmetic, 2025. URL https://arxiv.org/abs/2509.01363

  79. [83]

    W. Zhong, R. Cui, Y. Guo, Y. Liang, S. Lu, Y. Wang, A. Saied, W. Chen, and N. Duan. Agieval: A human-centric benchmark for evaluating foundation models, 2023. URL https://arxiv.org/abs/2304.06364

  80. [84]

    Z. Zhong and A. Raghunathan. Watch the weights: Unsupervised monitoring and control of fine-tuned llms, 2025. URL https://arxiv.org/abs/2508.00161

Showing first 80 references.