pith. machine review for the scientific record.

arxiv: 2604.06377 · v3 · submitted 2026-04-07 · 💻 cs.LG · cs.AI

Recognition: 2 theorem links

The Master Key Hypothesis: Unlocking Cross-Model Capability Transfer via Linear Subspace Alignment

Anjie Fang, Fardin Abdi, Mohit Bansal, Pin-Jie Lin, Rishab Balasubramanian, Rituraj Sharma, Tu Vu, Viktor Rozgic, Zheng Du


Pith reviewed 2026-05-10 18:32 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords capability transfer · latent subspace · linear alignment · training-free · chain-of-thought · mathematical reasoning · activation contrast · UNLOCK

The pith

The Master Key Hypothesis claims that capabilities are directions in a low-dimensional subspace transferable across models by linear alignment.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper advances the Master Key Hypothesis, claiming that post-trained capabilities like reasoning reside in specific directions within a model's latent space. These directions can be isolated by contrasting activations from models with and without the capability, then linearly aligned and transferred to other models of varying sizes. If accurate, this would mean behaviors can be elicited in target models at inference time without any retraining or labeled data. The authors demonstrate this with UNLOCK on chain-of-thought and mathematical reasoning tasks, achieving notable accuracy gains such as 12.1% on MATH when transferring chain-of-thought from a 14B to a 7B model. The approach amplifies existing latent capabilities rather than creating new ones from scratch.

Core claim

We propose the Master Key Hypothesis, which states that model capabilities correspond to directions in a low-dimensional latent subspace that induce specific behaviors and are transferable across models through linear alignment. Based on this hypothesis, we introduce UNLOCK, a training-free and label-free framework that extracts a capability direction by contrasting activations between capability-present and capability-absent Source variants, aligns it with a Target model through a low-rank linear transformation, and applies it at inference time to elicit the behavior. Experiments show that this leads to substantial gains in reasoning tasks across model scales, with success depending on capabilities learned during pre-training.
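The three-stage recipe in the claim above can be sketched in a few lines of NumPy. This is a hedged illustration, not the authors' implementation: all shapes, the least-squares fit for the alignment map, the rank value k, and the steering strength alpha are assumptions, and the activations here are synthetic stand-ins for real hidden states.

```python
import numpy as np

rng = np.random.default_rng(0)
d_src, d_tgt, n = 64, 48, 512          # hidden sizes and number of prompts (assumed)

# (1) Contrast: mean hidden-state difference between capability-present and
#     capability-absent Source variants gives a candidate "MasterKey" direction.
h_present = rng.normal(size=(n, d_src))
h_absent = rng.normal(size=(n, d_src))
master_key = (h_present - h_absent).mean(axis=0)         # shape (d_src,)

# (2) Alignment: fit a linear map from Source to Target activations on the
#     same prompts (least squares), then truncate it to rank k via SVD.
s_acts = rng.normal(size=(n, d_src))
t_acts = rng.normal(size=(n, d_tgt))
W, *_ = np.linalg.lstsq(s_acts, t_acts, rcond=None)      # (d_src, d_tgt)
U, S, Vt = np.linalg.svd(W, full_matrices=False)
k = 4                                                    # free parameter
W_k = (U[:, :k] * S[:k]) @ Vt[:k]                        # rank-k approximation

# (3) Intervention: project the key into Target space and add it to the
#     residual stream at inference time.
key_tgt = master_key @ W_k                               # shape (d_tgt,)
alpha = 1.0                                              # steering strength (assumed)
resid = rng.normal(size=(1, d_tgt))
steered = resid + alpha * key_tgt
```

In a real setting the arrays would come from cached hidden states of the Source and Target models on shared prompts, and the intervention would be applied at every layer of the Target's residual stream.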

What carries the argument

The capability direction extracted via activation contrasting and aligned through low-rank linear transformation to transfer behaviors across models.

Load-bearing premise

Contrasting activations between variants with and without a capability isolates a pure, transferable direction free from unrelated behaviors or model-specific artifacts.

What would settle it

A test where the extracted direction fails to improve target model performance on the intended task when applied, particularly if the contrast captures scale-specific or artifactual differences instead.

Figures

Figures reproduced from arXiv: 2604.06377 by Anjie Fang, Fardin Abdi, Mohit Bansal, Pin-Jie Lin, Rishab Balasubramanian, Rituraj Sharma, Tu Vu, Viktor Rozgic, Zheng Du.

Figure 1: Performance improvements from UNLOCK when transferring (a) Chain-of-Thought capabilities from Qwen1.5-14B onto Qwen1.5-7B, and (b) math reasoning capabilities from Qwen3-4B-Base onto Qwen3-14B-Base. Capability transfer substantially improves the base model without additional training, approaching the performance of the post-trained model.
Figure 2: Illustration of UNLOCK, which consists of three stages: (1) calculating the difference in hidden states in the Source space; (2) learning a linear transformation between the Source Locked and Target Locked models; and (3) projecting the MasterKey from Source to Target space and applying it as a test-time intervention to the residual stream at every layer.
Figure 3: Average length of generated outputs for each model and dataset. A consistent increase in generation length across all model-dataset pairs supports the view that the performance gains stem from Chain-of-Thought elicitation rather than surface-level output changes.
Figure 4: Statistics of the first generated word. The output distribution is significantly skewed toward a minimal set of starting traces post-steering.
Figure 5: Example prompts. Example CoT (green) and Direct (red) prompts used for all evaluations.
Figure 6: Increased generation length of the Unlocked model. UNLOCK leads to a clear increase in generation length over the base model with Direct prompting, matching the length of the instruction-tuned model with explicit CoT prompts.
Figure 7: Evidence for improved reasoning. Generation length of the Unlocked model significantly increases over the Locked model, with a corresponding improvement in downstream performance.
Figure 8: Spectral entropy of the covariance matrix. Increasing the number of examples leads to a corresponding increase in entropy, providing evidence that the MasterKey captures more information with additional examples.
Figure 9: Convergence of the linear transformation at low ranks. The normalized ℓ2 error of the linear mapping as a function of the number of samples n with rank k = 4 shows the diminishing impact of additional examples in rank-constrained settings.
Figure 10: Overfitting of the linear transformation at high ranks. The normalized ℓ2 error of the linear mapping as a function of rank k with n = 512 samples shows the transformation overfitting at high ranks.
Figure 11: Representation space with Φ = Avg. Performance of OLMo-2-7B + UNLOCK from 1B (top) and from 13B (bottom) with the mean aggregator.
Figure 12: Representation space with Φ = PCA. Performance of OLMo-2-7B + UNLOCK from 1B (top) and from 13B (bottom) with the principal-component aggregator.
Figure 13: Additional statistics of the first generated word. A clear shift in the distribution of the first generated word is observed after applying UNLOCK.
Figure 14: Length to answer (left) and number of repeating substrings (middle and right) for the Qwen3 and Ministral-3 families.
read the original abstract

We investigate whether post-trained capabilities can be transferred across models without retraining, with a focus on transfer across different model scales. We propose the Master Key Hypothesis, which states that model capabilities correspond to directions in a low-dimensional latent subspace that induce specific behaviors and are transferable across models through linear alignment. Based on this hypothesis, we introduce UNLOCK, a training-free and label-free framework that extracts a capability direction by contrasting activations between capability-present and capability-absent Source variants, aligns it with a Target model through a low-rank linear transformation, and applies it at inference time to elicit the behavior. Experiments on reasoning behaviors, including Chain-of-Thought (CoT) and mathematical reasoning, demonstrate substantial improvements across model scales without training. For example, transferring CoT reasoning from Qwen1.5-14B to Qwen1.5-7B yields an accuracy gain of 12.1% on MATH, and transferring a mathematical reasoning direction from Qwen3-4B-Base to Qwen3-14B-Base improves AGIEval Math accuracy from 61.1% to 71.3%, surpassing the 67.8% achieved by the 14B post-trained model. Our analysis shows that the success of transfer depends on the capabilities learned during pre-training, and that our intervention amplifies latent capabilities by sharpening the output distribution toward successful reasoning trajectories.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes the Master Key Hypothesis that model capabilities correspond to directions in a low-dimensional latent subspace, transferable across models via linear alignment. It introduces the training-free UNLOCK framework, which extracts a capability direction by contrasting activations between capability-present and capability-absent source variants, aligns it to a target model using a low-rank linear transformation, and applies the direction at inference to elicit the behavior. Experiments on reasoning tasks report gains such as 12.1% accuracy improvement on MATH when transferring CoT from Qwen1.5-14B to Qwen1.5-7B, and improvement on AGIEval Math from 61.1% to 71.3% when transferring from Qwen3-4B-Base to Qwen3-14B-Base (surpassing the post-trained 14B baseline of 67.8%). The analysis notes that transfer success depends on pre-training capabilities.

Significance. If the central claim holds, the work would demonstrate a practical, training-free approach to cross-scale capability transfer grounded in linear subspace alignment, with potential implications for efficient model adaptation and mechanistic interpretability. The empirical results on external benchmarks (MATH, AGIEval) and the observation that intervention amplifies latent pre-trained capabilities provide concrete evidence worth further investigation; the training-free and label-free design is a notable strength.

major comments (3)
  1. [Abstract / UNLOCK framework] Abstract and methods description of UNLOCK: the core claim that activation contrast between capability-present and capability-absent source variants isolates a pure, transferable capability direction (rather than a mixture of post-training differences, scale artifacts, or unrelated behaviors) is load-bearing for the Master Key Hypothesis but lacks supporting controls or ablations; the reported gains (e.g., +12.1% MATH, +10.2% AGIEval Math) could arise from incidental effects of the contrast without demonstrating subspace alignment.
  2. [Abstract] Abstract: the claim that the method 'surpasses the 67.8% achieved by the 14B post-trained model' and yields substantial improvements across scales is not accompanied by error bars, full baseline details, or exclusion criteria, undermining the ability to assess whether the linear alignment step is responsible for the effect rather than other factors.
  3. [Analysis] Analysis section: while the paper states that success depends on capabilities learned during pre-training, this dependence is not quantified with specific metrics, ablation studies, or comparisons that would test whether the extracted direction is independent of scale-specific representational differences.
minor comments (2)
  1. [Abstract / Experiments] The abstract and experiments section would benefit from explicit reporting of variance across runs and precise definitions of 'capability-present' vs. 'capability-absent' source variants for each transfer pair.
  2. [Methods] Notation for the low-rank linear transformation and activation contrast operation could be clarified with an equation or pseudocode to improve reproducibility.

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. We have carefully reviewed each major comment and provide point-by-point responses below. We outline specific revisions that will be incorporated to address the concerns while preserving the core contributions of the Master Key Hypothesis and UNLOCK framework.

read point-by-point responses
  1. Referee: [Abstract / UNLOCK framework] Abstract and methods description of UNLOCK: the core claim that activation contrast between capability-present and capability-absent source variants isolates a pure, transferable capability direction (rather than a mixture of post-training differences, scale artifacts, or unrelated behaviors) is load-bearing for the Master Key Hypothesis but lacks supporting controls or ablations; the reported gains (e.g., +12.1% MATH, +10.2% AGIEval Math) could arise from incidental effects of the contrast without demonstrating subspace alignment.

    Authors: We agree that additional controls are needed to isolate the contribution of the capability-specific contrast. The manuscript already shows that transfer succeeds only when the target possesses relevant pre-trained capabilities and fails otherwise, providing indirect support for specificity. In the revised version, we will add explicit ablations: (i) contrasts derived from unrelated behaviors, (ii) random activation differences of matched magnitude, and (iii) direct activation transfer without the low-rank alignment step. These will quantify that only the capability contrast produces the reported gains, thereby strengthening the evidence for linear subspace alignment. revision: yes

  2. Referee: [Abstract] Abstract: the claim that the method 'surpasses the 67.8% achieved by the 14B post-trained model' and yields substantial improvements across scales is not accompanied by error bars, full baseline details, or exclusion criteria, undermining the ability to assess whether the linear alignment step is responsible for the effect rather than other factors.

    Authors: We concur that error bars, expanded baselines, and clearer exclusion criteria would improve interpretability. In the revised manuscript, we will report standard deviations across multiple evaluation seeds for the primary results, add baselines including no-alignment activation addition, random-direction interventions, and scale-matched controls, and explicitly state the model-pair selection criteria (pre-training data overlap and capability presence). These additions will allow readers to isolate the role of the linear alignment step. revision: yes

  3. Referee: [Analysis] Analysis section: while the paper states that success depends on capabilities learned during pre-training, this dependence is not quantified with specific metrics, ablation studies, or comparisons that would test whether the extracted direction is independent of scale-specific representational differences.

    Authors: The current analysis demonstrates the dependence qualitatively through successful versus failed transfers on models with differing pre-training exposure. To quantify this, the revised analysis section will include cosine similarity between source and target capability directions as a metric of alignment quality, plus controlled ablations across model families with systematically varied pre-training corpora. These additions will provide measurable evidence that the extracted direction is not reducible to scale-specific artifacts. revision: yes
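The cosine-similarity metric proposed in the last response is easy to make concrete. The sketch below is a hedged illustration under synthetic data: it compares a native target-space direction against a transferred direction (the success case the rebuttal predicts) and against a random-direction control (the failure case the referee worries about). Nothing here is from the paper itself.

```python
import numpy as np

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine similarity between two direction vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

rng = np.random.default_rng(2)
d_tgt = 48  # assumed target hidden size

# Native target-space capability direction, and a transferred direction that
# is the same direction plus moderate noise (the authors' predicted outcome).
native = rng.normal(size=d_tgt)
transferred = native + 0.2 * rng.normal(size=d_tgt)

# Control: an unrelated random direction should score near zero, which is
# what would implicate scale-specific artifacts if the transfer also failed.
random_ctrl = rng.normal(size=d_tgt)

sim_transfer = cosine(native, transferred)
sim_control = cosine(native, random_ctrl)
```

High `sim_transfer` with near-zero `sim_control` would be the measurable signature that the extracted direction survives alignment rather than reducing to an artifact.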

Circularity Check

0 steps flagged

No significant circularity: empirical hypothesis and method with external validation

full rationale

The paper proposes the Master Key Hypothesis as a conceptual statement and defines the UNLOCK framework as a concrete, training-free procedure that extracts directions via activation contrasts between source variants, applies low-rank linear alignment, and evaluates on external benchmarks such as MATH and AGIEval. No equations are presented that derive the hypothesis from itself or rename fitted parameters as predictions. No self-citations are invoked as load-bearing uniqueness theorems or ansatzes. The central steps (contrast extraction and linear transfer) are operational definitions tested against held-out data rather than tautological reductions, making the derivation chain self-contained.

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 1 invented entity

The central claim rests on the domain assumption that capabilities are low-dimensional directions, with the low-rank transformation as a free parameter and the capability direction as an invented entity supported only by the transfer experiments.

free parameters (1)
  • rank of linear transformation
    The dimensionality for the low-rank approximation in alignment is a hyperparameter selected to enable transfer, likely tuned on performance.
axioms (2)
  • domain assumption Model capabilities correspond to directions in a low-dimensional latent subspace.
    This is the core Master Key Hypothesis stated directly in the abstract.
  • domain assumption A low-rank linear transformation can align these directions across models of different scales.
    Invoked to justify the UNLOCK alignment step for transfer.
invented entities (1)
  • capability direction (Master Key) · no independent evidence
    purpose: To represent and enable transfer of specific post-trained behaviors like CoT or math reasoning.
    Postulated in the hypothesis; independent evidence is limited to the reported transfer experiments with no external falsifiable predictions.
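The low-dimensionality axiom above is in principle checkable. A hedged sketch of one such probe, in the spirit of the paper's spectral-entropy analysis (Figure 8): compute the spectral entropy of the covariance of per-example contrast vectors. The data below is synthetic, planted with a known 3-dimensional structure, so the low entropy is by construction; on real activations the result is an empirical question.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, true_rank = 256, 32, 3  # examples, hidden size, planted subspace dim (assumed)

# Synthetic contrast vectors lying in a 3-dimensional subspace plus small noise.
basis = rng.normal(size=(true_rank, d))
deltas = rng.normal(size=(n, true_rank)) @ basis + 0.01 * rng.normal(size=(n, d))

# Spectral entropy of the covariance: low relative to log(d) means variance
# is concentrated on few directions, i.e. an approximately low-dimensional
# capability subspace; near log(d) means no such structure.
cov = np.cov(deltas, rowvar=False)
eig = np.clip(np.linalg.eigvalsh(cov), 1e-12, None)
p = eig / eig.sum()
spectral_entropy = -(p * np.log(p)).sum()
max_entropy = np.log(d)
```

An entropy well below `log(d)` on real contrast vectors would support the domain assumption; entropy near the maximum would undercut it.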

pith-pipeline@v0.9.0 · 5584 in / 1443 out tokens · 45840 ms · 2026-05-10T18:32:46.818427+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Reference graph

Works this paper leans on

135 extracted references · 67 canonical work pages · 25 internal anchors

  1. [1]

    W. U. Ahmad, S. Majumdar, A. Ficek, S. Narenthiran, M. Samadi, J. Huang, S. Jain, V. Noroozi, and B. Ginsburg. Opencodereasoning-ii: A simple test time scaling approach via self-critique,

  2. [2]

    URLhttps://arxiv.org/abs/2507.09075

  3. [3]

    S. N. Akter, S. Prabhumoye, E. Nyberg, M. Patwary, M. Shoeybi, Y. Choi, and B. Catanzaro. Front-loading reasoning: The synergy between pretraining and post-training data, 2025. URL https://arxiv.org/abs/2510.03264

  4. [4]

    J. E. Anton de la Fuente. Thought editing: Steering models by editing their chain of thought, 2026. URL https://www.lesswrong.com/posts/KXR5FNs4hHT5sMRti/ steering-models-by-editing-their-chain-of-thought

  5. [5]

    Arditi, O

    A. Arditi, O. Obeso, A. Syed, D. Paleka, N. Panickssery, W. Gurnee, and N. Nanda. Refusal in language models is mediated by a single direction, 2024. URLhttps://arxiv.org/abs/2406. 11717

  6. [6]

    Activationsteeringforchain-of-thoughtcompression,

    S.Azizi,E.B.Potraghloo,andM.Pedram. Activationsteeringforchain-of-thoughtcompression,

  7. [7]

    URLhttps://arxiv.org/abs/2507.04742

  8. [8]

    J. Bai, S. Bai, Y. Chu, Z. Cui, K. Dang, X. Deng, Y. Fan, W. Ge, Y. Han, F. Huang, et al. Qwen technical report.arXiv preprint arXiv:2309.16609, 2023. URL https://arxiv.org/abs/2309. 16609

  9. [9]

    Bello, A

    F. Bello, A. Das, F. Zeng, F. Yin, and L. Leqi. Linear representation transferability hypothesis: Leveragingsmallmodelstosteerlargemodels,2025. URL https://arxiv.org/abs/2506.00653

  10. [10]

    Berman, A

    N. Berman, A. Hallak, and A. Shocher. Who said neural networks aren’t linear?, 2025. URL https://arxiv.org/abs/2510.08570

  11. [11]

    Buzzega, R

    P. Buzzega, R. Salami, A. Porrello, and S. Calderara. Rethinking layer-wise model merging through chain of merges, 2025. URLhttps://arxiv.org/abs/2508.21421

  12. [12]

    arXiv preprint arXiv:2503.08727 , year=

    L.Caccia,A.Ansell,E.Ponti,I.Vulić,andA.Sordoni. Trainingplug-n-playknowledgemodules with deep context distillation, 2025. URLhttps://arxiv.org/abs/2503.08727

  13. [13]

    H. Chen, C. Vondrick, and C. Mao. Selfie: Self-interpretation of large language model embed- dings, 2024. URLhttps://arxiv.org/abs/2403.10949

  14. [15]

    URLhttps://arxiv.org/abs/2110.14168

  15. [16]

    Csordás, C

    R. Csordás, C. D. Manning, and C. Potts. Do language models use their depth efficiently? InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025. URL https://openreview.net/forum?id=Kz6eUL86XP

  16. [17]

    G. Cui, Y. Zhang, J. Chen, L. Yuan, Z. Wang, Y. Zuo, H. Li, Y. Fan, H. Chen, W. Chen, Z. Liu, H. Peng, L. Bai, W. Ouyang, Y. Cheng, B. Zhou, and N. Ding. The entropy mechanism of reinforcement learning for reasoning language models, 2025. URLhttps://arxiv.org/abs/ 2505.22617

  17. [18]

    Nudging: Inference-timealignmentofllmsviaguideddecoding,

    Y.Fei,Y.Razeghi,andS.Singh. Nudging: Inference-timealignmentofllmsviaguideddecoding,

  18. [19]

    URLhttps://arxiv.org/abs/2410.09300

  19. [20]

    Ghandeharioun, A

    A. Ghandeharioun, A. Caciularu, A. Pearce, L. Dixon, and M. Geva. Patchscopes: A unifying framework for inspecting hidden representations of language models, 2024. URLhttps:// arxiv.org/abs/2401.06102

  20. [21]

    Y. Gu, L. Dong, F. Wei, and M. Huang. Minillm: Knowledge distillation of large language models, 2025. URLhttps://arxiv.org/abs/2306.08543

  21. [22]

    2024 , journal =

    W. Gurnee and M. Tegmark. Language models represent space and time, 2024. URLhttps: //arxiv.org/abs/2310.02207. 16

  22. [23]

    C. He, R. Luo, Y. Bai, S. Hu, Z. L. Thai, J. Shen, J. Hu, X. Han, Y. Huang, Y. Zhang, J. Liu, L. Qi, Z. Liu, and M. Sun. Olympiadbench: A challenging benchmark for promoting agi with olympiad-level bilingual multimodal scientific problems, 2024. URLhttps://arxiv.org/abs/ 2402.14008

  23. [25]

    URLhttps://arxiv.org/abs/2103.03874

  24. [26]

    Thereasoning-memorizationinterplayinlanguage models is mediated by a single direction, 2025

    Y.Hong,D.Zhou,M.Cao,L.Yu,andZ.Jin. Thereasoning-memorizationinterplayinlanguage models is mediated by a single direction, 2025. URLhttps://arxiv.org/abs/2503.23084

  25. [27]

    Huang, P.-Z

    S.-C. Huang, P.-Z. Li, Y.-C. Hsu, K.-M. Chen, Y. T. Lin, S.-K. Hsiao, R. T.-H. Tsai, and H. yi Lee. Chat vector: A simple approach to equip llms with instruction following and model alignment in new languages, 2024. URLhttps://arxiv.org/abs/2310.04799

  26. [28]

    Huang, C

    Y. Huang, C. Huang, D. Feng, W. Lei, and J. Lv. Cross-model transferability among large language models on the platonic representations of concepts, 2025. URLhttps://arxiv.org/ abs/2501.02009

  27. [29]

    The platonic representation hypothesis.arXiv preprint arXiv:2405.07987, 2024

    M.Huh,B.Cheung,T.Wang,andP.Isola. Theplatonicrepresentationhypothesis.arXivpreprint arXiv:2405.07987, 2024. URLhttps://arxiv.org/abs/2405.07987

  28. [30]

    Reinforcement Learning via Self-Distillation

    J. Hübotter, F. Lübeck, L. Behric, A. Baumann, M. Bagatella, D. Marta, I. Hakimi, I. Shenfeld, T. K. Buening, C. Guestrin, and A. Krause. Reinforcement learning via self-distillation, 2026. URLhttps://arxiv.org/abs/2601.20802

  29. [31]

    Editing Models with Task Arithmetic

    G.Ilharco,M.T.Ribeiro,M.Wortsman,S.Gururangan,L.Schmidt,H.Hajishirzi,andA.Farhadi. Editing models with task arithmetic, 2023. URLhttps://arxiv.org/abs/2212.04089

  30. [32]

    arXiv preprint arXiv:2512.05117 (2025)

    P.Kaushik,S.Chaudhari,A.Vaidya,R.Chellappa,andA.Yuille. Theuniversalweightsubspace hypothesis, 2025. URLhttps://arxiv.org/abs/2512.05117

  31. [33]

    Konen, S

    K. Konen, S. Jentzsch, D. Diallo, P. Schütt, O. Bensch, R. E. Baff, D. Opitz, and T. Hecking. Style vectors for steering generative large language model, 2024. URLhttps://arxiv.org/abs/2402. 01618

  32. [34]

    Solving Quantitative Reasoning Problems with Language Models

    A. Lewkowycz, A. Andreassen, D. Dohan, E. Dyer, H. Michalewski, V. Ramasesh, A. Slone, C. Anil, I. Schlag, T. Gutman-Solo, Y. Wu, B. Neyshabur, G. Gur-Ari, and V. Misra. Solving quantitative reasoning problems with language models, 2022. URLhttps://arxiv.org/abs/ 2206.14858

  33. [35]

    Li, J.-C

    K. Li, J.-C. Pang, and Y. Yu. Rlvr training of llms does not improve thinking ability for general qa: Evaluation method and a simple solution, 2026. URLhttps://arxiv.org/abs/2603.20799

  34. [36]

    arXiv preprint arXiv:2401.08190

    M.Liao,W.Luo,C.Li,J.Wu,andK.Fan. Mario: Mathreasoningwithcodeinterpreteroutput–a reproducible pipeline.arXiv preprint arXiv:2401.08190, 2024

  35. [37]

    A. Liu, X. Han, Y. Wang, Y. Tsvetkov, Y. Choi, and N. A. Smith. Tuning language models by proxy, 2024. URLhttps://arxiv.org/abs/2401.08565

  36. [38]

    A. H. Liu, K. Khandelwal, S. Subramanian, V. Jouault, A. Rastogi, A. Sadé, A. Jeffares, A. Jiang, A. Cahill, A. Gavaudan, A. Sablayrolles, A. Héliou, A. You, A. Ehrenberg, A. Lo, A. Eliseev, A. Calvi, A. Sooriyarachchi, B. Bout, B. Rozière, B. D. Monicault, C. Lanfranchi, C. Barreau, C. Courtot, D. Grattarola, D. Dabert, D. de las Casas, E. Chane-Sane, F....

  37. [39]

    Midtrainingbridgespretrainingandposttrainingdistributions,

    E.Liu,G.Neubig,andC.Xiong. Midtrainingbridgespretrainingandposttrainingdistributions,

  38. [40]

    URLhttps://arxiv.org/abs/2510.14865

  39. [41]

    S. Liu, H. Ye, L. Xing, and J. Zou. In-context vectors: Making in context learning more effective andcontrollablethroughlatentspacesteering,2024. URL https://arxiv.org/abs/2311.06668

  40. [42]

    S.-Y. Liu, X. Dong, X. Lu, S. Diao, M. Liu, M.-H. Chen, H. Yin, Y.-C. F. Wang, K.-T. Cheng, Y. Choi, et al. Dler: Doing length penalty right-incentivizing more intelligence per token via reinforcement learning.arXiv preprint arXiv:2510.15110, 2025

  41. [43]

    Efficient Estimation of Word Representations in Vector Space

    T. Mikolov, K. Chen, G. Corrado, and J. Dean. Efficient estimation of word representations in vector space, 2013. URLhttps://arxiv.org/abs/1301.3781

  42. [44]

    InProceedings of the 62nd annual meet- ing of the association for computational linguistics (volume 1: Long papers), pages 15789–15809

    D. Nguyen, A. Prasad, E. Stengel-Eskin, and M. Bansal. Grains: Gradient-based attribution for inference-time steering of llms and vlms.arXiv preprint arXiv:2507.18043, 2025. URL https://arxiv.org/abs/2507.18043

  43. [45]

    T. OLMo, P. Walsh, L. Soldaini, D. Groeneveld, K. Lo, S. Arora, A. Bhagia, Y. Gu, S. Huang, M.Jordan,N.Lambert,D.Schwenk,O.Tafjord,T.Anderson,D.Atkinson,F.Brahman,C.Clark, P. Dasigi, N. Dziri, A. Ettinger, M. Guerquin, D. Heineman, H. Ivison, P. W. Koh, J. Liu, S. Malik, W.Merrill,L.J.V.Miranda,J.Morrison,T.Murray,C.Nam,J.Poznanski,V.Pyatkin,A.Rangapur, M...

  44. [46]

    Oozeer, D

    N. Oozeer, D. Nathawani, N. Prakash, M. Lan, A. Harrasse, and A. Abdullah. Activation space interventions can be transferred between large language models, 2025. URLhttps: //arxiv.org/abs/2503.04429

  45. [47]

    RAST: reasoning activation in llms via small-model transfer

    S. Ouyang, X. Zhu, Z. Xiao, M. Jiang, Y. Meng, and J. Han. Rast: Reasoning activation in llms via small-model transfer, 2025. URLhttps://arxiv.org/abs/2506.15710

  46. [48]

    Steering Llama 2 via Contrastive Activation Addition

    N. Panickssery, N. Gabrieli, J. Schulz, M. Tong, E. Hubinger, and A. M. Turner. Steering llama 2 via contrastive activation addition, 2024. URLhttps://arxiv.org/abs/2312.06681

  47. [49]

    K. Park, Y. J. Choe, and V. Veitch. The linear representation hypothesis and the geometry of large language models, 2024. URLhttps://arxiv.org/abs/2311.03658

  48. [50]

    Are NLP Models really able to Solve Simple Math Word Problems?

    A. Patel, S. Bhattamishra, and N. Goyal. Are NLP models really able to solve simple math word problems? In K. Toutanova, A. Rumshisky, L. Zettlemoyer, D. Hakkani-Tur, I. Beltagy, S. Bethard, R. Cotterell, T. Chakraborty, and Y. Zhou, editors,Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: ...

  49. [51]

    Y. Qin, Y. Lin, J. Yi, J. Zhang, X. Han, Z. Zhang, Y. Su, Z. Liu, P. Li, M. Sun, and J. Zhou. Knowledge inheritance for pre-trained language models, 2022. URLhttps://arxiv.org/abs/ 2105.13880

  50. [52]

    Qwen, :, A. Yang, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Li, D. Liu, F. Huang, H. Wei, H. Lin, J. Yang, J. Tu, J. Zhang, J. Yang, J. Yang, J. Zhou, J. Lin, K. Dang, K. Lu, K. Bao, K. Yang, L. Yu, M. Li, M. Xue, P. Zhang, Q. Zhu, R. Men, R. Lin, T. Li, T. Tang, T. Xia, X. Ren, X. Ren, Y. Fan, Y. Su, Y. Zhang, Y. Wan, Y. Liu, Z. Cui, Z. Zhang, and Z...

  51. [54]

    Analysing mathematical reasoning abilities of neural models

    D. Saxton, E. Grefenstette, F. Hill, and P. Kohli. Analysing mathematical reasoning abilities of neural models, 2019. URLhttps://arxiv.org/abs/1904.01557

  52. [55]

    Z. Shen, H. Yan, L. Zhang, Z. Hu, Y. Du, and Y. He. Codi: Compressing chain-of-thought into continuous space via self-distillation, 2025. URLhttps://arxiv.org/abs/2502.21074. 18

  53. [56]

    Self-distillationenablescontinuallearning,

    I.Shenfeld, M.Damani, J.Hübotter, andP.Agrawal. Self-distillationenablescontinuallearning,

  54. [57]

    URLhttps://arxiv.org/abs/2601.19897

  55. [58]

    Layer by layer: Uncovering hidden representations in language models.arXiv preprint arXiv:2502.02013, 2025

    O. Skean, M. R. Arefin, D. Zhao, N. Patel, J. Naghiyev, Y. LeCun, and R. Shwartz-Ziv. Layer by layer: Uncovering hidden representations in language models, 2025. URLhttps://arxiv.org/ abs/2502.02013

  56. [59]

    Stoehr, K

    N. Stoehr, K. Du, V. Snæbjarnarson, R. West, R. Cotterell, and A. Schein. Activation scaling for steering and interpreting language models, 2024. URLhttps://arxiv.org/abs/2410.04962

  57. [60]

    Improvinginstruction-following in language models through activation steering, 2025

    A.Stolfo,V.Balachandran,S.Yousefi,E.Horvitz,andB.Nushi. Improvinginstruction-following in language models through activation steering, 2025. URLhttps://arxiv.org/abs/2410. 12877

  58. [61]

    D. Tan, D. Chanin, A. Lynch, D. Kanoulas, B. Paige, A. Garriga-Alonso, and R. Kirk. Analyzing the generalization and reliability of steering vectors, 2025. URLhttps://arxiv.org/abs/2407. 12404

  59. [62]

    G. Team, M. Riviere, S. Pathak, P. G. Sessa, C. Hardin, S. Bhupatiraju, L. Hussenot, T. Mesnard, B. Shahriari, A. Ramé, et al. Gemma 2: Improving open language models at a practical size. arXiv preprint arXiv:2408.00118, 2024. URLhttps://arxiv.org/abs/2408.00118

  60. [63]

    G. Team, A. Kamath, J. Ferret, S. Pathak, N. Vieillard, R. Merhej, S. Perrin, T. Matejovicova, A. Ramé, M. Rivière, L. Rouillard, T. Mesnard, G. Cideron, J.-B. Grill, S. Ramos, E. Yvinec, M. Casbon, E. Pot, I. Penchev, G. Liu, F. Visin, K. Kenealy, L. Beyer, X. Zhai, A. Tsitsulin, R. Busa-Fekete, A. Feng, N. Sachdeva, B. Coleman, Y. Gao, B. Mustafa, ...

  61. [65]

    E. Todd, M. L. Li, A. S. Sharma, A. Mueller, B. C. Wallace, and D. Bau. Function vectors in large language models, 2024. URL https://arxiv.org/abs/2310.15213

  62. [66]

    A. M. Turner, L. Thiergart, G. Leech, D. Udell, J. J. Vazquez, U. Mini, and M. MacDiarmid. Steering language models with activation engineering, 2024. URL https://arxiv.org/abs/2308.10248

  63. [67]

    T. van der Weij, M. Poesio, and N. Schoots. Extending activation steering to broad skills and multiple behaviours, 2024. URL https://arxiv.org/abs/2403.05767

  64. [68]

    C. Venhoff, I. Arcuschin, P. Torr, A. Conmy, and N. Nanda. Base models know how to reason, thinking models learn when, 2025. URL https://arxiv.org/abs/2510.07364

  65. [69]

    C. Venhoff, I. Arcuschin, P. Torr, A. Conmy, and N. Nanda. Understanding reasoning in thinking language models via steering vectors, 2025. URL https://arxiv.org/abs/2506.18167

  66. [70]

    E. P. Walsh, L. Soldaini, D. Groeneveld, K. Lo, S. Arora, A. Bhagia, Y. Gu, S. Huang, M. Jordan, N. Lambert, D. Schwenk, O. Tafjord, T. Anderson, D. Atkinson, F. Brahman, C. Clark, P. Dasigi, N. Dziri, A. Ettinger, M. Guerquin, D. Heineman, H. Ivison, P. W. Koh, J. Liu, S. Malik, W. Merrill, L. J. V. Miranda, J. Morrison, T. Murray, C. Nam, J. Poznanski...

  67. [71]

    F. Wan, L. Zhong, Z. Yang, R. Chen, and X. Quan. Fusechat: Knowledge fusion of chat models, 2024. URL https://arxiv.org/abs/2408.07990

  69. [73]

    B. Wang, C. Lee, N. Lee, S.-C. Lin, W. Dai, Y. Chen, Y. Chen, Z. Yang, Z. Liu, M. Shoeybi, B. Catanzaro, and W. Ping. Nemotron-cascade: Scaling cascaded reinforcement learning for general-purpose reasoning models, 2025. URL https://arxiv.org/abs/2512.13607

  70. [74]

    J. Wang, Y. Chen, Z. Li, and C. Huang. Lightreasoner: Can small language models teach large language models reasoning?, 2025. URL https://arxiv.org/abs/2510.07962

  71. [75]

    S. Wang, L. Yu, C. Gao, C. Zheng, S. Liu, R. Lu, K. Dang, X. Chen, J. Yang, Z. Zhang, Y. Liu, A. Yang, A. Zhao, Y. Yue, S. Song, B. Yu, G. Huang, and J. Lin. Beyond the 80/20 rule: High-entropy minority tokens drive effective reinforcement learning for llm reasoning, 2025. URL https://arxiv.org/abs/2506.01939

  72. [76]

    Z. Wang, F. Zhou, X. Li, and P. Liu. Octothinker: Mid-training incentivizes reinforcement learning scaling, 2025. URL https://arxiv.org/abs/2506.20512

  73. [77]

    J. Wei, X. Wang, D. Schuurmans, M. Bosma, B. Ichter, F. Xia, E. Chi, Q. Le, and D. Zhou. Chain-of-thought prompting elicits reasoning in large language models, 2023. URL https://arxiv.org/abs/2201.11903

  74. [78]

    T. Wu, R. Yang, J. Li, P. Hu, Y.-C. Wu, N. Wong, and Y. Yang. Shadow-ft: Tuning instruct model via training on paired base model, 2025. URL https://arxiv.org/abs/2505.12716

  75. [79]

    A. Yang, A. Li, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Gao, C. Huang, C. Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025. URL https://arxiv.org/abs/2505.09388

  76. [80]

    A. Yang, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Li, D. Liu, F. Huang, H. Wei, et al. Qwen2.5 technical report. arXiv preprint arXiv:2412.15115, 2025. URL https://arxiv.org/abs/2412.15115

  77. [81]

    Y. Yue, Z. Chen, R. Lu, A. Zhao, Z. Wang, Y. Yue, S. Song, and G. Huang. Does reinforcement learning really incentivize reasoning capacity in llms beyond the base model?, 2025. URL https://arxiv.org/abs/2504.13837

  78. [82]

    M. Zbeeb, H. A. A. K. Hammoud, and B. Ghanem. Reasoning vectors: Transferring chain-of-thought capabilities via task arithmetic, 2025. URL https://arxiv.org/abs/2509.01363

  79. [83]

    W. Zhong, R. Cui, Y. Guo, Y. Liang, S. Lu, Y. Wang, A. Saied, W. Chen, and N. Duan. Agieval: A human-centric benchmark for evaluating foundation models, 2023. URL https://arxiv.org/abs/2304.06364

  80. [84]

    Z. Zhong and A. Raghunathan. Watch the weights: Unsupervised monitoring and control of fine-tuned llms, 2025. URL https://arxiv.org/abs/2508.00161

Showing first 80 references.