
arxiv: 2606.11172 · v1 · pith:ROBAFCHVnew · submitted 2026-06-09 · 💻 cs.LG

Predicting Future Behaviors in Reasoning Models Enables Better Steering

Pith reviewed 2026-06-27 13:35 UTC · model grok-4.3

classification 💻 cs.LG
keywords large reasoning models · activation probes · future behavior prediction · model steering · text generation control · interpretability · behavior control

The pith

Probes that predict future behaviors from intermediate steps enable steering of reasoning models with little quality loss.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that prior steering of large reasoning models has relied on internal features detecting behavior in already-generated text, yet these features predict future outcomes poorly. It trains activation probes on intermediate reasoning steps to forecast the likelihood of specific future behaviors, reaching 64 to 91 percent accuracy and exposing a distinct class of prediction features. From these features the authors build Future Probe Controlled Generation, which draws multiple candidate sentences and retains the one whose probe score indicates the highest chance of the desired future behavior. The resulting method steers model outputs effectively while avoiding the quality degradation seen in activation steering and succeeds in test cases where activation steering fails.

Core claim

Internal representations contain separate prediction features that forecast future behavioral outcomes from intermediate reasoning steps; training probes on these features and using them at the text level to select among sampled candidates produces steering that preserves output quality and works where detection-based activation steering does not.

What carries the argument

Future Probe Controlled Generation (FPCG), a sampling-and-selection procedure that scores candidate sentences with a probe trained to predict future behavior likelihood.
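The sampling-and-selection loop described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `sample_sentence`, `probe_score`, and `is_finished` are hypothetical callables standing in for the reasoning model's sentence sampler, the activation probe (wrapped so that it takes text and returns a predicted probability of the target future behavior), and a stopping test. The parameter name `num_candidates` follows the ablation figures; the `maximize` flag is an assumption meant to cover steering toward or away from a behavior.

```python
from typing import Callable


def fpcg_generate(
    prompt: str,
    sample_sentence: Callable[[str], str],  # hypothetical: draws one candidate next sentence
    probe_score: Callable[[str], float],    # hypothetical: predicted P(target future behavior)
    is_finished: Callable[[str], bool],     # hypothetical: stop test on the text so far
    num_candidates: int = 10,
    maximize: bool = True,
    max_sentences: int = 50,
) -> str:
    """Grow a response sentence by sentence, keeping the best-scored candidate."""
    text = prompt
    pick = max if maximize else min
    for _ in range(max_sentences):
        # Draw several candidate continuations from the same prefix.
        candidates = [sample_sentence(text) for _ in range(num_candidates)]
        # Score each candidate by the probe's prediction for the extended text.
        scores = [probe_score(text + c) for c in candidates]
        # Keep the candidate with the highest (or lowest) predicted likelihood.
        best = pick(range(num_candidates), key=scores.__getitem__)
        text += candidates[best]
        if is_finished(text):
            break
    return text
```

Because the intervention only chooses among sentences the model itself sampled, the hidden states are never modified; the cost is roughly `num_candidates` times more sentence-level sampling per step.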

If this is right

  • Steering can be performed with almost no degradation in output quality.
  • FPCG succeeds in several evaluations where activation steering fails.
  • Prediction probes reach 64 to 91 percent accuracy at identifying the most likely future behavior.
  • Distinguishing detection features from prediction features supports a more nuanced approach to controlling model behaviors.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same separation of detection and prediction features could be tested in non-reasoning language models to check whether the distinction is general.
  • FPCG could be applied to safety-relevant behaviors such as reducing specific failure modes to measure real-world utility.
  • Combining future-behavior probes with other forms of steering might yield additive control without compounding quality costs.

Load-bearing premise

Internal activations at intermediate reasoning steps contain information about future behavioral outcomes that is distinct from and more useful than information about behavior already present in the text.
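To make the premise concrete, here is a small sketch of what training and scoring such a probe could look like. It assumes a logistic linear probe fitted to probability labels and reports the two metrics named in the Figure 4 caption (mean absolute error and binarized accuracy). The data are synthetic stand-ins for activations, and the probe form, learning rate, and the 0.5 binarization threshold are illustrative assumptions, not details confirmed by the paper.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict(w, b, x):
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def train_probe(xs, ps, lr=0.5, epochs=500):
    """Full-batch gradient descent on cross-entropy with soft (probability) labels."""
    d, n = len(xs[0]), len(xs)
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        gw, gb = [0.0] * d, 0.0
        for x, p in zip(xs, ps):
            err = predict(w, b, x) - p  # gradient of the loss w.r.t. the logit
            for j in range(d):
                gw[j] += err * x[j]
            gb += err
        w = [wj - lr * g / n for wj, g in zip(w, gw)]
        b -= lr * gb / n
    return w, b


def evaluate(w, b, xs, ps):
    preds = [predict(w, b, x) for x in xs]
    mae = sum(abs(q - p) for q, p in zip(preds, ps)) / len(ps)
    # Binarized accuracy: do prediction and label fall on the same side of 0.5?
    acc = sum((q >= 0.5) == (p >= 0.5) for q, p in zip(preds, ps)) / len(ps)
    return mae, acc


def make_synthetic(n, rng, w_true=(2.0, -1.0, 0.5, 0.0)):
    # Synthetic "activations" and probability labels; NOT real model data.
    xs = [[rng.gauss(0, 1) for _ in w_true] for _ in range(n)]
    ps = [sigmoid(sum(a * v for a, v in zip(w_true, x))) for x in xs]
    return xs, ps


rng = random.Random(0)
train_x, train_p = make_synthetic(100, rng)  # disjoint train and eval subsets
eval_x, eval_p = make_synthetic(100, rng)
w, b = train_probe(train_x, train_p)
mae, acc = evaluate(w, b, eval_x, eval_p)
print(f"MAE={mae:.3f}  binarized accuracy={acc:.2f}")
```

On real activations the interesting comparison is against the random and mean baselines the caption mentions; the synthetic labels here are noise-free, so the numbers this script prints say nothing about the paper's results.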

What would settle it

An experiment in which future-behavior probes are trained yet FPCG produces no measurable improvement in steering success rate or quality retention compared with random candidate selection or standard activation steering.
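The comparison against random candidate selection can be illustrated with a toy simulation. This is only a sketch of the logic of the test, under strong assumptions: each candidate is given a hypothetical "true" probability of leading to the target behavior, and the probe is modeled as that probability plus Gaussian noise. None of the numbers correspond to the paper's experiments.

```python
import random


def mean_selected_prob(num_prompts, num_candidates, probe_noise, rng, use_probe):
    """Average true behavior probability of the candidate that gets selected."""
    total = 0.0
    for _ in range(num_prompts):
        # Hypothetical ground-truth probability for each sampled candidate.
        true_p = [rng.random() for _ in range(num_candidates)]
        if use_probe:
            # Noisy probe: the less informative the probe, the larger the noise.
            scores = [p + rng.gauss(0.0, probe_noise) for p in true_p]
            chosen = max(range(num_candidates), key=scores.__getitem__)
        else:
            # Baseline: pick a candidate uniformly at random.
            chosen = rng.randrange(num_candidates)
        total += true_p[chosen]
    return total / num_prompts


rng = random.Random(0)
informative = mean_selected_prob(2000, 10, 0.1, rng, use_probe=True)
uninformative = mean_selected_prob(2000, 10, 100.0, rng, use_probe=True)
baseline = mean_selected_prob(2000, 10, 0.0, rng, use_probe=False)
print(informative, uninformative, baseline)
```

In this toy setting an informative probe lifts the selected candidates well above the random baseline, while a probe dominated by noise is indistinguishable from it. The second outcome is the pattern that would count against the paper's claim.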

Figures

Figures reproduced from arXiv: 2606.11172 by Evgenii Kortukov, Florian Klein, Gabriele Sarti, Paula Engl, Piotr Komorowski, Sebastian Lapuschkin, Seong Joon Oh, Wojciech Samek.

Figure 1
LLMs have distinct features for detecting past and predicting future behaviors, enabling steering. Left: Existing steering methods use contrastive response activations that capture detection features (top). A distinct set of LLM features enables future behavior prediction (bottom). Right: The proposed FPCG algorithm samples candidate sentences and selects the best using an activation probe that predicts fu…
Figure 2
Fraction of behaviorally uncertain prompts in each behavioral dataset. Refusal behaviors (SORRY-Bench, Xie et al. [2025]), Prompt injections (SEP, Zverev et al. [2025]), and Sycophancy (ELEPHANT-AITA, Cheng et al. [2026]). In Appendix B, we provide dataset examples and details on evaluation procedures. With this setup, we aim to reflect a broad range of the realistic choices deployed models make in user in…
Figure 3
Behavior distribution dynamics for two example responses to the same prompt. Example from Refusal (SORRY-Bench) evaluation of DeepSeek-R1-Distill-Llama-8B.
Figure 4
Performance of the Linear Probe predicting output behavior probabilities. Mean Absolute Error (top) and Binarized Accuracy (bottom), with random and mean baselines as dashed lines. gpt-oss-20b, layer 20. In the training and evaluation datasets, each activation is paired with the probability label, gathered in Section 3.2. For each behavioral dataset, we use two disjoint subsets of 100 for training and eval…
Figure 5
Comparison between Linear Probes predicting the future behavior trained on all response sentences (Prediction features) vs. only trained on the final answer activations (Detection features). … detection features, as opposed to behavior prediction features, which capture what the model intends to do in future generation. Note that this is a standard way to extract behavior representations as used, for example…
Figure 6
Future Probe Controlled Generation.
Figure 7
Difference-in-Means steering performance in controlling the behavior of DeepSeek-R1-Distill-Llama-8B. We sweep over steering multipliers. The numbers above indicate the strongest steering with additional < 10% filtered out examples. Yellow bars show the fraction of examples filtered out due to not following the response format, a proxy for strong performance degradation. Dashed lines show performance of n…
Figure 8
Average perplexity of model generations steered with FPCG and Activation Steering. If we limit ourselves to steering multipliers with less than 10% output degradation, we find that FPCG performs comparably to activation steering in steering strength. In these setups, FPCG offers stronger steering in Myopic Reward, Survival Instinct and Prompt Injection, while performing comparably in Wealth Seeking and sli…
Figure 9
Sample prompt from the Myopic Reward dataset, with a response generated by DeepSeek-R1-Distill-Llama-8B.
Figure 10
Sample prompt from the Wealth Seeking dataset, with a response generated by DeepSeek-R1-Distill-Llama-8B.
Figure 11
Sample prompt from the Survival Instinct dataset, with a response generated by DeepSeek-R1-Distill-Llama-8B.
Figure 12
Sample prompt from the SORRY-Bench dataset, with a response generated by DeepSeek-R1-Distill-Llama-8B.
Figure 13
Sample prompt from the SEP dataset, with a response generated by DeepSeek-R1-Distill-Llama-8B.
Figure 14
Sample prompt from the ELEPHANT-AITA dataset, with a response generated by DeepSeek-R1-Distill-Llama-8B.
Figure 15
Linear vs. MLP Probes comparison on predicting output behavior distributions.
Figure 16
Comparison between Linear Probes predicting the future behavior trained on all response sentences (Prediction features) vs. only trained on the final answer activations (Detection features) for all studied models.
Figure 17
Activation steering by layer for DeepSeek-R1-Distill-Llama-8B. Each subplot reports Average Behavior Probability and the fraction of examples filtered out (yellow bars).
Figure 18
Figure 19
Activation steering by layer for gpt-oss-20b. Each subplot reports Average Behavior Probability and the fraction of examples filtered out (yellow bars).
Figure 20
Activation steering by layer for QwQ-32B. Each subplot reports Average Behavior Probability and the fraction of examples filtered out (yellow bars).
Figure 21
FPCG (Layer 15) vs. activation steering (Layer 25) for DeepSeek-R1-Distill-Llama-8B. [Plot: panels Myopic Reward, Wealth Seeking, Survival Instinct, …; conditions Negative, No Steering, Positive; y-axis Avg. Behavior Probability.]
Figure 22
FPCG (Layer 32) vs. activation steering (Layer 20) for Qwen3-14B.
Figure 23
FPCG (Layer 20) vs. activation steering (Layer 20) for gpt-oss-20b. [Plot: panels Myopic Reward, Wealth Seeking, Survival Instinct, …; conditions Negative, No Steering, Positive; y-axis Avg. Behavior Probability.]
Figure 24
FPCG (Layer 50) vs. activation steering (Layer 32) for QwQ-32B.
Figure 25
Ablation analysis of num_candidates for DeepSeek-R1-Distill-Llama-8B. [Plot: panels Myopic Reward, Wealth Seeking, …; x-axis num_candidates (2, 3, 5, 10, 15); y-axis Avg. Behavior Probability (%).]
Figure 26
Ablation analysis of num_candidates for Qwen3-14B.
Original abstract

Deployed large reasoning models (LRMs) often behave unexpectedly. Test-time steering controls LRM outputs by intervening on their hidden representations, but it can degrade output quality. We argue that prior steering work implicitly relies on internal features that detect behavior in already generated text. We show that these detection features are poor predictors of future behavioral outcomes, and thus not the natural intervention target. Instead, we train activation probes to predict future behavior likelihoods from intermediate reasoning steps. These probes predict the most likely behavior with 64%-91% accuracy, revealing a separate type of internal prediction features. Building on these prediction features, we introduce a text-level steering method, Future Probe Controlled Generation. FPCG samples multiple candidate sentences and chooses the best one according to a probe predicting the future behavior likelihood. This enables steering with almost no output quality degradation. FPCG also enables steering in several evaluations where activation steering fails. These results show that distinguishing detection and prediction features enables a more nuanced approach to controlling LRM behaviors.
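For contrast with FPCG, the activation-steering baseline mentioned in the abstract can be sketched in a few lines. The version below assumes the Difference-in-Means variant named in the Figure 7 caption: a direction is computed from contrastive sets of activations and added to a hidden state, scaled by a steering multiplier. Which layer and token positions are used, and whether the direction is normalized, are not specified here and are left out; the function names are illustrative.

```python
def mean_vector(rows):
    """Component-wise mean of a list of equal-length activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def diff_in_means(positive_acts, negative_acts):
    """Steering direction: mean activation with the behavior minus mean without it."""
    pos, neg = mean_vector(positive_acts), mean_vector(negative_acts)
    return [p - q for p, q in zip(pos, neg)]


def steer(hidden, direction, multiplier):
    """Shift one hidden state along the direction; a negative multiplier steers away."""
    return [h + multiplier * d for h, d in zip(hidden, direction)]


# Toy 2-dimensional activations, purely illustrative.
direction = diff_in_means([[1.0, 2.0], [3.0, 4.0]], [[0.0, 0.0], [2.0, 0.0]])
steered = steer([0.0, 0.0], direction, multiplier=2.0)
```

The multiplier is the knob swept in the steering figures: larger values push the hidden state further from what the model would have computed on its own, which is one plausible reason output quality degrades at strong settings.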

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper claims that detection features used in prior test-time steering of large reasoning models (LRMs) are poor predictors of future behaviors, while newly trained activation probes can predict the most likely future behavior from intermediate steps with 64-91% accuracy. It introduces Future Probe Controlled Generation (FPCG), which samples candidate sentences and selects via the probe to steer behavior with minimal quality degradation and in cases where activation steering fails.

Significance. If the empirical distinction between detection and prediction features is robustly demonstrated and FPCG is shown to outperform baselines without hidden costs, the work would clarify the role of internal representations in LRMs and supply a practical text-level steering technique that preserves output quality better than existing activation methods.

major comments (1)
  1. [Abstract] The central motivation asserts that 'these detection features are poor predictors of future behavioral outcomes, and thus not the natural intervention target,' yet no accuracy figures, baselines, or direct comparisons are supplied for any detection probe on the future-prediction task; without this contrast the claim that the new probes reveal a 'separate type' of superior features remains unsupported.
minor comments (2)
  1. The abstract reports accuracy ranges (64%-91%) and qualitative improvements but supplies no dataset details, number of models evaluated, statistical tests, or ablation results; these must be added to the main text with explicit baselines for the future-prediction task.
  2. Clarify how future-behavior labels are obtained for probe training and whether the same labeled data is used to evaluate detection probes, to allow readers to assess the reported separation.

Simulated Authors' Rebuttal

1 response · 0 unresolved

We thank the referee for the careful review and the opportunity to clarify the presentation of our central claim. We address the single major comment below.

  1. Referee: [Abstract] The central motivation asserts that 'these detection features are poor predictors of future behavioral outcomes, and thus not the natural intervention target,' yet no accuracy figures, baselines, or direct comparisons are supplied for any detection probe on the future-prediction task; without this contrast the claim that the new probes reveal a 'separate type' of superior features remains unsupported.

    Authors: We agree that the abstract, as currently written, does not supply the quantitative contrast that would make the motivation self-contained. The main text does contain the relevant experiments: detection probes (trained on already-generated text) achieve substantially lower accuracy when evaluated on the future-behavior prediction task than the newly trained prediction probes (64-91%). To make this distinction explicit at the level of the abstract and to directly support the claim of a 'separate type' of features, we will revise the abstract to include a concise statement of the accuracy ranges for both classes of probes together with the direct comparison on the future-prediction task.

    revision: yes

Circularity Check

0 steps flagged

No circularity; empirical training and evaluation


The paper's core contributions consist of training activation probes on labeled data to predict future behavior likelihoods from intermediate reasoning steps, reporting empirical accuracies of 64%-91%, and introducing FPCG as a sampling-based steering method that uses these probes. These steps are standard supervised learning and experimental evaluation; no derivation chain reduces a claimed prediction or first-principles result to its own inputs by construction. No self-definitional equations, fitted parameters renamed as independent predictions, or load-bearing self-citations appear in the abstract or described claims. The work is self-contained against external benchmarks via reported accuracies and steering outcomes.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the existence of predictive information in intermediate hidden states and on the effectiveness of supervised probe training, both of which introduce fitted parameters and domain assumptions about model internals.

free parameters (1)
  • activation probe parameters
    Weights and thresholds of the behavior-prediction probes are fitted to training data on hidden activations and future outcomes.
axioms (1)
  • domain assumption: Hidden representations in large reasoning models contain information about future behavioral outcomes
    Invoked to justify training probes on intermediate reasoning steps rather than final outputs.

pith-pipeline@v0.9.1-grok · 5725 in / 1181 out tokens · 26783 ms · 2026-06-27T13:35:07.419066+00:00 · methodology


Reference graph

Works this paper leans on

90 extracted references · 9 canonical work pages

  1. [1]

    Scaling Learning Algorithms Towards

    Bengio, Yoshua and LeCun, Yann , booktitle =. Scaling Learning Algorithms Towards

  2. [2]

    and Osindero, Simon and Teh, Yee Whye , journal =

    Hinton, Geoffrey E. and Osindero, Simon and Teh, Yee Whye , journal =. A Fast Learning Algorithm for Deep Belief Nets , volume =

  3. [3]

    2016 , publisher=

    Deep learning , author=. 2016 , publisher=

  4. [4]

    Causality and Large Models @NeurIPS 2024 , year=

    Counterfactual Token Generation in Large Language Models , author=. Causality and Large Models @NeurIPS 2024 , year=

  5. [5]

    The Thirteenth International Conference on Learning Representations , year=

    Gumbel Counterfactual Generation From Language Models , author=. The Thirteenth International Conference on Learning Representations , year=

  6. [6]

    International Conference on Learning Representations , year=

    Woulda, Coulda, Shoulda: Counterfactually-Guided Policy Search , author=. International Conference on Learning Representations , year=

  7. [7]

    2023 , eprint=

    Instruction-Following Evaluation for Large Language Models , author=. 2023 , eprint=

  8. [8]

    2025 , eprint=

    Control Illusion: The Failure of Instruction Hierarchies in Large Language Models , author=. 2025 , eprint=

  9. [9]

    2024 , eprint=

    Can LLMs Follow Simple Rules? , author=. 2024 , eprint=

  10. [10]

    2025 , eprint=

    Measuring AI Ability to Complete Long Tasks , author=. 2025 , eprint=

  11. [11]

    Prompt Injection attack against

    Yi Liu and Gelei Deng and Yuekang Li and Kailong Wang and Zihao Wang and Xiaofeng Wang and Tianwei Zhang and Yepang Liu and Haoyu Wang and Yan Zheng and Yang Liu , year=. Prompt Injection attack against

  12. [12]

    Greshake, Kai and Abdelnabi, Sahar and Mishra, Shailesh and Endres, Christoph and Holz, Thorsten and Fritz, Mario , title =

  13. [13]

    Ignore Previous Prompt: Attack Techniques For Language Models , author=

  14. [14]

    The instruction hierarchy: Training

    Wallace, Eric and Xiao, Kai and Leike, Reimar and Weng, Lilian and Heidecke, Johannes and Beutel, Alex , journal=. The instruction hierarchy: Training

  15. [15]

    Foundational Challenges in Assuring Alignment and Safety of Large Language Models , author=

  16. [16]

    Multi-property Steering of Large Language Models with Dynamic Activation Composition

    Scalena, Daniel and Sarti, Gabriele and Nissim, Malvina. Multi-property Steering of Large Language Models with Dynamic Activation Composition. Proceedings of the 7th BlackboxNLP Workshop: Analyzing and Interpreting Neural Networks for NLP. 2024. doi:10.18653/v1/2024.blackboxnlp-1.34

  17. [17]

    Better & faster large language models via multi-token prediction , year =

    Gloeckle, Fabian and Idrissi, Badr Youbi and Rozi\`. Better & faster large language models via multi-token prediction , year =. Proceedings of the 41st International Conference on Machine Learning , articleno =

  18. [18]

    Accelerating Transformer Inference for Translation via Parallel Decoding

    Santilli, Andrea and Severino, Silvio and Postolache, Emilian and Maiorca, Valentino and Mancusi, Michele and Marin, Riccardo and Rodola, Emanuele. Accelerating Transformer Inference for Translation via Parallel Decoding. Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2023. doi:10.18653/v1/...

  19. [19]

    The Thirty-eighth Annual Conference on Neural Information Processing Systems , year=

    Evidence of Learned Look-Ahead in a Chess-Playing Neural Network , author=. The Thirty-eighth Annual Conference on Neural Information Processing Systems , year=

  20. [20]

    The Thirteenth International Conference on Learning Representations , year=

    Interpreting Emergent Planning in Model-Free Reinforcement Learning , author=. The Thirteenth International Conference on Learning Representations , year=

  21. [21]

    Lindsey, Jack and Gurnee, Wes and Ameisen, Emmanuel and Chen, Brian and Pearce, Adam and Turner, Nicholas L. and Citro, Craig and Abrahams, David and Carter, Shan and Hosmer, Basil and Marcus, Jonathan and Sklar, Michael and Templeton, Adly and Bricken, Trenton and McDougall, Callum and Cunningham, Hoagy and Henighan, Thomas and Jermyn, Adam and Jones, An...

  22. [22]

    Competition of Mechanisms: Tracing How Language Models Handle Facts and Counterfactuals

    Ortu, Francesco and Jin, Zhijing and Doimo, Diego and Sachan, Mrinmaya and Cazzaniga, Alberto and Sch. Competition of Mechanisms: Tracing How Language Models Handle Facts and Counterfactuals. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2024

  23. [23]

    ICML , crossref=

    Reduan Achtibat and Sayed Mohammad Vakilzadeh Hatefi and Maximilian Dreyer and Aakriti Jain and Thomas Wiegand and Sebastian Lapuschkin and Wojciech Samek , title=. ICML , crossref=. 2024 , cdate=

  24. [24]

    arXiv preprint arXiv:2403.08319 , year=

    Knowledge Conflicts for LLMs: A Survey , author=. arXiv preprint arXiv:2403.08319 , year=

  25. [25]

    Lampert , booktitle=

    Egor Zverev and Evgenii Kortukov and Alexander Panfilov and Soroush Tabesh and Sebastian Lapuschkin and Wojciech Samek and Christoph H. Lampert , booktitle=. 2025 , url=

  26. [26]

    First Conference on Language Modeling , year=

    Studying Large Language Model Behaviors Under Context-Memory Conflicts With Real Documents , author=. First Conference on Language Modeling , year=

  27. [27]

    2025 , eprint=

    The Atlas of In-Context Learning: How Attention Heads Shape In-Context Retrieval Augmentation , author=. 2025 , eprint=

  28. [28]

    2025 , url =

    Anthropic , title =. 2025 , url =

  29. [29]

    2025 , eprint=

    Qwen3 Technical Report , author=. 2025 , eprint=

  30. [30]

    2025 , url =

    Meta , title =. 2025 , url =

  31. [31]

    Mechanistic Interpretability for

    Leonard Bereska and Stratis Gavves , journal=. Mechanistic Interpretability for. 2024 , url=

  32. [32]

    2023 , eprint=

    Universal and Transferable Adversarial Attacks on Aligned Language Models , author=. 2023 , eprint=

  33. [33]

    arXiv preprint arXiv:2503.03750 , year=

    The mask benchmark: Disentangling honesty from accuracy in ai systems , author=. arXiv preprint arXiv:2503.03750 , year=

  34. [34]

    The Twelfth International Conference on Learning Representations , year=

    Towards Understanding Sycophancy in Language Models , author=. The Twelfth International Conference on Learning Representations , year=

  35. [35]

    Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society , author=

    SycEval: Evaluating LLM Sycophancy , volume=. Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society , author=. 2025 , month=. doi:10.1609/aies.v8i1.36598 , number=

  36. [36]

    2022 , eprint=

    Constitutional AI: Harmlessness from AI Feedback , author=. 2022 , eprint=

  37. [37]

    arXiv preprint arXiv:2412.04984 , year=

    Frontier models are capable of in-context scheming , author=. arXiv preprint arXiv:2412.04984 , year=

  38. [38]

    2025 , eprint=

    LLM-Safety Evaluations Lack Robustness , author=. 2025 , eprint=

  39. [39]

    The Thirteenth International Conference on Learning Representations , year=

    A Probabilistic Perspective on Unlearning and Alignment for Large Language Models , author=. The Thirteenth International Conference on Learning Representations , year=

  40. [40]

    and Bau, David , year = 2023, eprint =

    Pal, Koyena and Sun, Jiuding and Yuan, Andrew and Wallace, Byron and Bau, David. Future Lens: Anticipating Subsequent Tokens from a Single Hidden State. Proceedings of the 27th Conference on Computational Natural Language Learning (CoNLL). 2023. doi:10.18653/v1/2023.conll-1.37

  41. [41]

    2024 , booktitle=

    Do language models plan ahead for future tokens? , author=. 2024 , booktitle=

  42. [42]

    Proceedings of the 41st International Conference on Machine Learning , articleno =

    Park, Kiho and Choe, Yo Joong and Veitch, Victor , title =. Proceedings of the 41st International Conference on Machine Learning , articleno =. 2024 , publisher =

  43. [43]

    2022 , journal=

    Toy Models of Superposition , author=. 2022 , journal=

  44. [44]

    arXiv preprint arXiv:2310.08215 , year=

    Trustworthy Machine Learning , author=. arXiv preprint arXiv:2310.08215 , year=

  45. [45]

    2017 , url=

    Understanding intermediate layers using linear classifier probes , author=. 2017 , url=

  46. [46]

    2024 , eprint=

    The Llama 3 Herd of Models , author=. 2024 , eprint=

  47. [47]

    USENIX Security Symposium , year=

    StruQ: Defending against prompt injection with structured queries , author=. USENIX Security Symposium , year=

  48. [48]

    Instructional Segment Embedding: Improving

    Tong Wu and Shujian Zhang and Kaiqiang Song and Silei Xu and Sanqiang Zhao and Ravi Agrawal and Sathish Reddy Indurthi and Chong Xiang and Prateek Mittal and Wenxuan Zhou , booktitle=. Instructional Segment Embedding: Improving. 2025 , url=

  49. [49]

    Lampert , booktitle=

    Egor Zverev and Sahar Abdelnabi and Soroush Tabesh and Mario Fritz and Christoph H. Lampert , booktitle=. Can. 2025 , url=

  50. [50]

    Strategic Dishonesty Can Undermine

    Alexander Panfilov and Evgenii Kortukov and Kristina Nikoli. Strategic Dishonesty Can Undermine. The Fourteenth International Conference on Learning Representations , year=

  51. [51]

    Blog , year=

    Interpretability Can Be Actionable , author=. Blog , year=

  52. [52]

    Forty-second International Conference on Machine Learning , year=

    Detecting Strategic Deception with Linear Probes , author=. Forty-second International Conference on Machine Learning , year=

  53. [53]

    The Thirty-eighth Annual Conference on Neural Information Processing Systems , year=

    Refusal in Language Models Is Mediated by a Single Direction , author=. The Thirty-eighth Annual Conference on Neural Information Processing Systems , year=

  54. [54]

    Steering llama 2 via contrastive activation addition

    Rimsky, Nina and Gabrieli, Nick and Schulz, Julian and Tong, Meg and Hubinger, Evan and Turner, Alexander. Steering Llama 2 via Contrastive Activation Addition. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2024. doi:10.18653/v1/2024.acl-long.828

  55. [55]

    2022 , eprint=

    Discovering Language Model Behaviors with Model-Written Evaluations , author=. 2022 , eprint=

  56. [56]

    The Thirteenth International Conference on Learning Representations , year=

    SORRY-Bench: Systematically Evaluating Large Language Model Safety Refusal , author=. The Thirteenth International Conference on Learning Representations , year=

  57. [57]

    2026 , url=

    Myra Cheng and Sunny Yu and Cinoo Lee and Pranav Khadpe and Lujain Ibrahim and Dan Jurafsky , booktitle=. 2026 , url=

  58. [58]

    2025 , eprint=

    When Chain of Thought is Necessary, Language Models Struggle to Evade Monitors , author=. 2025 , eprint=

  59. [59]

    The Thirteenth International Conference on Learning Representations , year=

    Forking Paths in Neural Text Generation , author=. The Thirteenth International Conference on Learning Representations , year=

  60. [60]

    2025 , eprint=

    Thought Anchors: Which LLM Reasoning Steps Matter? , author=. 2025 , eprint=

  61. [61]

    Bogdan and Senthooran Rajamanoharan and Neel Nanda , booktitle=

    Uzay Macar and Paul C. Bogdan and Senthooran Rajamanoharan and Neel Nanda , booktitle=. Thought Branches: Interpreting. 2026 , url=

  62. [62]

    AxBench: Steering

    Zhengxuan Wu and Aryaman Arora and Atticus Geiger and Zheng Wang and Jing Huang and Dan Jurafsky and Christopher D Manning and Christopher Potts , booktitle=. AxBench: Steering. 2025 , url=

  63. [63]

    2025 , url =

    Joschka Braun and Dmitrii Krasheninnikov and Usman Anwar and Robert Kirk and Daniel Tan and David Scott Krueger , title =. 2025 , url =

  64. [64]

    Workshop on Reasoning and Planning for Large Language Models , year=

    Understanding Reasoning in Thinking Language Models via Steering Vectors , author=. Workshop on Reasoning and Planning for Large Language Models , year=

  65. [65]

    2026 , eprint=

    Building Production-Ready Probes For Gemini , author=. 2026 , eprint=

  66. [66]

    Sycophancy in GPT-4o: What happened and what we’re doing about it , year =

  67. [67]

    Where the goblins came from , year =

  68. [68]

    2026 , month = feb, type =

    Claude Opus 4.6 System Card , institution =. 2026 , month = feb, type =

  69. [69]

    2024 , eprint=

    Steering Without Side Effects: Improving Post-Deployment Control of Language Models , author=. 2024 , eprint=

  70. [70]

    2024 , eprint=

    Steering Language Models With Activation Engineering , author=. 2024 , eprint=

  71. [71]

    The Fourteenth International Conference on Learning Representations , year=

    Latent Planning Emerges with Scale , author=. The Fourteenth International Conference on Learning Representations , year=

  72. [72]

    Emergent Response Planning in

    Zhichen Dong and Zhanhui Zhou and Zhixuan Liu and Chao Yang and Chaochao Lu , booktitle=. Emergent Response Planning in. 2025 , url=

  73. [73]

    2025 , eprint=

    Persona Vectors: Monitoring and Controlling Character Traits in Language Models , author=. 2025 , eprint=

  74. [74]

    2025 , eprint=

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning , author=. 2025 , eprint=

  75. [75]

    Aayush Mishra and Daniel Khashabi and Anqi Liu , booktitle=. Steered. 2026 , url=

  76. [76]

    Distributed Representations of Words and Phrases and their Compositionality , url =

    Mikolov, Tomas and Sutskever, Ilya and Chen, Kai and Corrado, Greg S and Dean, Jeff , booktitle =. Distributed Representations of Words and Phrases and their Compositionality , url =

  77. [77]

    Man is to Computer Programmer as Woman is to Homemaker? Debiasing Word Embeddings , url =

    Bolukbasi, Tolga and Chang, Kai-Wei and Zou, James and Saligrama, Venkatesh and Kalai, Adam T , booktitle =. Man is to Computer Programmer as Woman is to Homemaker? Debiasing Word Embeddings , url =

  78. [78]

    Interpretability Beyond Feature Attribution: Quantitative Testing with Concept Activation Vectors (

    Kim, Been and Wattenberg, Martin and Gilmer, Justin and Cai, Carrie and Wexler, James and Viegas, Fernanda and sayres, Rory , booktitle =. Interpretability Beyond Feature Attribution: Quantitative Testing with Concept Activation Vectors (. 2018 , editor =

  79. [79]

    2026 , eprint=

    From Weights to Activations: Is Steering the Next Frontier of Adaptation? , author=. 2026 , eprint=

  80. [80]

    International Conference on Learning Representations , year=

    Plug and Play Language Models: A Simple Approach to Controlled Text Generation , author=. International Conference on Learning Representations , year=

Showing first 80 references.