arxiv: 2606.26155 · v1 · pith:TVQUCMNGnew · submitted 2026-06-23 · 💻 cs.AI

Detecting and Controlling Sycophancy with Cascading Linear Features

Pith reviewed 2026-06-26 01:23 UTC · model grok-4.3

classification 💻 cs.AI
keywords sycophancy · activation steering · linear features · interpretability · language models · behavior control · cascading samples · detection

The pith

An iterative pipeline generates graded sycophancy samples that isolate linear activation features for detection and steering.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a method that moves past simple contrastive pairs by creating sequences of samples whose sycophantic content increases in measured steps. These sequences are used to extract directions in the model's activation space that align with the behavior. The resulting directions form subspaces that separate sycophantic from non-sycophantic responses more cleanly than standard approaches. This separation supports direct detection of the behavior, a numerical score without external judges, and activation edits that reduce sycophancy while preserving other capabilities.

Core claim

By constructing cascading samples in which the degree of sycophantic behavior scales linearly with an underlying feature, the method isolates linear directions in activation space that correspond to sycophancy; these directions are linearly separable, permit deterministic scoring of the behavior, and support steering vectors that reduce sycophancy more effectively than LLM-as-a-judge or system-prompt baselines while requiring fewer compute resources.

What carries the argument

The cascading linear features pipeline: an iterative data-generation loop that produces samples with graded sycophancy levels to extract linear directions from activation differences.
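
To make the extraction step concrete, here is a minimal sketch under stated assumptions. It is not the paper's procedure, which this page does not specify. It assumes each sample carries an integer sycophancy grade and one activation vector, and it takes the direction to be the least-squares slope of the activations against the grade. The name `graded_direction` and the toy data are hypothetical.

```python
import math

def graded_direction(levels, activations):
    """Unit vector along the least-squares slope of activations vs. grade.

    Assumption: `levels[i]` is the sycophancy grade of sample i and
    `activations[i]` is its activation vector (a list of floats).
    """
    n = len(levels)
    dim = len(activations[0])
    mean_level = sum(levels) / n
    mean_act = [sum(a[d] for a in activations) / n for d in range(dim)]
    var_level = sum((l - mean_level) ** 2 for l in levels)
    # Per-coordinate regression slope: cov(level, activation) / var(level).
    slope = [
        sum((l - mean_level) * (a[d] - mean_act[d])
            for l, a in zip(levels, activations)) / var_level
        for d in range(dim)
    ]
    norm = math.sqrt(sum(s * s for s in slope))
    return [s / norm for s in slope]

# Toy data: only the first coordinate changes with the grade.
levels = [0, 1, 2, 3]
activations = [[0.0, 1.0], [1.0, 1.0], [2.0, 1.0], [3.0, 1.0]]
direction = graded_direction(levels, activations)
```

Unlike a difference of two class means, the regression slope uses every grade in the cascade, which is one plausible reading of why graded samples could help.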

If this is right

  • Detection of sycophancy becomes possible by projecting activations onto the extracted direction without needing a separate judge model.
  • Steering is achieved by subtracting a scaled version of the direction from the model's activations during generation.
  • A deterministic numerical score for sycophancy can be read directly from the projection value.
  • The same pipeline can be run on new behaviors once graded samples are available.
  • The approach requires only forward passes on the target model rather than repeated calls to an external judge.
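
The detection, scoring, and steering points above can be sketched as follows. This is an illustration only: it assumes a unit-length direction has already been extracted, and the threshold, the steering strength `alpha`, and all function names are hypothetical choices, not values from the paper.

```python
def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

def sycophancy_score(activation, direction):
    """Deterministic score: projection of the activation onto the direction."""
    return dot(activation, direction)

def is_sycophantic(activation, direction, threshold):
    """Detection: flag the sample when the projection exceeds a threshold."""
    return sycophancy_score(activation, direction) > threshold

def steer(activation, direction, alpha):
    """Steering: subtract a scaled copy of the direction from the activation."""
    return [a - alpha * d for a, d in zip(activation, direction)]

# Hypothetical unit direction and activation vector.
direction = [0.6, 0.8]
activation = [1.0, 2.0]

score_before = sycophancy_score(activation, direction)
steered = steer(activation, direction, alpha=1.0)
score_after = sycophancy_score(steered, direction)
```

Because the direction has unit length, subtracting `alpha` times the direction lowers the score by exactly `alpha`. In a real model the edit would be applied to a chosen layer's activations during generation, a detail this sketch leaves out.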

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same graded-sample technique could be tested on other behaviors whose intensity can be varied continuously, such as factual accuracy or refusal strength.
  • If the linear directions remain stable across model sizes or families, the features might transfer without re-extraction.
  • Combining the extracted direction with existing sparse-autoencoder features could further disentangle sycophancy from correlated traits.
  • The method supplies an explicit test for whether a given behavior is represented by a single linear direction rather than a more complex manifold.

Load-bearing premise

Sycophantic behavior can be made to increase in linear proportion to the strength of a single activation-space direction by selecting or generating appropriate samples.

What would settle it

If activations collected from the graded samples do not lie along a single linear direction when the sycophancy level is varied, the extracted features will not separate the behavior cleanly.
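
One way such a check might be run, as a sketch rather than the paper's own test: take the mean activation at each sycophancy level, center those means, and measure what fraction of their spread lies along a candidate unit direction. A ratio near 1 suggests the graded samples fall on a line; a clearly lower ratio suggests curvature or a second direction. The function name and toy points are hypothetical.

```python
def variance_along(points, direction):
    """Fraction of the spread of `points` that lies along a unit `direction`."""
    n = len(points)
    dim = len(points[0])
    center = [sum(p[d] for p in points) / n for d in range(dim)]
    centered = [[p[d] - center[d] for d in range(dim)] for p in points]
    total = sum(x * x for c in centered for x in c)
    along = sum(sum(x * d for x, d in zip(c, direction)) ** 2 for c in centered)
    return along / total

direction = [1.0, 0.0]  # assumed to be unit length

# Level means that sit on a straight line.
ratio_line = variance_along([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]], direction)

# Level means that bend away from the line at the middle level.
ratio_curved = variance_along([[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]], direction)
```

What counts as a passing ratio is a judgment call that the paper would need to justify; this sketch only shows the quantity being measured.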

Original abstract

Interpreting and controlling model behaviors through activation steering methods requires many pairs of contrastive samples that clearly exhibit desired or undesired behavior. These data pairs determine the degree to which interpretability frameworks can reliably detect model features responsible for a behavior, and therefore the ability to steer models toward or away from such behavior. In this work, we present an iterative data generation pipeline that isolates cascading linear features responsible for a behavior. Specifically, we show how moving beyond simple binary pairs of samples, and instead isolating samples that show degrees of features that scale linearly with behavior, allows for better disentanglement of features. We focus on detecting and steering away from sycophancy -- the tendency of language models to prioritize user validation. We demonstrate that sycophancy features discovered through cascading samples form linearly separable subspaces, and allow for selection of model activations that more clearly correspond to the desired behavior than baseline approaches. We also evaluate their ability to enable detection, deterministic scoring, and robust steering, and see that they either match or outperform LLM-as-a-judge and system prompting baselines while providing lower computational demand and more interpretability guarantees. Code & Data: https://cascading-feats.github.io/

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper claims that an iterative pipeline generating cascading samples with sycophancy degrees that scale linearly with activation-space feature strength produces cleaner linearly separable subspaces than binary contrast pairs. These features enable improved detection, deterministic scoring, and steering of sycophancy that match or exceed LLM-as-a-judge and system-prompt baselines while using less compute and offering more interpretability.

Significance. If the linearity assumption holds and the empirical gains are robust, the method would strengthen activation-steering toolkits by replacing ad-hoc contrast pairs with a scalable, degree-controlled data generation process, directly addressing a key bottleneck in mechanistic interpretability for safety-relevant behaviors.

major comments (2)
  1. [Methods / Pipeline description] The central claim rests on the unverified assumption that sycophancy degree scales linearly with the underlying feature strength in the generated cascading samples (weakest assumption in the stress-test note). No correlation analysis, projection-vs-score plots, or ablation confirming this scaling is described. That verification is load-bearing: without it, any disentanglement or steering gains cannot be attributed to the cascading approach rather than to the generation process itself.
  2. [Abstract / Results claims] The abstract asserts that the discovered features 'match or outperform' baselines on detection, scoring, and steering, yet supplies no quantitative metrics, error bars, dataset sizes, number of models tested, or ablation details. This absence prevents assessment of whether post-hoc sample selection or fitting choices drive the reported superiority.
minor comments (1)
  1. The link to code and data is provided, which is a strength; ensure the released artifacts include the exact cascading generation prompts and the behavioral scoring rubric used for the linearity check.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which highlight important aspects of our pipeline validation and result reporting. We address each major comment below and commit to revisions that strengthen the manuscript without altering its core claims.

Point-by-point responses
  1. Referee: [Methods / Pipeline description] The central claim rests on the unverified assumption that sycophancy degree scales linearly with the underlying feature strength in the generated cascading samples (weakest assumption in the stress-test note). No correlation analysis, projection-vs-score plots, or ablation confirming this scaling is described. That verification is load-bearing: without it, any disentanglement or steering gains cannot be attributed to the cascading approach rather than to the generation process itself.

    Authors: We agree that explicit verification of the linear scaling assumption is necessary to isolate the contribution of the cascading pipeline. The manuscript describes the iterative generation process but does not include correlation coefficients, projection-versus-score plots, or dedicated ablations on this point. In the revision we will add these analyses (including Pearson correlations between feature strength and human-annotated sycophancy scores, and ablation removing the linear grading step) to confirm that the observed separability and steering improvements are attributable to the cascading design rather than the base generation procedure. revision: yes

  2. Referee: [Abstract / Results claims] The abstract asserts that the discovered features 'match or outperform' baselines on detection, scoring, and steering, yet supplies no quantitative metrics, error bars, dataset sizes, number of models tested, or ablation details. This absence prevents assessment of whether post-hoc sample selection or fitting choices drive the reported superiority.

    Authors: The abstract is intentionally concise and omits numerical values, which is standard practice; however, the referee is correct that this prevents immediate evaluation of the strength of the superiority claims. The full paper reports results across multiple models with dataset sizes and some ablations, but does not present error bars or exhaustive ablation tables in the main results section. We will revise the abstract to include the key quantitative metrics (e.g., detection AUC, steering success rates with standard deviations) and ensure the results section contains complete ablation tables with error bars and model counts. revision: yes
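
The two analyses promised in these responses can be sketched with standard formulas. This is a hedged illustration with made-up numbers, not results from the paper: `pearson` computes the correlation between projection values and human-annotated grades, and `pairwise_auc` computes detection AUC as the fraction of (sycophantic, neutral) pairs ranked correctly, with ties counted as half. All names and data are hypothetical.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def pairwise_auc(pos_scores, neg_scores):
    """AUC as the share of (positive, negative) pairs ranked correctly."""
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos_scores
        for n in neg_scores
    )
    return wins / (len(pos_scores) * len(neg_scores))

# Made-up projections onto the direction and matching human grades.
projections = [0.1, 0.9, 2.1, 2.9]
human_scores = [0, 1, 2, 3]
r = pearson(projections, human_scores)

# Made-up projection scores for labelled sycophantic and neutral responses.
sycophantic = [2.1, 2.9, 0.9]
neutral = [0.1, 1.0, 0.4]
detection_auc = pairwise_auc(sycophantic, neutral)
```

Error bars of the kind the referee asks for could then be obtained by repeating either computation over resampled data or over several models.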

Circularity Check

0 steps flagged

No significant circularity; empirical pipeline with external validation

Full rationale

The paper presents an iterative data-generation pipeline for isolating linear features in activation space, evaluated against LLM-as-a-judge and system-prompting baselines. No equations, fitted parameters, or self-citations are shown in the provided text that reduce the claimed detection/steering performance to quantities defined by the same data or prior author work. The linearity of sycophancy scaling is asserted as an observed property of the generated samples rather than a definitional input that forces the result by construction. The central claims therefore rest on independent empirical comparisons and do not collapse to self-reference.

Axiom & Free-Parameter Ledger

1 free parameter · 0 axioms · 0 invented entities

The central claim rests on the existence of linearly scaling activation features for sycophancy that can be isolated by graded sampling; no explicit free parameters, axioms, or invented entities are named in the abstract, though the pipeline likely introduces choices such as the number of cascade steps or linearity thresholds that function as free parameters.

free parameters (1)
  • cascade depth or scaling steps
    The iterative pipeline requires choosing how many graded levels to generate; this choice is not derived from first principles and affects the extracted subspaces.
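
A simple sensitivity check for this free parameter, sketched under assumptions: extract a direction from the full cascade and from a coarser one (here, every other level), then compare them by cosine similarity. The slope-based extraction, the function names, and the toy activations are hypothetical and stand in for whatever the paper actually uses.

```python
import math

def slope_direction(levels, activations):
    """Per-coordinate least-squares slope of activations against the level."""
    n = len(levels)
    dim = len(activations[0])
    mean_level = sum(levels) / n
    mean_act = [sum(a[d] for a in activations) / n for d in range(dim)]
    var_level = sum((l - mean_level) ** 2 for l in levels)
    return [
        sum((l - mean_level) * (a[d] - mean_act[d])
            for l, a in zip(levels, activations)) / var_level
        for d in range(dim)
    ]

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v)))

# Toy cascade with five levels and slightly noisy second coordinate.
levels = [0, 1, 2, 3, 4]
activations = [[0.0, 0.0], [1.0, 0.2], [2.0, 0.1], [3.0, 0.4], [4.0, 0.3]]

full = slope_direction(levels, activations)
# Coarser cascade: keep only levels 0, 2 and 4.
coarse = slope_direction(levels[::2], activations[::2])
similarity = cosine(full, coarse)
```

A similarity close to 1 across several cascade depths would suggest the extracted subspace is not very sensitive to this choice; a large drop would make the depth a parameter worth reporting.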


