Recognition: no theorem link
Inference Time Causal Probing in LLMs
Pith reviewed 2026-05-11 01:49 UTC · model grok-4.3
The pith
HDMI steers hidden states in LLMs with a margin objective to deliver more reliable causal interventions than probe-based methods.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper establishes that HDMI, by directly optimizing a margin objective on hidden states to favor target outputs over source outputs via the LLM's own predictive distribution, produces interventions with higher reliability (defined as the harmonic mean of completeness and selectivity) than methods that rely on trained probe classifiers. The claim is demonstrated on the LGD agreement corpus and the CausalGym benchmark for Meta-Llama-3-8B-Instruct and Pythia-70M; a lookahead extension further enables controlled text editing while preserving fluency.
What carries the argument
Hidden-state Driven Margin Intervention (HDMI): a probe-free gradient method that applies a margin loss directly to hidden states using the model's native output probabilities to steer a targeted property.
Load-bearing premise
That directly optimizing a margin objective on hidden states using the model's native output will change the targeted property as intended while preserving selectivity and fluency without introducing new misalignments.
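As a concrete reading of this premise, the margin step can be sketched as gradient descent applied directly to a hidden state. The sketch below is a minimal toy, not the paper's implementation: the unembedding matrix `W`, margin value, and step size are illustrative assumptions. It relies on the fact that for a softmax readout the gradient of `log p(source) - log p(target)` with respect to the hidden state reduces to `W[source] - W[target]`.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def hdmi_step(h, W, target, source, margin=2.0, lr=0.1):
    """One margin step directly on the hidden state h.

    W is a toy unembedding matrix (one row per vocab item), so
    logits = W @ h and the model's "native output" is softmax(logits).
    The hinge loss max(0, margin + log p_source - log p_target) has
    gradient W[source] - W[target] w.r.t. h: the softmax terms cancel.
    """
    logits = [sum(wi * hi for wi, hi in zip(w, h)) for w in W]
    p = softmax(logits)
    loss = max(0.0, margin + math.log(p[source]) - math.log(p[target]))
    if loss > 0.0:
        # Descend on the hinge: move h along W[target] - W[source].
        h = [hi - lr * (ws - wt) for hi, ws, wt in zip(h, W[source], W[target])]
    return h, loss
```

Iterating the step until the hinge deactivates pushes the target continuation's probability above the source's by the chosen margin; whether such a local edit also preserves selectivity and fluency is exactly the premise at issue.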
What would settle it
Running HDMI on the LGD agreement corpus or the CausalGym benchmark with Meta-Llama-3-8B-Instruct or Pythia-70M and comparing its harmonic mean of completeness and selectivity against prior probe-based methods; observing no improvement over those baselines would refute the core claim.
read the original abstract
Causal probing methods aim to test and control how internal representations influence the behavior of generative models. In causal probing, an intervention modifies hidden states so that a property takes on a different value. Most existing approaches define such interventions by training an auxiliary probe classifier, which ties the method to a specific task or model and risks misalignment with the model's predictive geometry. We propose Hidden-state Driven Margin Intervention (HDMI), a probe-free, gradient-based technique that directly steers hidden states using the model's native output. HDMI applies a margin objective that increases the probability of a target continuation while decreasing that of the source, without relying on probe classifiers. We further introduce a lookahead variant (LA-HDMI) for text editing that backpropagates through the softmax embeddings, modifying the current hidden state so that the likelihood of user-specified tokens increases in next token generations while preserving fluency. To evaluate interventions, we measure completeness (whether the targeted property changes as intended) and selectivity (whether unrelated properties are preserved), and report their harmonic mean as an overall measure of reliability. HDMI consistently achieves higher reliability than prior methods on the LGD agreement corpus and the CausalGym benchmark, across Meta-Llama-3-8B-Instruct, and Pythia-70M.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Hidden-state Driven Margin Intervention (HDMI), a probe-free causal probing technique for LLMs that directly optimizes a margin objective on the model's native next-token probabilities to modify hidden states, increasing the likelihood of target continuations while decreasing source ones. A lookahead variant (LA-HDMI) is proposed for text editing by backpropagating through softmax embeddings. Interventions are evaluated via completeness (intended property change) and selectivity (preservation of unrelated properties), with reliability as their harmonic mean. The central empirical claim is that HDMI and LA-HDMI achieve higher reliability than prior probe-based methods on the LGD agreement corpus and CausalGym benchmark, demonstrated across Meta-Llama-3-8B-Instruct and Pythia-70M.
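For orientation, the reliability score is simply the harmonic mean of the two evaluation axes (the same aggregation as an F1 score). A minimal sketch, with illustrative scores:

```python
def reliability(completeness: float, selectivity: float) -> float:
    """Harmonic mean of completeness and selectivity.

    The harmonic mean rewards balance: a method cannot buy a high
    reliability score by maximizing one axis while collapsing the other.
    """
    if completeness + selectivity == 0:
        return 0.0
    return 2 * completeness * selectivity / (completeness + selectivity)
```

For example, completeness 0.9412 with selectivity 0.8117 yields a reliability of about 0.872, while perfect completeness with zero selectivity scores 0.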
Significance. If the evaluation metrics are shown to be independent of the optimization objective, the probe-free design using native model outputs would represent a meaningful advance in causal intervention methods, reducing reliance on auxiliary classifiers that may misalign with the model's geometry. The approach could enable more generalizable control over internal representations while preserving fluency, with potential implications for interpretability and editing in generative models. The use of standard benchmarks and multiple model scales is a positive aspect for comparability.
major comments (2)
- [Evaluation section] Likely §4 or §5: The abstract and method description define completeness as whether the targeted property changes as intended, but provide no explicit formula or protocol for measuring it independently of the margin objective's probability shifts. If completeness is scored using the same next-token probability changes optimized by HDMI (or generations resulting directly from the intervention), the reliability metric becomes circular by construction, as successful optimization will register high completeness tautologically. This undermines the claim of superior reliability over probe-based baselines, which use separate classifiers for evaluation. Please provide the precise definition of completeness (e.g., via equation or pseudocode) and confirm it relies on an auxiliary, non-optimized measure such as held-out behavioral tests or independent probes.
- [§3 (Method)] Method and experiments: The weakest assumption (that direct margin optimization on hidden states alters the intended causal property without introducing new misalignments) is not sufficiently tested. The paper should include ablations or controls showing that selectivity on unrelated properties is preserved beyond superficial checks, and that downstream behavior changes are attributable to the targeted property rather than side effects in representation space. Without this, the higher reliability on LGD and CausalGym may not generalize.
minor comments (2)
- [Abstract] The phrasing 'across Meta-Llama-3-8B-Instruct, and Pythia-70M' contains a grammatical error (extra comma); correct to 'across Meta-Llama-3-8B-Instruct and Pythia-70M'.
- [Experiments] The manuscript should include quantitative results, error bars, statistical significance tests, and full experimental protocol details (e.g., number of runs, hyperparameter choices) in the main text or appendix to support the reliability claims, as these are absent from the abstract.
Simulated Author's Rebuttal
Thank you for the constructive feedback on our manuscript. We address each major comment below, clarifying the evaluation protocol and committing to additional empirical controls where appropriate.
read point-by-point responses
- Referee: [Evaluation section] Likely §4 or §5: The abstract and method description define completeness as whether the targeted property changes as intended, but provide no explicit formula or protocol for measuring it independently of the margin objective's probability shifts. If completeness is scored using the same next-token probability changes optimized by HDMI (or generations resulting directly from the intervention), the reliability metric becomes circular by construction, as successful optimization will register high completeness tautologically. This undermines the claim of superior reliability over probe-based baselines, which use separate classifiers for evaluation. Please provide the precise definition of completeness (e.g., via equation or pseudocode) and confirm it relies on an auxiliary, non-optimized measure such as held-out behavioral tests or independent probes.
Authors: We appreciate the referee's concern regarding potential circularity. Completeness is measured via the benchmark-specific protocols on held-out test sets: for LGD, this is the change in agreement label on unseen examples (independent of the exact source/target token probabilities optimized by the margin loss); for CausalGym, it follows the benchmark's predefined behavioral tests for the target property. The margin objective only shapes the immediate next-token distribution during intervention, while completeness evaluates the resulting property shift through these separate metrics. We will add an explicit equation and pseudocode in the revised §4 to formalize this distinction and confirm reliance on auxiliary, non-optimized measures. revision: yes
- Referee: [§3 (Method)] Method and experiments: The weakest assumption (that direct margin optimization on hidden states alters the intended causal property without introducing new misalignments) is not sufficiently tested. The paper should include ablations or controls showing that selectivity on unrelated properties is preserved beyond superficial checks, and that downstream behavior changes are attributable to the targeted property rather than side effects in representation space. Without this, the higher reliability on LGD and CausalGym may not generalize.
Authors: We agree that stronger controls would better substantiate the core assumption. The current selectivity results are computed on the benchmarks' unrelated properties, but we will add ablations in the revision: (1) measuring preservation of additional unrelated attributes via auxiliary classifiers, (2) intervening on non-causal dimensions as a control, and (3) verifying that downstream changes track the targeted property rather than diffuse representation shifts. These will be reported alongside the existing reliability numbers to address generalizability. revision: yes
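To make the promised non-circularity concrete, the evaluation the authors describe could look like the sketch below. This is our reading, not the paper's code; `intervene` and `judge` are hypothetical callables, where `intervene` applies the HDMI margin steps and `judge` is an external behavioral test or independent classifier that played no role in the optimization.

```python
def completeness_score(examples, intervene, judge, target_value):
    """Fraction of held-out examples whose property, as read off by the
    external judge, equals the target value after intervention.

    Because the judge is independent of the margin loss being optimized,
    a high score is evidence of a real property change, not a tautology.
    """
    flips = sum(1 for ex in examples if judge(intervene(ex)) == target_value)
    return flips / len(examples)
```

A toy stand-in: with `intervene=lambda x: -x` and `judge=lambda y: y < 0` over held-out inputs `[1, 2, 3, -4]`, the score is 0.75, since the last example ends up positive after the "intervention".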
Circularity Check
No significant circularity; method and evaluation defined independently
full rationale
The HDMI intervention is defined directly via a margin objective on the model's native next-token probabilities, without reference to auxiliary probes or fitted parameters from the evaluation metrics. Completeness and selectivity are presented as separate evaluation criteria whose harmonic mean yields reliability; nothing in the abstract or described method reduces completeness to the optimized probability shifts by construction or renames the optimization success as an independent prediction. The superiority claim on LGD and CausalGym is therefore not forced by self-definition or self-citation chains. This is the normal case of a self-contained empirical method.
Reference graph
Works this paper leans on
- [1] Hiba Ahsan, Arnab Sen Sharma, Silvio Amir, David Bau, and Byron C. Wallace. Elucidating mechanisms of demographic bias in LLMs for healthcare. arXiv preprint arXiv:2502.13319.
- [2] Guillaume Alain and Yoshua Bengio. Understanding intermediate layers using linear classifier probes. arXiv preprint arXiv:1610.01644.
- [3] Aryaman Arora, Dan Jurafsky, and Christopher Potts. CausalGym: Benchmarking causal interpretability methods on linguistic tasks. arXiv preprint arXiv:2402.12560, 2024.
- [4] Marc Canby, Adam Davies, Chirag Rastogi, and Julia Hockenmaier. How reliable are causal probing interventions? arXiv preprint arXiv:2408.15510.
- [5] Sumanth Dathathri, Andrea Madotto, Janice Lan, Jane Hung, Eric Frank, Piero Molino, Jason Yosinski, and Rosanne Liu. Plug and play language models: A simple approach to controlled text generation. arXiv preprint arXiv:1912.02164.
- [6] Adam Davies, Jize Jiang, and ChengXiang Zhai. Competence-based analysis of language models. arXiv preprint arXiv:2303.00333.
- [7] Yanai Elazar and Yoav Goldberg. Adversarial removal of demographic attributes from text data. arXiv preprint arXiv:1808.06640.
- [8] Sheridan Feucht, Eric Todd, Byron Wallace, and David Bau. The dual-route model of induction. arXiv preprint arXiv:2504.03022.
- [9] John Hewitt and Percy Liang. Designing and interpreting probes with control tasks. arXiv preprint arXiv:1909.03368.
- [10] Ben Krause, Akhilesh Deepak Gotmare, Bryan McCann, Nitish Shirish Keskar, Shafiq Joty, Richard Socher, and Nazneen Fatema Rajani. GeDi: Generative discriminator guided sequence generation. arXiv preprint arXiv:2009.06367.
- [11] Alisa Liu, Maarten Sap, Ximing Lu, Swabha Swayamdipta, Chandra Bhagavatula, Noah A. Smith, and Yejin Choi. DExperts: Decoding-time controlled text generation with experts and anti-experts. arXiv preprint arXiv:2105.03023.
- [12] Kevin Meng, David Bau, Alex Andonian, and Yonatan Belinkov. Locating and editing factual associations in GPT. Advances in Neural Information Processing Systems, 35:17359–17372.
- [13] Koyena Pal, Jiuding Sun, Andrew Yuan, Byron C. Wallace, and David Bau. Future Lens: Anticipating subsequent tokens from a single hidden state. arXiv preprint arXiv:2311.04897.
- [14] Tiago Pimentel, Josef Valvoda, Rowan Hall Maudslay, Ran Zmigrod, Adina Williams, and Ryan Cotterell. Information-theoretic probing for linguistic structure. arXiv preprint arXiv:2004.03061.
- [15] Shauli Ravfogel, Yanai Elazar, Hila Gonen, Michael Twiton, and Yoav Goldberg. Null it out: Guarding protected attributes by iterative nullspace projection. arXiv preprint arXiv:2004.07667.
- [16] Shauli Ravfogel, Grusha Prasad, Tal Linzen, and Yoav Goldberg. Counterfactual interventions reveal the causal effect of relative clause representations on agreement prediction. arXiv preprint arXiv:2105.06965.
- [17] Aruna Sankaranarayanan, Dylan Hadfield-Menell, and Aaron Mueller. Disjoint processing mechanisms of hierarchical and linear grammars in large language models. arXiv preprint arXiv:2501.08618.
- [18] Arnab Sen Sharma, Giordano Rogers, Natalie Shapira, and David Bau. LLMs process lists with general filter heads. arXiv preprint arXiv:2510.26784.
- [19] Alexander Matt Turner, Lisa Thiergart, Gavin Leech, David Udell, Juan J. Vazquez, Ulisse Mini, and Monte MacDiarmid. Activation addition: Steering language models without optimization. arXiv preprint arXiv:2308.10248. Also circulated as "Steering Language Models With Activation Engineering".
- [20] Jesse Vig, Sebastian Gehrmann, Yonatan Belinkov, Sharon Qian, Daniel Nevo, Yaron Singer, and Stuart Shieber. Investigating gender bias in language models using causal mediation analysis. Advances in Neural Information Processing Systems.
- [21] Zhengxuan Wu, Aryaman Arora, Zheng Wang, Atticus Geiger, Dan Jurafsky, Christopher D. Manning, and Christopher Potts. ReFT: Representation finetuning for language models. arXiv preprint arXiv:2404.03592, 2024.