A unified understanding and evaluation of steering methods. arXiv preprint arXiv:2502.02716

Shawn Im, Sharon Li · 2025 · arXiv 2502.02716

7 Pith papers cite this work. Polarity classification is still indexing.

representative citing papers

FishBack: Pullback Fisher Geometry for Optimal Activation Steering in Transformers

cs.LG · 2026-05-17 · unverdicted · novelty 7.0

FishBack derives a closed-form minimum-distortion steering direction from the pullback Fisher metric of the softmax layer, outperforming Euclidean baselines on GPT-2 verb-morphology tasks with lower off-target KL divergence.

When Is Rank-1 Steering Cheap? Geometry, Granularity, and Budgeted Search

cs.LG · 2026-05-09 · unverdicted · novelty 7.0 · 2 refs

Prompt-boundary directional alignment enables a geometry-guided search that cuts the number of trials needed to reach 95% of best utility by 39.8% on average, while concept granularity predicts the remaining difficulty via directional heterogeneity.

VSPO: Vector-Steered Policy Optimization for Behavioral Control

cs.LG · 2026-05-15 · unverdicted · novelty 6.0

VSPO samples rollouts at varying steering intensities to improve behavioral control in LLMs while preserving task accuracy.

Stories in Space: In-Context Learning Trajectories in Conceptual Belief Space

cs.CL · 2026-05-12 · unverdicted · novelty 6.0

LLMs perform in-context learning as trajectories through a structured low-dimensional conceptual belief space, with the structure visible in both behavior and internal representations and causally manipulable via interventions.

Beyond Linear Steering: Unified Multi-Attribute Control for Language Models

cs.LG · 2025-05-30 · unverdicted · novelty 6.0

K-Steering uses a non-linear multi-label classifier on activations to compute gradient-based intervention directions for unified multi-attribute control in LLMs, outperforming linear baselines on ToneBank and DebateMix benchmarks across three model families.

SALSA: Speech Aware LLM Adaptation via Learned Steering Activation Vectors

cs.CL · 2026-05-30 · unverdicted · novelty 5.0

SALSA adapts speech-aware LLMs via supervised layer-wise steering vectors, reporting up to 46.8% relative gains over zero-shot on out-of-domain speech benchmarks.

Steered Generation via Gradient-Based Optimization on Sparse Query Features

cs.LG · 2026-05-21 · unverdicted · novelty 5.0

Prototype-Based Sparse Steering decomposes query activations with SAEs and optimizes sparse features via gradients to steer LLM outputs toward specific behaviors.

citing papers explorer

Showing 7 of 7 citing papers.

FishBack: Pullback Fisher Geometry for Optimal Activation Steering in Transformers · cs.LG · 2026-05-17 · unverdicted · none · ref 35
When Is Rank-1 Steering Cheap? Geometry, Granularity, and Budgeted Search · cs.LG · 2026-05-09 · unverdicted · none · ref 17 · 2 links
VSPO: Vector-Steered Policy Optimization for Behavioral Control · cs.LG · 2026-05-15 · unverdicted · none · ref 12
Stories in Space: In-Context Learning Trajectories in Conceptual Belief Space · cs.CL · 2026-05-12 · unverdicted · none · ref 75
Beyond Linear Steering: Unified Multi-Attribute Control for Language Models · cs.LG · 2025-05-30 · unverdicted · none · ref 3
SALSA: Speech Aware LLM Adaptation via Learned Steering Activation Vectors · cs.CL · 2026-05-30 · unverdicted · none · ref 63
Steered Generation via Gradient-Based Optimization on Sparse Query Features · cs.LG · 2026-05-21 · unverdicted · none · ref 20
