Adaptive scheduling of interventions in discrete diffusion language models, timed to attribute-specific commitment schedules discovered with sparse autoencoders, delivers precise multi-attribute steering up to 93% strength while preserving generation quality.
hub Mixed citations
Steering Llama 2 via Contrastive Activation Addition , url =
Mixed citation behavior. Most common role is background (40%).
hub tools
citation-role summary
citation-polarity summary
representative citing papers
SLAM achieves 100% detection on Gemma-2 models with only 1-2 point quality cost by causally steering SAE-identified residual-stream directions for linguistic structure.
10.3-22.9% of pass@k=0 math examples across GSM8K and MATH are recovered by a deterministic six-chain regime using activation grafting, showing a sampling blind spot in difficulty estimation.
Probes predicting future behaviors from intermediate steps enable Future Probe Controlled Generation for steering large reasoning models with minimal quality degradation.
First systematic test shows activation steering robustness drops sharply (up to 64%) under adversarial input perturbations across multiple extraction methods, models, and personas.
Query Lens extends Logit Lens to interpret sparse features via key-value analysis and indirect effects, yielding coherent token signatures where Logit Lens fails, and proposes the Subspace Channel Hypothesis.
Pretraining and alignment induce asymmetric geometric traces in transformer weights because alignment updates concentrate in read pathways due to activation covariance while write pathways inherit less structure from alignment losses.
Symmetry under affine reparameterizations of hidden coordinates selects a unique hierarchy of shallow coordinate-stable probes and a probe-visible quotient for cross-model transfer.
Prompt-boundary directional alignment enables geometry-guided search that cuts trials to 95% best utility by 39.8% on average, while concept granularity predicts remaining difficulty via directional heterogeneity.
Susceptibilities applied to regret in deep RL agents reveal stagewise internal development in parameter space of a gridworld model that policy inspection alone cannot detect, validated via activation steering.
Manifold steering along activation geometry induces behavioral trajectories matching the natural manifold of outputs, while linear steering produces off-manifold unnatural behaviors.
PSR models that estimate token-specific steering coefficients from activations outperform standard activation steering and compare favorably to prompting on steering benchmarks.
Local linearity of LLM layers enables LQR-based closed-loop activation steering with theoretical tracking guarantees.
Mean-difference residual stream injections outperform personality prompting for OCEAN trait steering in most LLMs, with hybrids performing best and showing approximate linearity but non-human trait covariances.
Popular LLM activation steering methods are shown to act as proportional controllers; a PID steering framework is proposed that improves robustness and outperforms baselines in experiments across model families.
Valence is geometrically encoded in Apertus-8B and Gemma-4-E4B with PC1 correlations of 0.76 and 0.83, but emerges at different depths than in Claude and arousal alignment varies by generated corpus.
LLMs fail to reliably self-report adversarial prefill attacks at 27.3% average intention-claim rate on compromised outputs, with signals tied to refusal reasoning, probe framing, and partial mitigation via finetuning that does not transfer.
Activation steering on early layers improves diversity of synthetic data for low-resource languages and often boosts downstream classifier performance compared to non-steered prompting.
Activation patching localizes English detokenization in Llama2-7B to a two-stage attention-then-MLP process at layer 1 that generalizes to 12 models from 8 families, with depth varying by positional encoding, plus an early-layer probe achieving 0.94-0.97 AUROC.
ALMs encode audio evidence but override it with text in conflicts; GACL interpolates joint and same-audio scores to repair reversals, gaining 17.8 nAUC points under a 5pp faithfulness budget.
RCA is a training-free module that boosts input context signal strength in the residual stream of LLMs by orthogonal decoupling of attention routing from value magnitude.
An exposure-based split on BLiMP data reveals delayed generalization in five grammatical phenomena during LLM pre-training, with post-generalization shifts in concept vector predictiveness and attention patterns.
MAGS learns low-dimensional subspaces from correct versus incorrect reasoning traces and applies targeted projection corrections to attention heads when they deviate from the correctness manifold during inference.
TRACE uses cross-layer candidate trajectories inside frozen LLMs to dynamically select and apply one of three correction operators, delivering mean gains of +12.26 MC1 and +8.65 MC2 points across 15 models and 3 benchmarks with no regressions.
citing papers explorer
-
Deep Minds and Shallow Probes
Symmetry under affine reparameterizations of hidden coordinates selects a unique hierarchy of shallow coordinate-stable probes and a probe-visible quotient for cross-model transfer.
-
Interpreting Reinforcement Learning Agents with Susceptibilities
Susceptibilities applied to regret in deep RL agents reveal stagewise internal development in parameter space of a gridworld model that policy inspection alone cannot detect, validated via activation steering.
-
Psychological Steering of Large Language Models
Mean-difference residual stream injections outperform personality prompting for OCEAN trait steering in most LLMs, with hybrids performing best and showing approximate linearity but non-human trait covariances.