Steering large language models using conceptors: Improving addition-based activation engineering. arXiv preprint arXiv:2410.16314
4 Pith papers cite this work.
years: 2026 (4)
verdicts: UNVERDICTED (4)
roles: background (1)
polarities: unclear (1)
citing papers explorer
- $\alpha$-TCAV: A Unified Framework for Testing with Concept Activation Vectors
  α-TCAV replaces TCAV's hard indicator with a tunable smooth function to create a unified probabilistic framework with lower variance and guidance for parameter choice or Bayes-optimal scoring.
- Prompt-Activation Duality: Improving Activation Steering via Attention-Level Interventions
  GCAD reduces coherence drift from -18.6 to -1.9 and raises turn-10 trait expression from 78.0 to 93.1 in persona-steering tasks by using gated attention-delta interventions from system prompts.
- How Do Answer Tokens Read Reasoning Traces? Self-Reading Patterns in Thinking LLMs for Quantitative Reasoning
  Answer tokens show forward drift and key-anchor focus when reading correct reasoning traces; a geometric-plus-semantic SRQ steering method boosts quantitative reasoning accuracy without training.
- Locate, Steer, and Improve: A Practical Survey of Actionable Mechanistic Interpretability in Large Language Models
  The survey organizes mechanistic interpretability techniques into a Locate-Steer-Improve framework to enable actionable improvements in LLM alignment, capability, and efficiency.