Steering large language models using conceptors: Improving addition-based activation engineering.arXiv preprint arXiv:2410.16314

Steering large language models using conceptors: Improving addition-based activation engineering , author= · 2024 · arXiv 2410.16314

4 Pith papers cite this work. Polarity classification is still indexing.

4 Pith papers citing it

read on arXiv browse 4 citing papers

citation-role summary

background 1

citation-polarity summary

unclear 1

representative citing papers

$\alpha$-TCAV: A Unified Framework for Testing with Concept Activation Vectors

stat.ML · 2026-05-15 · unverdicted · novelty 7.0

α-TCAV replaces TCAV's hard indicator with a tunable smooth function to create a unified probabilistic framework with lower variance and guidance for parameter choice or Bayes-optimal scoring.

Prompt-Activation Duality: Improving Activation Steering via Attention-Level Interventions

cs.CL · 2026-05-11 · unverdicted · novelty 7.0 · 2 refs

GCAD reduces coherence drift from -18.6 to -1.9 and raises turn-10 trait expression from 78.0 to 93.1 in persona-steering tasks by using gated attention-delta interventions from system prompts.

How Do Answer Tokens Read Reasoning Traces? Self-Reading Patterns in Thinking LLMs for Quantitative Reasoning

cs.CL · 2026-04-21 · unverdicted · novelty 7.0

Answer tokens show forward drift and key-anchor focus when reading correct reasoning traces; a geometric-plus-semantic SRQ steering method boosts quantitative reasoning accuracy without training.

Locate, Steer, and Improve: A Practical Survey of Actionable Mechanistic Interpretability in Large Language Models

cs.CL · 2026-01-20 · unverdicted · novelty 5.0

The survey organizes mechanistic interpretability techniques into a Locate-Steer-Improve framework to enable actionable improvements in LLM alignment, capability, and efficiency.

citing papers explorer

Showing 4 of 4 citing papers.

$\alpha$-TCAV: A Unified Framework for Testing with Concept Activation Vectors stat.ML · 2026-05-15 · unverdicted · none · ref 131
α-TCAV replaces TCAV's hard indicator with a tunable smooth function to create a unified probabilistic framework with lower variance and guidance for parameter choice or Bayes-optimal scoring.
Prompt-Activation Duality: Improving Activation Steering via Attention-Level Interventions cs.CL · 2026-05-11 · unverdicted · none · ref 12 · 2 links
GCAD reduces coherence drift from -18.6 to -1.9 and raises turn-10 trait expression from 78.0 to 93.1 in persona-steering tasks by using gated attention-delta interventions from system prompts.
How Do Answer Tokens Read Reasoning Traces? Self-Reading Patterns in Thinking LLMs for Quantitative Reasoning cs.CL · 2026-04-21 · unverdicted · none · ref 25
Answer tokens show forward drift and key-anchor focus when reading correct reasoning traces; a geometric-plus-semantic SRQ steering method boosts quantitative reasoning accuracy without training.
Locate, Steer, and Improve: A Practical Survey of Actionable Mechanistic Interpretability in Large Language Models cs.CL · 2026-01-20 · unverdicted · none · ref 242
The survey organizes mechanistic interpretability techniques into a Locate-Steer-Improve framework to enable actionable improvements in LLM alignment, capability, and efficiency.

Steering large language models using conceptors: Improving addition-based activation engineering.arXiv preprint arXiv:2410.16314

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer