Interpretability in Activation Space Analysis of Transformers: A Focused Survey

Interpretability in activation space analysis of transformers: A focused survey · arXiv 2302.09304

2 Pith papers cite this work. Polarity classification is still indexing.

representative citing papers

kNNGuard: Turning LLM Hidden Activations into a Training-Free Configurable Guardrail

cs.LG · 2026-07-02 · unverdicted · novelty 6.0

kNNGuard classifies prompts using multi-layer kNN on LLM hidden activations from 50 examples, matching or exceeding fine-tuned guardrails in F1 while running 2.7x to 10x faster with no training required.

Correcting Influence: Unboxing LLM Outputs with Orthogonal Latent Spaces

cs.LG · 2026-05-12 · unverdicted · novelty 6.0

A latent mediation framework with sparse autoencoders enables non-additive token-level influence attribution in LLMs by learning orthogonal features and back-propagating attributions.

citing papers explorer

kNNGuard: Turning LLM Hidden Activations into a Training-Free Configurable Guardrail
cs.LG · 2026-07-02 · unverdicted · none · ref 16

Correcting Influence: Unboxing LLM Outputs with Orthogonal Latent Spaces
cs.LG · 2026-05-12 · unverdicted · none · ref 182
