Title resolution pending

https://transformer-circuits · 2023 · arXiv 2402.07321

7 Pith papers cite this work. Polarity classification is still indexing.

Title metadata for this work has not finished resolving. The hub is built from the citation graph; the title resolver retries DOI and OpenAlex on its next pass.

citation-role summary

background 2

citation-polarity summary

background 2

representative citing papers

Data-driven Circuit Discovery for Interpretability of Language Models

cs.AI · 2026-05-09 · unverdicted · novelty 7.0

Standard circuit discovery methods produce dataset-specific circuits rather than task-general ones, and a new clustering-based method discovers multiple more faithful circuits per dataset.

How Language Models Process Negation

cs.CL · 2026-05-04 · unverdicted · novelty 7.0

LLMs implement both attention-based suppression and constructive representations for negation, with construction dominant, despite poor accuracy from late-layer attention shortcuts.

Inside-Out: Measuring Generalization in Vision Transformers Through Inner Workings

cs.LG · 2026-04-09 · unverdicted · novelty 6.0

Circuit-based metrics from Vision Transformer internals provide better label-free proxies for generalization under distribution shift than existing methods like model confidence.

The Expert Strikes Back: Interpreting Mixture-of-Experts Language Models at Expert Level

cs.CL · 2026-04-02 · unverdicted · novelty 6.0

MoE expert neurons show lower polysemanticity than dense FFN neurons, a gap that widens with sparser routing, and experts specialize in fine-grained tasks such as specific linguistic operations, supporting expert-level interpretability.

Localizing Task Recognition and Task Learning in In-Context Learning via Attention Head Analysis

cs.CL · 2025-09-29 · unverdicted · novelty 6.0

A new framework using Task Subspace Logit Attribution localizes attention heads specialized for task recognition and task learning in in-context learning, showing they align and rotate hidden states within a task subspace.

Beyond I'm Sorry, I Can't: Dissecting Large Language Model Refusal

cs.CL · 2025-09-07 · unverdicted · novelty 6.0

Sparse autoencoders plus greedy filtering and factorization-machine interaction modeling identify minimal sets of features in Gemma-2-2B-IT and LLaMA-3.1-8B-IT whose ablation produces jailbreaks by flipping refusal to compliance.

Tracing Relational Knowledge Recall in Large Language Models

cs.CL · 2026-04-21 · unverdicted · novelty 5.0

Per-head attention contributions to the residual stream serve as strong linear features for classifying relational knowledge in LLMs, with probe accuracy correlating to relation specificity and signal distribution.

citing papers explorer

Showing 7 of 7 citing papers.

Data-driven Circuit Discovery for Interpretability of Language Models · cs.AI · 2026-05-09 · unverdicted · none · ref 2
How Language Models Process Negation · cs.CL · 2026-05-04 · unverdicted · none · ref 52
Inside-Out: Measuring Generalization in Vision Transformers Through Inner Workings · cs.LG · 2026-04-09 · unverdicted · none · ref 8
The Expert Strikes Back: Interpreting Mixture-of-Experts Language Models at Expert Level · cs.CL · 2026-04-02 · unverdicted · none · ref 2
Localizing Task Recognition and Task Learning in In-Context Learning via Attention Head Analysis · cs.CL · 2025-09-29 · unverdicted · none · ref 7
Beyond I'm Sorry, I Can't: Dissecting Large Language Model Refusal · cs.CL · 2025-09-07 · unverdicted · none · ref 12
Tracing Relational Knowledge Recall in Large Language Models · cs.CL · 2026-04-21 · unverdicted · none · ref 3
