Philipp Nazari and T

Aaron Mueller, Atticus Geiger, Sarah Wiegreffe, Dana Arad, Iván Arcuschin, Adam Belfki, Yik Siu Chan, Jaden Fiotto-Kaufman, Tal Haklay, Michael Hanna, Jing Huang, Rohan Gupta, Yaniv Nikankin, Hadas Orgad, Nikhil Prakash, Anja Reusch, Aruna · 2025 · arXiv 2504.13151

5 Pith papers cite this work. Polarity classification is still indexing.

5 Pith papers citing it

read on arXiv browse 5 citing papers

representative citing papers

WriteSAE: Sparse Autoencoders for Recurrent State

cs.LG · 2026-05-12 · unverdicted · novelty 8.0 · 4 refs

WriteSAE introduces sparse autoencoders with rank-1 matrix atoms for recurrent state updates, allowing replacement tests that outperform deletion on 92.4% of positions and a formula predicting logit changes with R²=0.98.

From Circuit Evidence to Mechanistic Theory: An Inductive Logic Approach

cs.LG · 2026-05-20 · unverdicted · novelty 7.0

Introduces Causal Functional Signatures grounded in causal evidence and ILP-learned architectural signatures to enable explicit, comparable, and portable mechanistic claims across model scales.

GKnow: Measuring the Entanglement of Gender Bias and Factual Gender

cs.CL · 2026-05-12 · unverdicted · novelty 7.0

Gender bias and factual gender knowledge are severely entangled in language model circuits and neurons, making neuron ablation an unreliable method for debiasing.

ToxiREX: A Dataset on Toxic REasoning in ConteXt

cs.CL · 2026-06-26 · unverdicted · novelty 6.0

ToxiREX is a new dataset of 128k Reddit comments in six languages with hierarchical annotations for implicit toxicity in conversational context based on an existing reasoning schema.

Temporal Preference Concepts and their Functions in a Large Language Model

cs.LG · 2026-05-11 · unverdicted · novelty 5.0

Causal localization via attribution and patching identifies a temporal preference subgraph in mid-to-upper layers of Qwen3-4B-Instruct-2507, with time-horizon geometry in the residual stream and initial evidence for steering-vector control.

citing papers explorer

Showing 5 of 5 citing papers.

WriteSAE: Sparse Autoencoders for Recurrent State cs.LG · 2026-05-12 · unverdicted · none · ref 29 · 4 links
WriteSAE introduces sparse autoencoders with rank-1 matrix atoms for recurrent state updates, allowing replacement tests that outperform deletion on 92.4% of positions and a formula predicting logit changes with R²=0.98.
From Circuit Evidence to Mechanistic Theory: An Inductive Logic Approach cs.LG · 2026-05-20 · unverdicted · none · ref 71
Introduces Causal Functional Signatures grounded in causal evidence and ILP-learned architectural signatures to enable explicit, comparable, and portable mechanistic claims across model scales.
GKnow: Measuring the Entanglement of Gender Bias and Factual Gender cs.CL · 2026-05-12 · unverdicted · none · ref 53
Gender bias and factual gender knowledge are severely entangled in language model circuits and neurons, making neuron ablation an unreliable method for debiasing.
ToxiREX: A Dataset on Toxic REasoning in ConteXt cs.CL · 2026-06-26 · unverdicted · none · ref 288
ToxiREX is a new dataset of 128k Reddit comments in six languages with hierarchical annotations for implicit toxicity in conversational context based on an existing reasoning schema.
Temporal Preference Concepts and their Functions in a Large Language Model cs.LG · 2026-05-11 · unverdicted · none · ref 79
Causal localization via attribution and patching identifies a temporal preference subgraph in mid-to-upper layers of Qwen3-4B-Instruct-2507, with time-horizon geometry in the residual stream and initial evidence for steering-vector control.

Philipp Nazari and T

fields

years

verdicts

representative citing papers

citing papers explorer