WriteSAE introduces sparse autoencoders with rank-1 matrix atoms for recurrent state updates, allowing replacement tests that outperform deletion on 92.4% of positions and a formula predicting logit changes with R²=0.98.
and Potts, Christopher , year =
7 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
years
2026 7verdicts
UNVERDICTED 7roles
background 1polarities
background 1representative citing papers
Prompt-boundary directional alignment enables geometry-guided search that cuts trials to 95% best utility by 39.8% on average, while concept granularity predicts remaining difficulty via directional heterogeneity.
An audit of SAEBench reveals that Targeted Probe Perturbation and Spurious Correlation Removal metrics fail reliability tests and should not be used to evaluate sparse autoencoders.
LLMs perform in-context learning as trajectories through a structured low-dimensional conceptual belief space, with the structure visible in both behavior and internal representations and causally manipulable via interventions.
AI deployment in high-stakes areas requires domain-scoped calibrated verification with monitoring and revocation, using a proposed six-component Verification Coverage standard instead of mechanistic interpretability.
Decoder-based VLMs hallucinate because visual embeddings are over-aligned to a text manifold; projecting out the top principal components of a universal linguistic subspace reduces this bias and improves benchmark performance.
Overthinking in medical QA is linearly decodable at 71.6% accuracy yet fixed residual-stream steering yields no correction across 29 configurations, while enabling selective abstention with AUROC 0.610.
citing papers explorer
-
WriteSAE: Sparse Autoencoders for Recurrent State
WriteSAE introduces sparse autoencoders with rank-1 matrix atoms for recurrent state updates, allowing replacement tests that outperform deletion on 92.4% of positions and a formula predicting logit changes with R²=0.98.
-
When Is Rank-1 Steering Cheap? Geometry, Granularity, and Budgeted Search
Prompt-boundary directional alignment enables geometry-guided search that cuts trials to 95% best utility by 39.8% on average, while concept granularity predicts remaining difficulty via directional heterogeneity.
-
Are Sparse Autoencoder Benchmarks Reliable?
An audit of SAEBench reveals that Targeted Probe Perturbation and Spurious Correlation Removal metrics fail reliability tests and should not be used to evaluate sparse autoencoders.
-
Stories in Space: In-Context Learning Trajectories in Conceptual Belief Space
LLMs perform in-context learning as trajectories through a structured low-dimensional conceptual belief space, with the structure visible in both behavior and internal representations and causally manipulable via interventions.
-
The Open-Box Fallacy: Why AI Deployment Needs a Calibrated Verification Regime
AI deployment in high-stakes areas requires domain-scoped calibrated verification with monitoring and revocation, using a proposed six-component Verification Coverage standard instead of mechanistic interpretability.
-
When Language Overwrites Vision: Over-Alignment and Geometric Debiasing in Vision-Language Models
Decoder-based VLMs hallucinate because visual embeddings are over-aligned to a text manifold; projecting out the top principal components of a universal linguistic subspace reduces this bias and improves benchmark performance.
-
Decodable but Not Corrected by Fixed Residual-Stream Linear Steering: Evidence from Medical LLM Failure Regimes
Overthinking in medical QA is linearly decodable at 71.6% accuracy yet fixed residual-stream steering yields no correction across 29 configurations, while enabling selective abstention with AUROC 0.610.