Llama scope: Extracting millions of features from llama-3.1-8b with sparse autoencoders.arXiv preprint arXiv:2410.20526, 2024

Zhengfu He, Wentao Shu, Xuyang Ge, Lingjie Chen, Junxuan Wang, Yunhua Zhou, Frances Liu, Qipeng Guo, Xuanjing Huang, Zuxuan Wu, et al · 2024 · arXiv 2410.20526

16 Pith papers cite this work. Polarity classification is still indexing.

16 Pith papers citing it

read on arXiv browse 16 citing papers

citation-role summary

background 1

citation-polarity summary

background 1

representative citing papers

Descriptive Collision in Sparse Autoencoder Auto-Interpretability: When One Explanation Describes Many Features

cs.LG · 2026-05-13 · accept · novelty 8.0

Many distinct SAE features share identical explanations, with the average annotation resolving only 70% of feature identity in a large annotated dataset.

How Far Do Auto-Interpretation Labels Generalize: A Controlled Study Across Languages, Scripts, and Rewordings

cs.CL · 2026-05-29 · unverdicted · novelty 7.0

Auto-interpretation labels for SAE features generalize poorly across languages and scripts, missing the same semantic content up to 4x more often in Serbian than English and more in Cyrillic than Latin despite deterministic transliteration.

Interpretability-Guided Layer Selection over Subspace Projection: SAEs as Stethoscopes, Not Scalpels, for Raw Task Vector Model Editing

cs.LG · 2026-05-27 · unverdicted · novelty 7.0

SAEs used for layer selection with raw task vectors outperform subspace projection and raise math reasoning accuracy on Gemma-3-4B-IT.

Sign-Aware Gated Sparse Autoencoders: Modeling Anticorrelated Features with Bi-Jump-ReLU Activations

cs.LG · 2026-05-27 · conditional · novelty 7.0

SA-GSAE with Bi-Jump-ReLU enables one latent to encode both polarities of anticorrelated features, Pareto-dominating or matching full-width gated SAEs while reducing dead latents by up to 500x on some LLM hookpoints.

ReSAE: Residualized Sparse Autoencoders for Multi-Layer Transformer Interventions

cs.LG · 2026-05-27 · unverdicted · novelty 7.0

ReSAEs improve multi-layer SAE interventions on Pythia-1.4B and Gemma-2-9B by training later-layer dictionaries on residuals after affine mapping, recovering more cross-entropy loss despite lower raw variance reconstruction.

How Language Models Process Negation

cs.CL · 2026-05-04 · unverdicted · novelty 7.0 · 2 refs

LLMs process negation using both attention-based suppression and constructive representation mechanisms (construction dominant), with late-layer attention shortcuts explaining poor accuracy on negation tasks.

Bucketing the Good Apples: A Method for Diagnosing and Improving Causal Abstraction

cs.AI · 2026-05-04 · unverdicted · novelty 7.0

A four-step recipe partitions the input space using interchange intervention behavior to diagnose where causal abstractions hold and to guide improvements, demonstrated by recovering a full hypothesis from scratch in a toy logic task.

Are LLM Uncertainty and Correctness Encoded by the Same Features? A Functional Dissociation via Sparse Autoencoders

cs.LG · 2026-04-21 · unverdicted · novelty 7.0

Uncertainty and correctness in LLMs are encoded by distinct feature populations, with suppression of confounded features improving accuracy and reducing entropy.

Multilingual Steering by Design: Multilingual Sparse Autoencoders and Principled Layer Selection

cs.CL · 2026-05-21 · unverdicted · novelty 6.0

Multilingual SAEs strengthen cross-lingual representations for reliable steering and an intersection-based rule selects effective layers without exhaustive search.

Pairwise matrices for sparse autoencoders: single-feature inspection mislabels causal axes

cs.LG · 2026-05-04 · unverdicted · novelty 6.0

Pairwise matrices for SAEs demonstrate that single-feature inspection mislabels causal axes, with joint suppression and matched-geometry controls revealing distinct output regimes not captured by single-feature or random perturbations.

Knowledge Vector of Logical Reasoning in Large Language Models

cs.CL · 2026-04-26 · unverdicted · novelty 6.0

Distinct linear knowledge vectors for deductive, inductive, and abductive reasoning in LLMs can be refined via complementary subspace constraints to improve performance through mutual knowledge sharing.

The Past Is Not Past: Memory-Enhanced Dynamic Reward Shaping

cs.LG · 2026-04-13 · unverdicted · novelty 6.0

MEDS improves LLM RL performance by up to 4.13 pass@1 and 4.37 pass@128 points by dynamically penalizing rollouts matching prevalent historical error clusters identified via memory-stored representations and density clustering.

NeuroCogMap Reveals Cognitive Organization of Large Language Models

q-bio.NC · 2026-07-01 · unverdicted · novelty 5.0

NeuroCogMap maps LLM internal representations into stable functional parcels tied to cognitive functions, failure modes, and human cortical activity during language tasks.

Locate, Steer, and Improve: A Practical Survey of Actionable Mechanistic Interpretability in Large Language Models

cs.CL · 2026-01-20 · unverdicted · novelty 5.0

The survey organizes mechanistic interpretability techniques into a Locate-Steer-Improve framework to enable actionable improvements in LLM alignment, capability, and efficiency.

Safe-SAIL: Towards a Fine-grained Safety Landscape of Large Language Models via Sparse Autoencoder Interpretation Framework

cs.LG · 2025-09-11 · unverdicted · novelty 5.0

Safe-SAIL supplies a pre-explanation metric and segment-level simulation to interpret 1758 safety SAE features across pornography, politics, violence, and terror, with public models and tools released.

From Weights to Activations: Is Steering the Next Frontier of Adaptation?

cs.CL · 2026-04-15 · unverdicted · novelty 4.0

Steering is positioned as a distinct adaptation paradigm that uses targeted activation interventions for local, reversible behavioral changes without parameter updates.

citing papers explorer

Showing 1 of 1 citing paper after filters.

The Past Is Not Past: Memory-Enhanced Dynamic Reward Shaping cs.LG · 2026-04-13 · unverdicted · none · ref 37
MEDS improves LLM RL performance by up to 4.13 pass@1 and 4.37 pass@128 points by dynamically penalizing rollouts matching prevalent historical error clusters identified via memory-stored representations and density clustering.

Llama scope: Extracting millions of features from llama-3.1-8b with sparse autoencoders.arXiv preprint arXiv:2410.20526, 2024

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer