Mechanistic? arXiv preprint arXiv:2410.09087

Mechanistic? · 2024 · arXiv 2410.09087

8 Pith papers cite this work. Polarity classification is still indexing.

representative citing papers

Manifold Steering Reveals the Shared Geometry of Neural Network Representation and Behavior

cs.LG · 2026-05-06 · unverdicted · novelty 7.0

Manifold steering along activation geometry induces behavioral trajectories matching the natural manifold of outputs, while linear steering produces off-manifold unnatural behaviors.

Arithmetic in the Wild: Llama uses Base-10 Addition to Reason About Cyclic Concepts

cs.AI · 2026-05-01 · unverdicted · novelty 7.0

Llama-3.1-8B computes sums for cyclic concepts using base-10 addition via task-agnostic Fourier features with periods 2, 5, and 10 rather than modular arithmetic in the concept period.

Radical AI Interpretability

cs.AI · 2026-06-25 · unverdicted · novelty 6.0

A framework is proposed for solving for an AI system's beliefs and desires from its computational facts, with criteria for success tied to interpretability tests and emphasis on holistic attribution.

Stories in Space: In-Context Learning Trajectories in Conceptual Belief Space

cs.CL · 2026-05-12 · unverdicted · novelty 6.0

LLMs perform in-context learning as trajectories through a structured low-dimensional conceptual belief space, with the structure visible in both behavior and internal representations and causally manipulable via interventions.

Locate, Steer, and Improve: A Practical Survey of Actionable Mechanistic Interpretability in Large Language Models

cs.CL · 2026-01-20 · unverdicted · novelty 5.0

The survey organizes mechanistic interpretability techniques into a Locate-Steer-Improve framework to enable actionable improvements in LLM alignment, capability, and efficiency.

Do Activation Verbalization Methods Convey Privileged Information?

cs.CL · 2025-09-16 · unverdicted · novelty 5.0

Activation verbalization methods for LLMs largely reflect the verbalizer model's parametric knowledge rather than privileged information from the target model's activations.

From Weights to Activations: Is Steering the Next Frontier of Adaptation?

cs.CL · 2026-04-15 · unverdicted · novelty 4.0

Steering is positioned as a distinct adaptation paradigm that uses targeted activation interventions for local, reversible behavioral changes without parameter updates.

Mechanistic Interpretability Needs Philosophy

cs.CL · 2025-06-23 · unverdicted · novelty 4.0

The paper claims that mechanistic interpretability needs philosophy as a partner to clarify concepts, refine methods, and navigate epistemic and ethical complexities in AI systems.

citing papers explorer

Manifold Steering Reveals the Shared Geometry of Neural Network Representation and Behavior

cs.LG · 2026-05-06 · unverdicted · none · ref 114

Manifold steering along activation geometry induces behavioral trajectories matching the natural manifold of outputs, while linear steering produces off-manifold unnatural behaviors.
