Jacobian Scopes: token-level causal attributions in LLMs
read the original abstract
Large language models (LLMs) make next-token predictions based on clues present in their context, such as semantic descriptions and in-context examples. Yet, elucidating which prior tokens most strongly influence a given prediction remains challenging due to the proliferation of layers and attention heads in modern architectures. We propose Jacobian Scopes, a suite of gradient-based, token-level causal attribution methods for interpreting LLM predictions. Grounded in perturbation theory and information geometry, Jacobian Scopes quantify how input tokens influence various aspects of a model's prediction, such as specific logits, the full predictive distribution, and model uncertainty (effective temperature). Through case studies spanning instruction understanding, translation, and in-context learning (ICL), we demonstrate how Jacobian Scopes reveal implicit political biases, uncover word- and phrase-level translation strategies, and shed light on recently debated mechanisms underlying in-context time-series forecasting. To facilitate exploration of Jacobian Scopes on custom text, we open-source our implementations and provide a cloud-hosted interactive demo at https://huggingface.co/spaces/Typony/JacobianScopes.
This paper has not been read by Pith yet.
Forward citations
Cited by 2 Pith papers
-
The Attribution Contract: Feature Attribution for Generative Language Models
Introduces the Attribution Contract specification to clarify feature attribution claims in generative language models by naming the output explained, eligible features, generative process, fixed elements, and attribut...
-
GIF: Locally Sound Geometric Information Flow Control for LLMs
GIF introduces a Jacobian-based upper bound on input-output mutual information in LLMs with formal Lean proof and strong empirical recall on injection and leakage benchmarks.
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.