One Lens, Many Worlds: A Capability-Typed Interface for World-Model Interpretability

Bhavith Chandra Challagundla; Hindol Roy Choudhury; Mohamed Deraz Nasr; Param Thakkar; Rishikesh Mallagundla; Sanskar Pandey; Shravani Challagundla; Spursh Deshpande; Wenhao Lu; Yugandhar Reddy Gogireddy

arXiv: 2606.09936 · v1 · pith:3LKX7XMBnew · submitted 2026-06-07 · cs.LG · cs.AI


Pith reviewed 2026-06-27 18:36 UTC · model grok-4.3

classification: cs.LG, cs.AI

keywords: world models, interpretability, capability-typed interface, activation patching, sparse autoencoders, probing, imagination rollouts, reinforcement learning


The pith

A capability-typed interface with four required methods lets the same interpretability code run on recurrent, token-based, and embedding world models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

World models appear in recurrent state-space, autoregressive token, and joint-embedding forms, yet each new substrate forces fresh implementations of probing, activation patching, sparse autoencoders, and surprise analysis. The paper traces the duplication to tooling that assumes transformer language models and therefore lacks primitives for actions, environment steps, or imagined trajectories. It supplies WorldModelLens, a thin adapter in which every model must expose encode, transition, initial state, and sample, plus declare optional heads through an explicit capability descriptor. A uniform hook-and-cache layer then supplies time-indexed activations and intervention replay, so each analysis is written once against the interface. Reinforcement-learning models and self-supervised models thereby become interchangeable targets without either architecture being forced to imitate the other.

Core claim

The shared structure of world models is captured by a small typed interface. Every model implements four required methods (encode, transition, initial state, sample) and declares a set of optional heads (decode, reward, continue, actor, critic) through an explicit capability descriptor. A single hook and cache layer exposes time-indexed activations, imagination rollouts, and intervention replay over this interface, allowing each analysis to be written once.
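
The four required methods and the capability descriptor named in the core claim can be pictured as a small structural type. The sketch below is a hypothetical rendering in Python, not the WorldModelLens API: the names `Capabilities`, `WorldModel`, `imagine`, and `rewards_if_declared`, the method signatures, and the scalar toy states are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Any, FrozenSet, List, Optional, Protocol, Sequence

# Optional heads listed in the abstract; everything else here is illustrative.
OPTIONAL_HEADS = ("decode", "reward", "continue", "actor", "critic")


@dataclass(frozen=True)
class Capabilities:
    """Hypothetical descriptor: the set of optional heads a model exposes."""
    heads: FrozenSet[str] = frozenset()

    def __post_init__(self) -> None:
        unknown = set(self.heads) - set(OPTIONAL_HEADS)
        if unknown:
            raise ValueError(f"unknown heads: {sorted(unknown)}")

    def has(self, head: str) -> bool:
        return head in self.heads


class WorldModel(Protocol):
    """Assumed adapter shape: four required methods plus a descriptor."""
    capabilities: Capabilities

    def encode(self, obs: Any) -> Any: ...
    def transition(self, state: Any, action: Optional[Any] = None) -> Any: ...
    def initial_state(self) -> Any: ...
    def sample(self, state: Any) -> Any: ...


class ToyRecurrent:
    """Stand-in for an RL-style model that declares a reward head."""
    capabilities = Capabilities(frozenset({"reward"}))

    def initial_state(self) -> float:
        return 0.0

    def encode(self, obs: float) -> float:
        return float(obs)

    def transition(self, state: float, action: Optional[float] = None) -> float:
        return 0.5 * state + (action or 0.0)

    def sample(self, state: float) -> float:
        return state  # deterministic toy: the "sample" is the state itself

    def reward(self, state: float) -> float:
        return -abs(state)


class ToyEmbedding:
    """Stand-in for a self-supervised model: no actions, no optional heads."""
    capabilities = Capabilities()

    def initial_state(self) -> float:
        return 0.0

    def encode(self, obs: float) -> float:
        return float(obs)

    def transition(self, state: float, action: Optional[float] = None) -> float:
        return 0.25 * state  # the action is ignored

    def sample(self, state: float) -> float:
        return state


def imagine(model: WorldModel, obs: Any, actions: Sequence[Any]) -> List[Any]:
    """Roll the model forward using only the required methods."""
    state = model.sample(model.encode(obs))
    states = [state]
    for action in actions:
        state = model.sample(model.transition(state, action))
        states.append(state)
    return states


def rewards_if_declared(model: Any, states: Sequence[Any]) -> Optional[List[Any]]:
    """Use the reward head only when the descriptor declares it."""
    if not model.capabilities.has("reward"):
        return None
    return [model.reward(s) for s in states]
```

In this sketch the same `imagine` function serves both toy models, and `rewards_if_declared` returns `None` for the model without a reward head rather than requiring it to fake one.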

What carries the argument

The capability-typed adapter requiring every model to implement encode, transition, initial state, and sample while declaring optional heads via an explicit capability descriptor.

If this is right

Probing, activation patching, sparse autoencoders, and surprise analysis each become architecture-independent once written against the interface.
Reinforcement-learning world models with actor-critic heads and self-supervised models without actions are handled by the same code.
A single hook-and-cache implementation supplies time-indexed activations and intervention replay for any compliant model.
Imagination rollouts receive the same analysis primitives as real trajectories.
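
A rough picture of the hook-and-cache layer these consequences rely on is sketched below. The function names `run_with_cache` and `run_with_hooks` and the `(name, t)` cache key come from the caption of Figure 3; the signatures, the single hook point called `"state"`, and the scalar `ToyModel` are assumptions, so this should be read as an illustration rather than the library's implementation.

```python
from typing import Callable, Dict, Optional, Sequence, Tuple

Key = Tuple[str, int]                    # (activation name, time step)
HookFn = Callable[[float, int], float]   # (value, t) -> replacement value


class ToyModel:
    """Scalar stand-in exposing the four required methods."""

    def initial_state(self) -> float:
        return 0.0

    def encode(self, obs: float) -> float:
        return float(obs)

    def transition(self, state: float, action: float) -> float:
        return 0.5 * state + action

    def sample(self, state: float) -> float:
        return state


def run_with_hooks(
    model: ToyModel,
    obs: float,
    actions: Sequence[float],
    hooks: Optional[Dict[str, HookFn]] = None,
    cache: Optional[Dict[Key, float]] = None,
) -> float:
    """Roll out the model; hooks may overwrite an activation before it is used."""
    hooks = hooks or {}

    def hook_point(name: str, t: int, value: float) -> float:
        fn = hooks.get(name)
        if fn is not None:
            value = fn(value, t)        # intervention: replace the activation
        if cache is not None:
            cache[(name, t)] = value    # record under a (name, t) key
        return value

    state = hook_point("state", 0, model.sample(model.encode(obs)))
    for t, action in enumerate(actions, start=1):
        state = hook_point("state", t, model.sample(model.transition(state, action)))
    return state


def run_with_cache(
    model: ToyModel, obs: float, actions: Sequence[float]
) -> Tuple[float, Dict[Key, float]]:
    """Roll out without interventions and return every recorded activation."""
    cache: Dict[Key, float] = {}
    final = run_with_hooks(model, obs, actions, cache=cache)
    return final, cache


model = ToyModel()
final, cache = run_with_cache(model, 4.0, [1.0, 1.0])


def ablate_t1(value: float, t: int) -> float:
    """Zero the state at t = 1 and leave every other step untouched."""
    return 0.0 if t == 1 else value


ablated = run_with_hooks(model, 4.0, [1.0, 1.0], hooks={"state": ablate_t1})
```

Because the rollout loop touches the model only through `encode`, `transition`, and `sample`, the same two functions would apply to any object with those methods, which is the property the list above depends on.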

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

New world-model papers could be expected to ship the four methods and descriptor as a compatibility requirement.
Direct numerical comparison of failure modes across recurrent, token, and embedding families becomes feasible once the tooling layer is shared.
An automated checker could validate that a submitted model satisfies the interface before any interpretability experiment is attempted.
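
The automated checker suggested above could look roughly like the following. This is a hypothetical sketch: `check_adapter`, the representation of the capability descriptor as a set of head names stored in a `capabilities` attribute, and the `continue_` spelling are all assumptions, and the check covers only the presence of methods, not their behavior.

```python
from typing import List

REQUIRED_METHODS = ("encode", "transition", "initial_state", "sample")
OPTIONAL_HEADS = ("decode", "reward", "continue", "actor", "critic")


def head_attr(head: str) -> str:
    # "continue" is a reserved word in Python, so this sketch assumes the
    # corresponding method is spelled "continue_".
    return "continue_" if head == "continue" else head


def check_adapter(model: object) -> List[str]:
    """Return a list of interface problems; an empty list means compliant."""
    problems: List[str] = []
    for name in REQUIRED_METHODS:
        if not callable(getattr(model, name, None)):
            problems.append(f"missing required method: {name}")

    declared = getattr(model, "capabilities", None)
    if declared is None:
        problems.append("missing capability descriptor")
        return problems

    for head in sorted(declared):
        if head not in OPTIONAL_HEADS:
            problems.append(f"unknown head declared: {head}")
        elif not callable(getattr(model, head_attr(head), None)):
            problems.append(f"declared head not implemented: {head}")
    return problems


class Good:
    capabilities = frozenset({"reward"})

    def encode(self, obs): return obs
    def transition(self, state, action=None): return state
    def initial_state(self): return 0.0
    def sample(self, state): return state
    def reward(self, state): return 0.0


class Broken:
    # Declares a head it does not implement and one that is not in the list;
    # it also lacks the required sample method.
    capabilities = frozenset({"reward", "planner"})

    def encode(self, obs): return obs
    def transition(self, state, action=None): return state
    def initial_state(self): return 0.0


for problem in check_adapter(Broken()):
    print(problem)
```

A structural check of this kind would not show that the interface is sufficient for a given analysis; it only rules out adapters that are incomplete or that misdeclare their heads.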

Load-bearing premise

The four required methods together with the capability descriptor are sufficient to support probing, activation patching, sparse autoencoders, and surprise analysis without any architecture-specific code.

What would settle it

A researcher ports a previously unsupported world-model architecture to the four methods and descriptor, then finds that sparse autoencoders or activation patching still require custom per-architecture logic to produce correct results.

Figures

Figures reproduced from arXiv: 2606.09936 by Bhavith Chandra Challagundla, Hindol Roy Choudhury, Mohamed Deraz Nasr, Param Thakkar, Rishikesh Mallagundla, Sanskar Pandey, Shravani Challagundla, Spursh Deshpande, Wenhao Lu, Yugandhar Reddy Gogireddy.

**Figure 2.** The same interface, populated differently. Solid teal boxes are heads a family exposes; …

**Figure 3.** Two access patterns over a rollout. run_with_cache (teal) records every activation under a (name, t) key for later analysis. run_with_hooks (amber) installs a function f that overwrites a chosen activation in place, which is the basis of patching, ablation, and intervention replay. … and attn.hook_{query,key,value,pattern}, which makes the attention internals of transformer-token and joint-embedding models a…

**Figure 4.** Attribution and attention agree only weakly inside the I-JEPA predictor. Spearman rank …

Original abstract

World models are now built on substantially different computational substrates. Latent recurrent state-space models such as PlaNet and the Dreamer family compress observations into recurrent states; token-based models such as IRIS quantize observations into a learned codebook and predict autoregressively with a transformer; and joint-embedding predictive architectures such as I-JEPA predict in a learned latent space with no pixel decoder. The interpretability methods applied to these models, including probing, activation patching, sparse autoencoders, and surprise analysis, share a common set of primitives, yet they are re-implemented from scratch for each architecture because existing hook-and-cache tooling assumes a transformer language model with no notion of actions, environment steps, or imagined rollouts. We argue that this fragmentation reflects the tooling rather than the models, and that the shared structure of world models is captured by a small typed interface. We present WorldModelLens, an open-source interpretability substrate organized around a capability-typed adapter: every model implements four required methods (encode, transition, initial state, sample) and declares a set of optional heads (decode, reward, continue, actor, critic) through an explicit capability descriptor, so that reinforcement-learning and self-supervised world models are first-class without either imitating the other. A single hook and cache layer exposes time-indexed activations, imagination rollouts, and intervention replay over this interface, allowing each analysis to be written once.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper proposes a typed interface to unify interpretability tooling across world models but supplies no evidence the four methods suffice.


The main takeaway is that WorldModelLens defines a capability-typed adapter with four required methods (encode, transition, initial state, sample) plus optional heads declared explicitly. This is meant to let probing, patching, sparse autoencoders, and surprise analysis be written once instead of per architecture.

The paper correctly identifies the practical issue: existing hook-and-cache tools assume plain transformer language models and ignore actions, environment steps, or rollouts. That fragmentation is real for people working with PlaNet-style recurrent models, IRIS token models, or I-JEPA embeddings.

The interface design itself is straightforward and avoids forcing every model into a single shape by using the capability descriptor. That part is a clean engineering move.

The soft spot is the central assumption that these four methods plus the descriptor expose everything needed. The abstract asserts coverage but gives no examples, no coverage argument, and no counter-example checks. If surprise analysis requires the full next-token distribution instead of samples, or if SAE training needs direct access to non-exposed tensors, the single-analysis guarantee would break. No implementation or test results are described.
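
The concern about distributions versus samples can be made concrete with a small numerical sketch. It assumes, for illustration only, that surprise is measured as the negative log-probability of the observed next token (the text here does not give the paper's definition); `exact_surprisal` and `sampled_surprisal` are hypothetical helper names.

```python
import math
import random
from collections import Counter
from typing import Callable, Sequence


def exact_surprisal(logits: Sequence[float], observed: int) -> float:
    """-log softmax(logits)[observed]; needs the full next-token distribution."""
    peak = max(logits)  # subtract the max for numerical stability
    log_z = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_z - logits[observed]


def sampled_surprisal(draw: Callable[[], int], observed: int, n: int) -> float:
    """Monte Carlo estimate from n draws; infinite if the token never appears."""
    counts = Counter(draw() for _ in range(n))
    hits = counts[observed]
    return math.inf if hits == 0 else -math.log(hits / n)


# Toy codebook of three tokens; token 2 is rare under these logits.
logits = [2.0, 0.0, -6.0]
rng = random.Random(0)
weights = [math.exp(x) for x in logits]


def draw() -> int:
    """Sample one token index in proportion to exp(logit)."""
    return rng.choices(range(len(logits)), weights=weights)[0]


print("exact:  ", exact_surprisal(logits, observed=2))
print("sampled:", sampled_surprisal(draw, observed=2, n=100))
```

With access to logits the surprisal of a rare token is a finite, exact number; with samples alone the estimate is noisy and becomes infinite whenever the token is never drawn, which is the kind of gap the letter is pointing at.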

This is for researchers who already maintain world-model codebases and want to share interpretability code. A reader who cares about reducing duplicated engineering effort in RL and self-supervised model analysis would get value from the interface sketch.

The work shows clear thinking about shared structure. It deserves peer review so the full paper can be checked for actual coverage and usage examples.

Referee Report

1 major / 2 minor

Summary. The paper proposes WorldModelLens, a capability-typed interface for world models across architectures (latent recurrent state-space models like PlaNet/Dreamer, token-based models like IRIS, and joint-embedding models like I-JEPA). It defines four required methods (encode, transition, initial_state, sample) plus an explicit capability descriptor for optional heads (decode, reward, continue, actor, critic). A unified hook-and-cache layer then supports time-indexed activations, imagination rollouts, and interventions, allowing interpretability methods (probing, activation patching, sparse autoencoders, surprise analysis) to be written once rather than reimplemented per architecture.

Significance. If the interface is shown to be sufficient, the work would reduce fragmentation in interpretability tooling for world models by providing a reusable substrate that treats RL and self-supervised models uniformly, enabling single implementations of analyses over diverse computational substrates without architecture-specific code paths.

major comments (1)

[Abstract] Abstract: the central claim that the four required methods plus capability descriptor suffice to support the full set of listed interpretability methods (probing, activation patching, sparse autoencoders, surprise analysis) across the cited model families without architecture-specific code or loss of functionality is asserted but not accompanied by any coverage argument, implementation example, or test demonstrating that primitives such as full next-token distributions for surprise analysis or direct access to intermediate tensors for SAE training are exposed.

minor comments (2)

The manuscript would benefit from explicit pseudocode or a small worked example showing how, e.g., activation patching is expressed using only the declared interface methods.
Notation for the capability descriptor and how optional heads are declared should be formalized in a dedicated section or figure for clarity.
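
One way the worked example requested in the first minor comment might look is sketched here. It is not taken from the manuscript: the two-component state, the `readout` metric, the function names, and the `((name, t), value)` patch format are assumptions chosen so that activation patching can be written using only `encode`, `transition`, and `sample`.

```python
from typing import Dict, List, Optional, Sequence, Tuple

State = Dict[str, float]
Patch = Tuple[Tuple[str, int], float]  # ((component name, time step), value)


class ToyModel:
    """Toy model whose state has two named components, "h" and "z"."""

    def initial_state(self) -> State:
        return {"h": 0.0, "z": 0.0}

    def encode(self, obs: float) -> State:
        return {"h": obs, "z": 3.0 * obs}

    def transition(self, state: State, action: float) -> State:
        return {"h": 0.5 * state["h"] + action, "z": 0.5 * state["z"]}

    def sample(self, state: State) -> State:
        return dict(state)  # deterministic toy: copy the state


def readout(state: State) -> float:
    """Hypothetical metric computed on the final state."""
    return state["h"] + state["z"]


def rollout(model: ToyModel, obs: float, actions: Sequence[float],
            patch: Optional[Patch] = None) -> List[State]:
    """Roll out; if a patch is given, overwrite one component at one step."""
    states: List[State] = []
    state = model.sample(model.encode(obs))
    for t in range(len(actions) + 1):
        if t > 0:
            state = model.sample(model.transition(state, actions[t - 1]))
        if patch is not None and patch[0][1] == t:
            (name, _), value = patch
            state = {**state, name: value}
        states.append(state)
    return states


def patching_effects(model: ToyModel, clean_obs: float, corrupt_obs: float,
                     actions: Sequence[float]) -> Dict[Tuple[str, int], float]:
    """Fraction of the clean-vs-corrupt readout gap restored by each patch."""
    clean = rollout(model, clean_obs, actions)
    corrupt = rollout(model, corrupt_obs, actions)
    base = readout(corrupt[-1])
    gap = readout(clean[-1]) - base
    effects: Dict[Tuple[str, int], float] = {}
    for t, clean_state in enumerate(clean):
        for name, value in clean_state.items():
            patched = rollout(model, corrupt_obs, actions, patch=((name, t), value))
            effects[(name, t)] = (readout(patched[-1]) - base) / gap
    return effects


effects = patching_effects(ToyModel(), clean_obs=4.0, corrupt_obs=0.0,
                           actions=[0.0, 0.0])
```

In this toy setting, restoring the clean "z" component recovers three quarters of the readout gap and restoring "h" recovers one quarter, at every time step. Whether real recurrent, token, and embedding models admit an equally uniform treatment is exactly what the referee asks the authors to demonstrate.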

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive review and the clear identification of the gap in the abstract. We address the single major comment below.


Referee: [Abstract] Abstract: the central claim that the four required methods plus capability descriptor suffice to support the full set of listed interpretability methods (probing, activation patching, sparse autoencoders, surprise analysis) across the cited model families without architecture-specific code or loss of functionality is asserted but not accompanied by any coverage argument, implementation example, or test demonstrating that primitives such as full next-token distributions for surprise analysis or direct access to intermediate tensors for SAE training are exposed.

Authors: We agree that the abstract, due to length constraints, asserts sufficiency without an explicit coverage argument or inline examples. The full manuscript (Sections 3–4) defines the hook-and-cache layer that registers and exposes time-indexed intermediate activations from any model implementing the four core methods, directly supporting SAE training and activation patching on latent states or token embeddings without per-architecture code. For surprise analysis, the sample method returns next-state predictions; token-based models (e.g., IRIS) can declare a capability head exposing logits or full distributions when needed, while the capability descriptor ensures only supported primitives are used. We will revise the abstract to add one sentence summarizing this coverage and referencing the relevant sections. No new empirical test is required for an interface proposal, but the revision will make the claim traceable from the abstract.

Revision: yes

Circularity Check

0 steps flagged

No circularity: design proposal for typed interface with no derivations or self-referential reductions


The paper proposes a software abstraction (WorldModelLens) organized around four required methods and an explicit capability descriptor. No equations, fitted parameters, predictions, or self-citations appear in the provided text. The central claim is an engineering hypothesis about interface sufficiency for interpretability primitives across model families; this is not reduced to its inputs by construction, self-definition, or renaming. No load-bearing steps match any enumerated circularity pattern.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the domain assumption that the four methods plus capability descriptor are sufficient to expose all necessary state and rollout information for interpretability without architecture-specific extensions.

axioms (1)

domain assumption: World models share a common structure that can be captured by encode, transition, initial state, and sample methods plus optional heads.
This assumption is invoked to argue that fragmentation is due to tooling rather than fundamental differences.

invented entities (1)

capability-typed adapter (no independent evidence)
purpose: To provide a uniform interface and explicit descriptor so that a single hook-and-cache layer works across model types.
New abstraction introduced to solve the re-implementation problem.


[1] [1]

Ha and J

D. Ha and J. Schmidhuber. Recurrent World Models Facilitate Policy Evolution. InNeurIPS, 2018

2018

[2] [2]

Hafner, T

D. Hafner, T. Lillicrap, I. Fischer, R. Villegas, D. Ha, H. Lee, and J. Davidson. Learning Latent Dynamics for Planning from Pixels. InICML, 2019

2019

[3] [3]

Hafner, T

D. Hafner, T. Lillicrap, J. Ba, and M. Norouzi. Dream to Control: Learning Behaviors by Latent Imagination. InICLR, 2020

2020

[4] [4]

Hafner, T

D. Hafner, T. Lillicrap, M. Norouzi, and J. Ba. Mastering Atari with Discrete World Models. InICLR, 2021

2021

[5] [5]

Mastering Diverse Domains through World Models

D. Hafner, J. Pasukonis, J. Ba, and T. Lillicrap. Mastering Diverse Domains through World Models. arXiv:2301.04104, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[6] [6]

Micheli, E

V. Micheli, E. Alonso, and F. Fleuret. Transformers are Sample-Efficient World Models. InICLR, 2023

2023

[7] [7]

L. Chen, K. Lu, A. Rajeswaran, K. Lee, A. Grover, M. Laskin, P. Abbeel, A. Srinivas, and I. Mordatch. Decision Transformer: Reinforcement Learning via Sequence Modeling. InNeurIPS, 2021

2021

[8] [8]

Assran, Q

M. Assran, Q. Duval, I. Misra, P. Bojanowski, P. Vincent, M. Rabbat, Y. LeCun, and N. Ballas. Self-Supervised Learning from Images with a Joint-Embedding Predictive Architecture. InCVPR, 2023

2023

[9] [9]

Hansen, H

N. Hansen, H. Su, and X. Wang. TD-MPC2: Scalable, Robust World Models for Continuous Control. In ICLR, 2024

2024

[10] [10]

Revisiting Feature Prediction for Learning Visual Representations from Video

A. Bardes, Q. Garrido, J. Ponce, X. Chen, M. Rabbat, Y. LeCun, M. Assran, and N. Ballas. V-JEPA: Latent Video Prediction for Visual Representation Learning.arXiv:2404.08471, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[11] [11]

Cosmos World Foundation Model Platform for Physical AI

NVIDIA. Cosmos World Foundation Model Platform for Physical AI.arXiv:2501.03575, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[12] [12]

Nanda and J

N. Nanda and J. Bloom. TransformerLens: A Library for Mechanistic Interpretability of Generative Language Models.https://github.com/TransformerLensOrg/TransformerLens, 2022

2022

[13] [13]

Elhage, N

N. Elhage, N. Nanda, C. Olsson, T. Henighan, et al. A Mathematical Framework for Transformer Circuits.Transformer Circuits Thread, 2021

2021

[14] [14]

K. Wang, A. Variengien, A. Conmy, B. Shlegeris, and J. Steinhardt. Interpretability in the Wild: A Circuit for Indirect Object Identification in GPT-2 Small. InICLR, 2023

2023

[15] [15]

K. Meng, D. Bau, A. Andonian, and Y. Belinkov. Locating and Editing Factual Associations in GPT. In NeurIPS, 2022

2022

[16] [16]

Localizing Model Behavior with Path Patching

N. Goldowsky-Dill, C. MacLeod, L. Sato, and A. Arora. Localizing Model Behavior with Path Patching. arXiv:2304.05969, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[17] [17]

Cunningham, A

H. Cunningham, A. Ewart, L. Riggs, R. Huben, and L. Sharkey. Sparse Autoencoders Find Highly Interpretable Features in Language Models. InICLR, 2024

2024

[18] [18]

Bricken, A

T. Bricken, A. Templeton, J. Batson, et al. Towards Monosemanticity: Decomposing Language Models with Dictionary Learning.Transformer Circuits Thread, 2023

2023

[19] [19]

Understanding intermediate layers using linear classifier probes

G. Alain and Y. Bengio. Understanding Intermediate Layers Using Linear Classifier Probes. arXiv:1610.01644, 2016

work page internal anchor Pith review Pith/arXiv arXiv 2016

[20] [20]

Belinkov

Y. Belinkov. Probing Classifiers: Promises, Shortcomings, and Advances.Computational Linguistics, 48(1), 2022

2022

[21] [21]

Kokhlikyan, V

N. Kokhlikyan, V. Miglani, M. Martin, et al. Captum: A Unified and Generic Model Interpretability Library for PyTorch.arXiv:2009.07896, 2020

work page arXiv 2009

[22] [22]

Fiotto-Kaufman, A

J. Fiotto-Kaufman, A. R. Loftus, E. Todd, et al. NNsight and NDIF: Democratizing Access to Foundation Model Internals.arXiv:2407.14561, 2024. 11

work page arXiv 2024

[23] [23]

D. D. Johnson. Penzai and Treescope: Tools for Visualizing and Manipulating Neural Networks. https://github.com/google-deepmind/penzai, 2024

2024

[24] [24]

Sundararajan, A

M. Sundararajan, A. Taly, and Q. Yan. Axiomatic Attribution for Deep Networks. InICML, 2017

2017

[25] [25]

Jain and B

S. Jain and B. C. Wallace. Attention is not Explanation. InNAACL, 2019

2019

[26] [26]

Kornblith, M

S. Kornblith, M. Norouzi, H. Lee, and G. Hinton. Similarity of Neural Network Representations Revisited. InICML, 2019

2019

[27] [27]

R. T. Q. Chen, X. Li, R. Grosse, and D. Duvenaud. Isolating Sources of Disentanglement in Variational Autoencoders. InNeurIPS, 2018

2018

[28] [28]

Eastwood and C

C. Eastwood and C. K. I. Williams. A Framework for the Quantitative Evaluation of Disentangled Representations. InICLR, 2018

2018

[29] [29]

Kumar, P

A. Kumar, P. Sattigeri, and A. Balakrishnan. Variational Inference of Disentangled Latent Concepts from Unlabeled Observations. InICLR, 2018

2018

[30] [30]

Higgins, L

I. Higgins, L. Matthey, A. Pal, et al. beta-VAE: Learning Basic Visual Concepts with a Constrained Variational Framework. InICLR, 2017

2017

[31] [31]

E. Jang, S. Gu, and B. Poole. Categorical Reparameterization with Gumbel-Softmax. InICLR, 2017

2017

[32] [32]

C. J. Maddison, A. Mnih, and Y. W. Teh. The Concrete Distribution: A Continuous Relaxation of Discrete Random Variables. InICLR, 2017

2017

[33] [33]

Schrittwieser, I

J. Schrittwieser, I. Antonoglou, T. Hubert, et al. Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model.Nature, 588, 2020

2020

[34] [34]

Janner, Q

M. Janner, Q. Li, and S. Levine. Offline Reinforcement Learning as One Big Sequence Modeling Problem. InNeurIPS, 2021

2021

[35] [35]

Bruce, M

J. Bruce, M. Dennis, A. Edwards, et al. Genie: Generative Interactive Environments. InICML, 2024

2024

[36] [36]

Y. LeCun. A Path Towards Autonomous Machine Intelligence.OpenReview, 2022

2022

[37] [37]

K. He, X. Chen, S. Xie, Y. Li, P. Dollár, and R. Girshick. Masked Autoencoders Are Scalable Vision Learners. InCVPR, 2022

2022

[38] [38]

Grill, F

J.-B. Grill, F. Strub, F. Altché, et al. Bootstrap Your Own Latent: A New Approach to Self-Supervised Learning. InNeurIPS, 2020

2020

[39] [39]

T. Chen, S. Kornblith, M. Norouzi, and G. Hinton. A Simple Framework for Contrastive Learning of Visual Representations. InICML, 2020

[40]

M. Caron, H. Touvron, I. Misra, et al. Emerging Properties in Self-Supervised Vision Transformers. In ICCV, 2021

[41]

A. Radford, J. W. Kim, C. Hallacy, et al. Learning Transferable Visual Models from Natural Language Supervision. In ICML, 2021

[42]

B. A. Olshausen and D. J. Field. Emergence of Simple-Cell Receptive Field Properties by Learning a Sparse Code for Natural Images. Nature, 381, 1996

[43]

L. Gao, T. Dupré la Tour, H. Tillman, et al. Scaling and Evaluating Sparse Autoencoders. arXiv:2406.04093, 2024

[44]

S. Rajamanoharan, A. Conmy, L. Smith, et al. Improving Dictionary Learning with Gated Sparse Autoencoders. arXiv:2404.16014, 2024

[45]

A. Conmy, A. Mavor-Parker, A. Lynch, S. Heimersheim, and A. Garriga-Alonso. Towards Automated Circuit Discovery for Mechanistic Interpretability. In NeurIPS, 2023

[46]

N. Nanda. Attribution Patching: Activation Patching at Industrial Scale. https://neelnanda.io/attribution-patching, 2023

[47]

L. Chan, A. Garriga-Alonso, N. Goldowsky-Dill, et al. Causal Scrubbing: A Method for Rigorously Testing Interpretability Hypotheses. Alignment Forum, 2022

[48]

nostalgebraist. Interpreting GPT: The Logit Lens. LessWrong, 2020

[49]

D. Smilkov, N. Thorat, B. Kim, F. Viégas, and M. Wattenberg. SmoothGrad: Removing Noise by Adding Noise. arXiv:1706.03825, 2017

[50]

R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra. Grad-CAM: Visual Explanations from Deep Networks via Gradient-Based Localization. In ICCV, 2017

[51]

S. M. Lundberg and S.-I. Lee. A Unified Approach to Interpreting Model Predictions. In NeurIPS, 2017

[52]

W. Samek, A. Binder, G. Montavon, S. Lapuschkin, and K.-R. Müller. Evaluating the Visualization of What a Deep Neural Network Has Learned. IEEE TNNLS, 28(11), 2017

[53]

M. Raghu, J. Gilmer, J. Yosinski, and J. Sohl-Dickstein. SVCCA: Singular Vector Canonical Correlation Analysis for Deep Learning Dynamics and Interpretability. In NeurIPS, 2017

[54]

J. Hewitt and P. Liang. Designing and Interpreting Probes with Control Tasks. In EMNLP, 2019

[55]

K. Lee, K. Lee, H. Lee, and J. Shin. A Simple Unified Framework for Detecting Out-of-Distribution Samples and Adversarial Attacks. In NeurIPS, 2018

[56]

A. Vaswani, N. Shazeer, N. Parmar, et al. Attention Is All You Need. In NeurIPS, 2017

[57]

A. Dosovitskiy, L. Beyer, A. Kolesnikov, et al. An Image Is Worth 16x16 Words: Transformers for Image Recognition at Scale. In ICLR, 2021

[58]

D. P. Kingma and M. Welling. Auto-Encoding Variational Bayes. In ICLR, 2014

[59]

R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. MIT Press, 2nd edition, 2018

[60]

M. Oquab, T. Darcet, T. Moutakanni, et al. DINOv2: Learning Robust Visual Features without Supervision. Transactions on Machine Learning Research, 2024

[61]

Y. Bengio, A. Courville, and P. Vincent. Representation Learning: A Review and New Perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8), 2013

[62]

K. Li, A. K. Hopkins, D. Bau, F. Viégas, H. Pfister, and M. Wattenberg. Emergent World Representations: Exploring a Sequence Model Trained on a Synthetic Task. In ICLR, 2023

[63]

B. Kim, M. Wattenberg, J. Gilmer, C. Cai, J. Wexler, F. Viégas, and R. Sayres. Interpretability Beyond Feature Attribution: Quantitative Testing with Concept Activation Vectors (TCAV). In ICML, 2018

[64]

A. Geiger, H. Lu, T. Icard, and C. Potts. Causal Abstractions of Neural Networks. In NeurIPS, 2021

[65]

A. Templeton, T. Conerly, J. Marcus, et al. Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet. Transformer Circuits Thread, 2024

[66]

J. Vig, S. Gehrmann, Y. Belinkov, S. Qian, D. Nevo, Y. Singer, and S. Shieber. Investigating Gender Bias in Language Models Using Causal Mediation Analysis. In NeurIPS, 2020
