Into the Rabbit Hull: From Task-Relevant Concepts in DINO to Minkowski Geometry

Andrew Lee; Binxu Wang; Demba Ba; Ekdeep S. Lubana; Martin Wattenberg; Matthew Kowal; Michael A. Lepori; Randall Balestriero; Sonia Joseph; Talia Konkle

arxiv: 2510.08638 · v3 · submitted 2025-10-08 · 💻 cs.CV · cs.AI

Thomas Fel, Binxu Wang, Michael A. Lepori, Matthew Kowal, Andrew Lee, Randall Balestriero, Sonia Joseph, Ekdeep S. Lubana, Talia Konkle, Demba Ba, Martin Wattenberg

Pith reviewed 2026-05-18 08:32 UTC · model grok-4.3

keywords: DINOv2 · sparse autoencoders · vision transformers · Minkowski representation hypothesis · convex mixtures · conceptual spaces · model interpretability · representation geometry

The pith

DINOv2 tokens are formed by combining convex mixtures of archetypes, such as a rabbit among animals or brown among colors.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper begins with the Linear Representation Hypothesis and extracts a 32,000-unit dictionary of concepts from DINOv2 using sparse autoencoders. It then shows how classification, segmentation, and depth estimation each recruit specialized subsets of these concepts, including negation-style features and boundary detectors. Geometric analysis reveals that representations are partly dense rather than strictly sparse, that the dictionary grows more coherent, and that tokens within an image occupy low-dimensional, locally connected sets that survive removal of position information. These patterns support a refined account in which tokens arise as convex combinations of archetype points that bound regions in representation space.
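
The "partly dense" observation can be made concrete with a small sketch. It assumes the SAE codes are available as a nested list of shape tokens x units; the function names and the 0.5 cutoff are hypothetical choices for illustration and are not taken from the paper.

```python
def firing_rates(codes):
    """Fraction of tokens on which each dictionary unit is active (code > 0)."""
    n_tokens = len(codes)
    n_units = len(codes[0])
    return [
        sum(1 for token in codes if token[j] > 0.0) / n_tokens
        for j in range(n_units)
    ]


def split_dense_sparse(codes, dense_threshold=0.5):
    """Label a unit 'dense' if it fires on more than `dense_threshold` of tokens.

    The 0.5 cutoff is an arbitrary illustrative choice, not a value from the paper.
    """
    rates = firing_rates(codes)
    dense = [j for j, r in enumerate(rates) if r > dense_threshold]
    sparse = [j for j, r in enumerate(rates) if r <= dense_threshold]
    return dense, sparse


# Toy codes: 4 tokens x 3 units. Unit 0 fires on every token, units 1 and 2 rarely.
toy_codes = [
    [0.9, 0.0, 0.0],
    [0.7, 0.3, 0.0],
    [1.1, 0.0, 0.0],
    [0.8, 0.0, 0.5],
]
dense_units, sparse_units = split_dense_sparse(toy_codes)
```

Under a strictly sparse code every unit would land in the sparse group; a non-empty dense group is the kind of signature the summary refers to.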

Core claim

Tokens are formed by combining convex mixtures of archetypes. This structure is grounded in Gärdenfors' conceptual spaces and in the model's mechanism, since multi-head attention produces sums of convex mixtures that define regions bounded by archetypes. The paper introduces the Minkowski Representation Hypothesis and examines its empirical signatures and implications for interpreting vision-transformer representations.

What carries the argument

Minkowski Representation Hypothesis (MRH): the proposal that tokens arise as convex mixtures of archetype vectors whose sums define bounded conceptual regions.
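
One way to write the hypothesis down, as a sketch under assumed notation: the symbols $H$ (number of archetype sets), $\mathcal{A}_h$, $a_{h,k}$, $K_h$, and $\alpha_{h,k}$ are introduced here for illustration and are not taken from the paper.

```latex
x \;=\; \sum_{h=1}^{H} \sum_{k=1}^{K_h} \alpha_{h,k}\, a_{h,k},
\qquad \alpha_{h,k} \ge 0, \quad \sum_{k=1}^{K_h} \alpha_{h,k} = 1 \ \text{ for each } h,

\text{equivalently} \quad
x \;\in\; \operatorname{conv}(\mathcal{A}_1) \oplus \cdots \oplus \operatorname{conv}(\mathcal{A}_H),
\qquad \mathcal{A}_h = \{a_{h,1}, \dots, a_{h,K_h}\}.
```

Each inner sum is a convex mixture over one archetype set, and the outer sum places the token in the Minkowski sum $\oplus$ of the corresponding convex hulls, which is presumably where the hypothesis takes its name.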

If this is right

Classification exploits negation concepts that activate everywhere except on target objects.
Segmentation draws on coherent subspaces of boundary detectors.
Depth estimation recruits three distinct monocular cues that align with known visual neuroscience.
The dictionary evolves away from orthogonal ideals toward greater internal coherence.
Interpretability tools must treat representations as sums of convex mixtures rather than isolated linear features.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

If MRH holds, editing a small set of archetype directions could control higher-level object properties in a model.
The same convex-mixture structure may appear in other transformer families once their representations are probed with comparable dictionaries.
Connecting the hypothesis to Gärdenfors' theory opens the possibility of mapping model archetypes onto human-perceived quality dimensions such as hue or texture.

Load-bearing premise

The observed density, coherence, and low-dimensional connected token sets directly indicate convex-mixture organization rather than other non-convex or non-archetype structures.

What would settle it

Showing that tokens inside single images do not occupy low-dimensional locally connected sets after position removal, or that the learned dictionary fails to bound convex regions whose mixtures reconstruct the observed activations.
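
A hull-membership test of the kind described above could look like the following minimal sketch. It assumes candidate archetypes and a token activation are given as plain coordinate tuples, and it uses the Frank-Wolfe algorithm with exact line search as a stand-in solver; the function name `distance_to_hull` and the toy triangle are hypothetical, and nothing here reproduces the paper's procedure.

```python
import math


def distance_to_hull(point, archetypes, max_iter=500, tol=1e-10):
    """Approximate Euclidean distance from `point` to the convex hull of `archetypes`.

    Uses Frank-Wolfe with exact line search over the probability simplex.
    Returns (distance, weights), where weights are the convex coefficients found.
    """
    dim = len(point)
    n = len(archetypes)
    weights = [0.0] * n
    weights[0] = 1.0                      # start at the first archetype
    current = list(archetypes[0])

    for _ in range(max_iter):
        residual = [current[i] - point[i] for i in range(dim)]
        # Linear minimization step: the archetype with the smallest inner
        # product against the gradient (which equals the residual).
        scores = [sum(a[i] * residual[i] for i in range(dim)) for a in archetypes]
        best = min(range(n), key=scores.__getitem__)
        direction = [archetypes[best][i] - current[i] for i in range(dim)]
        gap = -sum(residual[i] * direction[i] for i in range(dim))  # Frank-Wolfe gap
        norm_sq = sum(d * d for d in direction)
        if gap <= tol or norm_sq == 0.0:
            break
        step = min(1.0, gap / norm_sq)    # exact line search, clipped to [0, 1]
        weights = [(1.0 - step) * w for w in weights]
        weights[best] += step
        current = [current[i] + step * direction[i] for i in range(dim)]

    dist = math.sqrt(sum((current[i] - point[i]) ** 2 for i in range(dim)))
    return dist, weights


# Toy check with a triangle of hypothetical archetypes in the plane.
triangle = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
inside_dist, inside_w = distance_to_hull((0.25, 0.25), triangle)
outside_dist, _ = distance_to_hull((1.0, 1.0), triangle)
```

A token whose distance stays clearly above zero lies outside the hull of the chosen archetypes, so a large share of such tokens would count against the convex-mixture reading for that archetype set.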

Original abstract

DINOv2 is routinely deployed to recognize objects, scenes, and actions; yet the nature of what it perceives remains unknown. As a working baseline, we adopt the Linear Representation Hypothesis (LRH) and operationalize it using SAEs, producing a 32,000-unit dictionary that serves as the interpretability backbone of our study, which unfolds in three parts. In the first part, we analyze how different downstream tasks recruit concepts from our learned dictionary, revealing functional specialization: classification exploits "Elsewhere" concepts that fire everywhere except on target objects, implementing learned negations; segmentation relies on boundary detectors forming coherent subspaces; depth estimation draws on three distinct monocular depth cues matching visual neuroscience principles. Following these functional results, we analyze the geometry and statistics of the concepts learned by the SAE. We found that representations are partly dense rather than strictly sparse. The dictionary evolves toward greater coherence and departs from maximally orthogonal ideals (Grassmannian frames). Within an image, tokens occupy a low dimensional, locally connected set persisting after removing position. These signs suggest representations are organized beyond linear sparsity alone. Synthesizing these observations, we propose a refined view: tokens are formed by combining convex mixtures of archetypes (e.g., a rabbit among animals, brown among colors, fluffy among textures). This structure is grounded in Gardenfors' conceptual spaces and in the model's mechanism as multi-head attention produces sums of convex mixtures, defining regions bounded by archetypes. We introduce the Minkowski Representation Hypothesis (MRH) and examine its empirical signatures and implications for interpreting vision-transformer representations.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper observes concrete patterns in DINOv2 SAE representations across tasks and geometry, then proposes the Minkowski Representation Hypothesis as convex mixtures of archetypes, but the data do not yet force that interpretation over simpler alternatives.

The main takeaway is that the authors map how DINOv2 recruits concepts for different tasks and then argue that the resulting geometry fits convex mixtures of archetypes rather than pure linear features. They call this the Minkowski Representation Hypothesis and tie it to conceptual spaces and attention mechanics. That framing is the clearest new piece here.

They do useful work showing task-specific patterns. Classification pulls concepts that fire everywhere except the target, which looks like learned negation. Segmentation uses boundary detectors that form subspaces. Depth estimation draws on three monocular cues that match known visual principles. These breakdowns give a practical sense of what the model actually computes on real tasks.

The geometry observations are also straightforward: representations are partly dense rather than strictly sparse, the SAE dictionary moves toward coherence instead of staying maximally orthogonal, and tokens sit in low-dimensional, locally connected sets that survive removal of position information. These are measurable patterns worth recording.

The soft spot is the leap from those patterns to convex mixtures bounded by archetypes. Low-dimensional connected sets can arise from non-convex manifolds or unions of subspaces. Partial density can come from overlapping linear features without requiring convexity. Softmax attention weights are nonnegative and sum to one, so each head outputs a convex combination of its value vectors, but that alone does not show that tokens lie inside hulls spanned by a specific set of archetypes. The paper does not appear to include direct tests that rule out these other structures, so the central claim rests more on interpretation than on decisive evidence.

Readers working on ViT interpretability or geometric views of concepts will find the observations and the new hypothesis worth thinking about. The work is clear enough, and grounded enough in actual model runs, that it deserves a serious referee, though the authors will likely need to add controls that separate the convex-mixture story from other low-dimensional explanations.

Referee Report

2 major / 2 minor

Summary. The manuscript analyzes DINOv2 representations via a 32,000-unit SAE dictionary. It reports functional specialization across downstream tasks: classification recruits 'Elsewhere' negation concepts, segmentation uses boundary-detector subspaces, and depth estimation draws on three monocular cues aligned with neuroscience. Geometric and statistical analysis reveals partly dense (not strictly sparse) representations, dictionary evolution toward coherence rather than Grassmannian orthogonality, and low-dimensional locally connected token sets that persist after position removal. These observations are synthesized into the Minkowski Representation Hypothesis (MRH), which posits that tokens are convex mixtures of archetypes (e.g., rabbit/brown/fluffy) grounded in Gärdenfors' conceptual spaces and the summation behavior of multi-head attention.

Significance. If the MRH is substantiated, the work would refine the Linear Representation Hypothesis by introducing convex geometry and archetype-bounded regions, with direct implications for mechanistic interpretability of vision transformers and potential bridges to cognitive science. The task-specific concept analyses and geometric signatures constitute concrete, reproducible observations that could serve as benchmarks for future studies.

major comments (2)

[Synthesis of observations] The synthesis section claims that partly dense representations, coherent dictionary evolution, and persistent low-dimensional locally connected token sets indicate convex-mixture organization of tokens. However, these signatures are also compatible with non-convex manifolds or unions of linear subspaces; no explicit controls, quantitative metrics, or ablation experiments are described that would rule out the alternatives and establish convexity specifically.
[Mechanism discussion] The mechanism paragraph states that multi-head attention produces sums of convex mixtures. While softmax weights do guarantee affine combinations (summing to 1), the manuscript does not provide evidence that the resulting points lie inside archetype-bounded convex regions as required by the MRH; the link from attention mechanics to Minkowski geometry therefore remains interpretive rather than demonstrated.
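
The softmax property behind the mechanism comment can be checked on a toy example. Softmax weights are nonnegative as well as summing to one, so a single head's output for one query is a convex combination of that head's value vectors. The sketch below assumes one query and one head, and omits the output projection and residual connection of a real transformer; the values and logits are made up. It says nothing about whether value vectors behave like archetypes, which is the part the comment calls interpretive.

```python
import math


def softmax(scores):
    """Numerically stable softmax; outputs are nonnegative and sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_head_output(scores, values):
    """Single-head attention output for one query: a softmax-weighted sum of value vectors."""
    weights = softmax(scores)
    dim = len(values[0])
    output = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return weights, output


# Toy example: three value vectors in 2-D and arbitrary attention logits.
values = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0)]
weights, output = attention_head_output([1.0, 0.5, -0.3], values)
```

Because the weights form a point on the probability simplex, `output` necessarily falls inside the triangle spanned by the three value vectors.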

minor comments (2)

[Introduction of MRH] The reference to Gärdenfors' conceptual spaces is introduced without a full citation; add the standard reference in the main text and clarify which specific aspects of the theory are being invoked.
[Geometric and statistical analysis] Quantification of 'coherence' in the SAE dictionary (departure from Grassmannian frames) should be specified with the exact metric or statistic used, to allow replication.
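
As an illustration of what a fully specified coherence statistic could look like, the sketch below computes mutual coherence (the largest absolute cosine similarity between distinct atoms) and compares it with the Welch bound, the lower bound that equiangular tight frames attain when they exist. This is one common choice and an assumption on our part; the manuscript may use a different statistic.

```python
import math


def normalize(vec):
    """Scale a vector to unit Euclidean norm."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def mutual_coherence(atoms):
    """Largest absolute cosine similarity between two distinct dictionary atoms."""
    unit = [normalize(a) for a in atoms]
    worst = 0.0
    for i in range(len(unit)):
        for j in range(i + 1, len(unit)):
            dot = sum(x * y for x, y in zip(unit[i], unit[j]))
            worst = max(worst, abs(dot))
    return worst


def welch_bound(n_atoms, dim):
    """Lower bound on mutual coherence for n_atoms unit vectors in R^dim (n_atoms >= dim)."""
    return math.sqrt((n_atoms - dim) / (dim * (n_atoms - 1)))


# Three unit vectors at 120-degree spacing in the plane meet the bound exactly.
frame = [(math.cos(2 * math.pi * k / 3), math.sin(2 * math.pi * k / 3)) for k in range(3)]
coherence = mutual_coherence(frame)
bound = welch_bound(3, 2)
```

A dictionary that departs from the Grassmannian ideal would show a coherence well above the bound for its size and dimension.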

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed comments, which help clarify the scope and limitations of our synthesis and mechanistic discussion. We respond to each major comment below.

Referee: [Synthesis of observations] The synthesis section claims that partly dense representations, coherent dictionary evolution, and persistent low-dimensional locally connected token sets indicate convex-mixture organization of tokens. However, these signatures are also compatible with non-convex manifolds or unions of linear subspaces; no explicit controls, quantitative metrics, or ablation experiments are described that would rule out the alternatives and establish convexity specifically.

Authors: We agree that the reported signatures—partly dense representations, coherent dictionary evolution, and persistent low-dimensional locally connected token sets—are not unique to convex-mixture organization and could also arise from non-convex manifolds or unions of linear subspaces. The Minkowski Representation Hypothesis is presented as a synthesis that integrates these geometric observations with the task-specific functional results and the convex-combination properties of attention, motivated by Gärdenfors' conceptual spaces. The original manuscript does not include explicit ablations, quantitative convexity metrics, or controls to rule out the alternatives. In the revised version we will add a paragraph in the synthesis section that explicitly acknowledges these alternative interpretations and outlines potential future experiments, such as interpolation tests or hull-violation checks, to distinguish convexity more rigorously. revision: partial
Referee: [Mechanism discussion] The mechanism paragraph states that multi-head attention produces sums of convex mixtures. While softmax weights do guarantee affine combinations (summing to 1), the manuscript does not provide evidence that the resulting points lie inside archetype-bounded convex regions as required by the MRH; the link from attention mechanics to Minkowski geometry therefore remains interpretive rather than demonstrated.

Authors: The referee is correct that softmax attention produces affine combinations, but the manuscript does not supply direct empirical evidence that the resulting points lie inside archetype-bounded convex regions. The discussion interprets the summation behavior of multi-head attention as generating points within such regions by reference to the convex-hull structure implied by the archetypes and conceptual-spaces grounding. This link is therefore interpretive rather than demonstrated. We will revise the mechanism paragraph to distinguish clearly between the guaranteed affine-combination property and the additional convexity assumption of the MRH, and we will add a brief note suggesting empirical tests for future work. revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation of Minkowski Representation Hypothesis

The paper first adopts the Linear Representation Hypothesis as a baseline and uses SAEs to extract a dictionary, then reports independent empirical observations on task recruitment, partial density of representations, dictionary coherence departing from orthogonality, and persistence of low-dimensional locally connected token sets after position removal. It synthesizes these into the MRH as an interpretive proposal that tokens form convex mixtures of archetypes, referencing Gärdenfors' conceptual spaces and the known summation property of multi-head attention. No equations, fitted parameters, or self-citations are shown that reduce the MRH to a redefinition or statistical forcing of its own inputs; the hypothesis remains an external interpretive lens applied to separate observations rather than a tautological restatement.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

The central claim rests on the adopted Linear Representation Hypothesis as a working baseline and on the interpretive leap that geometric signatures indicate convex mixtures; the MRH itself is introduced without independent falsifiable evidence beyond the described observations.

axioms (1)

domain assumption: Linear Representation Hypothesis (LRH)
Adopted as working baseline to operationalize concepts via SAEs in the first paragraph of the abstract.

invented entities (2)

Minkowski Representation Hypothesis (MRH) · no independent evidence
purpose: Refined geometric view that tokens are convex mixtures of archetypes bounded by attention sums.
Introduced in the synthesis section of the abstract as the organizing claim.

Archetypes · no independent evidence
purpose: Basic elements (e.g., rabbit, brown, fluffy) whose convex mixtures form tokens.
Postulated in the synthesis paragraph to ground the MRH in conceptual spaces.

pith-pipeline@v0.9.0 · 5859 in / 1565 out tokens · 85098 ms · 2026-05-18T08:32:46.029736+00:00 · methodology

Forward citations

Cited by 6 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Manifold Steering Reveals the Shared Geometry of Neural Network Representation and Behavior
cs.LG · 2026-05 · unverdicted · novelty 7.0
Manifold steering along activation geometry induces behavioral trajectories matching the natural manifold of outputs, while linear steering produces off-manifold unnatural behaviors.

Arithmetic in the Wild: Llama uses Base-10 Addition to Reason About Cyclic Concepts
cs.AI · 2026-05 · unverdicted · novelty 7.0
Llama-3.1-8B computes sums for cyclic concepts using base-10 addition via task-agnostic Fourier features with periods 2, 5, and 10 rather than modular arithmetic in the concept period.

From Measurement to Mitigation: Quantifying and Reducing Identity Leakage in Image Representation Encoders with Linear Subspace Removal
cs.CV · 2026-04 · conditional · novelty 7.0
Visual encoders leak identity information; a one-shot linear subspace removal method (ISP) reduces leakage to near-chance levels while retaining high non-biometric utility across datasets.

Circuit Mechanisms for Spatial Relation Generation in Diffusion Transformers
cs.AI · 2026-01 · unverdicted · novelty 7.0
DiTs use either a two-stage cross-attention circuit or text-token fusion circuit for spatial relations depending on the text encoder, achieving near-perfect in-domain accuracy but differing out-of-domain robustness.

A Distributional View for Visual Mechanistic Interpretability: KL-Minimal Soft-Constraint Principle
cs.CV · 2026-05 · unverdicted · novelty 6.0
The work introduces a distributional view of visual mechanistic interpretability that casts the task as KL-minimal optimization and realizes it through a soft-constraint principle implemented with energy-guided diffus...

Bilinear autoencoders find interpretable manifolds
cs.LG · 2026-05 · unverdicted · novelty 6.0
Bilinear autoencoders decompose neural activations into low-rank quadratic forms to discover interpretable multi-dimensional manifolds, improving reconstruction in language models and challenging linear representation...