Into the Rabbit Hull: From Task-Relevant Concepts in DINO to Minkowski Geometry
Pith reviewed 2026-05-18 08:32 UTC · model grok-4.3
The pith
DINOv2 tokens are formed as sums of convex mixtures of archetypes, such as a rabbit among animals or brown among colors.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Tokens are formed by combining convex mixtures of archetypes. This structure is grounded in Gärdenfors' conceptual spaces and in the model's own mechanism, since multi-head attention produces sums of convex mixtures, defining regions bounded by archetypes. We introduce the Minkowski Representation Hypothesis and examine its empirical signatures and implications for interpreting vision-transformer representations.
What carries the argument
Minkowski Representation Hypothesis (MRH): the proposal that tokens arise as convex mixtures of archetype vectors whose sums define bounded conceptual regions.
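One way to write the hypothesis down, as a sketch only: the symbols below (the number of archetype sets $H$, the sets $\mathcal{A}_h$, and the mixture weights $\alpha_{h,i}$) are notation introduced here for illustration and are not taken from the paper.

```latex
% Convex hull of one archetype set A_h = {a_{h,1}, ..., a_{h,k_h}}
\mathrm{conv}(\mathcal{A}_h) = \Big\{ \sum_{i=1}^{k_h} \alpha_i\, a_{h,i} \;:\; \alpha_i \ge 0,\ \sum_{i=1}^{k_h} \alpha_i = 1 \Big\}

% A token is a sum of one convex mixture per set,
% so it lies in the Minkowski sum of the hulls
x = \sum_{h=1}^{H} \sum_{i=1}^{k_h} \alpha_{h,i}\, a_{h,i}
\;\in\; \mathrm{conv}(\mathcal{A}_1) \oplus \cdots \oplus \mathrm{conv}(\mathcal{A}_H),
\qquad
A \oplus B = \{\, a + b : a \in A,\ b \in B \,\}
```

Under this reading, the "bounded conceptual regions" are polytopes whose vertices are sums of archetypes, one drawn from each set.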
If this is right
- Classification exploits negation concepts that activate everywhere except on target objects.
- Segmentation draws on coherent subspaces of boundary detectors.
- Depth estimation recruits three distinct monocular cues that align with known visual neuroscience.
- The dictionary evolves away from orthogonal ideals toward greater internal coherence.
- Interpretability tools must treat representations as sums of convex mixtures rather than isolated linear features.
Where Pith is reading between the lines
- If MRH holds, editing a small set of archetype directions could control higher-level object properties in a model.
- The same convex-mixture structure may appear in other transformer families once their representations are probed with comparable dictionaries.
- Connecting the hypothesis to Gärdenfors' theory opens the possibility of mapping model archetypes onto human-perceived quality dimensions such as hue or texture.
Load-bearing premise
The observed density, coherence, and low-dimensional connected token sets directly indicate convex-mixture organization rather than other non-convex or non-archetype structures.
What would settle it
Showing that tokens inside single images do not occupy low-dimensional locally connected sets after position removal, or that the learned dictionary fails to bound convex regions whose mixtures reconstruct the observed activations.
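The hull-violation half of this test can be sketched on toy data. Everything here is an assumption made for illustration: `hull_residual` is a hypothetical helper, the archetypes are hand-picked 2-D points rather than SAE atoms, and Frank-Wolfe with exact line search is used only because it needs nothing beyond the standard library. The paper is not known to use this procedure.

```python
def hull_residual(x, archetypes, steps=2000):
    """Approximate the distance from x to the convex hull of `archetypes`.

    Runs Frank-Wolfe with exact line search on 0.5 * ||p - x||^2, where p
    ranges over convex mixtures of the archetypes.
    Returns (distance, mixture_weights).
    """
    k, d = len(archetypes), len(x)
    w = [1.0 / k] * k  # start from the uniform mixture
    p = [sum(w[i] * archetypes[i][j] for i in range(k)) for j in range(d)]
    for _ in range(steps):
        r = [p[j] - x[j] for j in range(d)]  # gradient with respect to p
        # Archetype that minimizes the linearized objective.
        scores = [sum(r[j] * a[j] for j in range(d)) for a in archetypes]
        s = min(range(k), key=scores.__getitem__)
        step_dir = [archetypes[s][j] - p[j] for j in range(d)]
        denom = sum(v * v for v in step_dir)
        if denom == 0.0:
            break
        # Exact line search, clipped so the weights stay on the simplex.
        gamma = -sum(r[j] * step_dir[j] for j in range(d)) / denom
        gamma = max(0.0, min(1.0, gamma))
        if gamma == 0.0:
            break
        w = [(1.0 - gamma) * wi for wi in w]
        w[s] += gamma
        p = [p[j] + gamma * step_dir[j] for j in range(d)]
    dist = sum((p[j] - x[j]) ** 2 for j in range(d)) ** 0.5
    return dist, w
```

A residual near zero means the point is reconstructed by a convex mixture of the archetypes; a clearly positive residual is a hull violation. For the toy triangle `[[0, 0], [1, 0], [0, 1]]`, the point `[0.25, 0.25]` lies inside and `[1, 1]` lies outside.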
Original abstract
DINOv2 is routinely deployed to recognize objects, scenes, and actions; yet the nature of what it perceives remains unknown. As a working baseline, we adopt the Linear Representation Hypothesis (LRH) and operationalize it using SAEs, producing a 32,000-unit dictionary that serves as the interpretability backbone of our study, which unfolds in three parts. In the first part, we analyze how different downstream tasks recruit concepts from our learned dictionary, revealing functional specialization: classification exploits "Elsewhere" concepts that fire everywhere except on target objects, implementing learned negations; segmentation relies on boundary detectors forming coherent subspaces; depth estimation draws on three distinct monocular depth cues matching visual neuroscience principles. Following these functional results, we analyze the geometry and statistics of the concepts learned by the SAE. We found that representations are partly dense rather than strictly sparse. The dictionary evolves toward greater coherence and departs from maximally orthogonal ideals (Grassmannian frames). Within an image, tokens occupy a low-dimensional, locally connected set persisting after removing position. These signs suggest representations are organized beyond linear sparsity alone. Synthesizing these observations, we propose a refined view: tokens are formed by combining convex mixtures of archetypes (e.g., a rabbit among animals, brown among colors, fluffy among textures). This structure is grounded in Gärdenfors' conceptual spaces and in the model's mechanism as multi-head attention produces sums of convex mixtures, defining regions bounded by archetypes. We introduce the Minkowski Representation Hypothesis (MRH) and examine its empirical signatures and implications for interpreting vision-transformer representations.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript analyzes DINOv2 representations via a 32,000-unit SAE dictionary. It reports functional specialization across downstream tasks: classification recruits 'Elsewhere' negation concepts, segmentation uses boundary-detector subspaces, and depth estimation draws on three monocular cues aligned with neuroscience. Geometric and statistical analysis reveals partly dense (not strictly sparse) representations, dictionary evolution toward coherence rather than Grassmannian orthogonality, and low-dimensional locally connected token sets that persist after position removal. These observations are synthesized into the Minkowski Representation Hypothesis (MRH), which posits that tokens are sums of convex mixtures of archetypes (e.g., rabbit/brown/fluffy), grounded in Gärdenfors' conceptual spaces and the summation behavior of multi-head attention.
Significance. If the MRH is substantiated, the work would refine the Linear Representation Hypothesis by introducing convex geometry and archetype-bounded regions, with direct implications for mechanistic interpretability of vision transformers and potential bridges to cognitive science. The task-specific concept analyses and geometric signatures constitute concrete, reproducible observations that could serve as benchmarks for future studies.
major comments (2)
- [Synthesis of observations] The synthesis section claims that partly dense representations, coherent dictionary evolution, and persistent low-dimensional locally connected token sets indicate convex-mixture organization of tokens. However, these signatures are also compatible with non-convex manifolds or unions of linear subspaces; no explicit controls, quantitative metrics, or ablation experiments are described that would rule out the alternatives and establish convexity specifically.
- [Mechanism discussion] The mechanism paragraph states that multi-head attention produces sums of convex mixtures. Softmax weights are nonnegative and sum to 1, so each head's output is guaranteed to be a convex combination of that head's value vectors. However, the manuscript does not provide evidence that those value vectors, or the resulting sums, correspond to the archetype-bounded convex regions required by the MRH; the link from attention mechanics to Minkowski geometry therefore remains interpretive rather than demonstrated.
minor comments (2)
- [Introduction of MRH] The reference to Gärdenfors' conceptual spaces is introduced without a full citation; add the standard reference in the main text and clarify which specific aspects of the theory are being invoked.
- [Geometric and statistical analysis] Quantification of 'coherence' in the SAE dictionary (departure from Grassmannian frames) should be specified with the exact metric or statistic used, to allow replication.
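For concreteness, one standard statistic that could fill this gap is mutual coherence, compared against the Welch bound that equiangular Grassmannian frames attain. This is a sketch under assumptions: the page does not say which metric the paper uses, the helper names are hypothetical, and the three-vector dictionary is a toy example rather than SAE output.

```python
import math


def normalize(v):
    """Scale a vector to unit Euclidean norm."""
    n = math.sqrt(sum(c * c for c in v))
    return [c / n for c in v]


def mutual_coherence(atoms):
    """Largest absolute cosine similarity between two distinct atoms."""
    unit = [normalize(a) for a in atoms]
    return max(
        abs(sum(p * q for p, q in zip(unit[i], unit[j])))
        for i in range(len(unit))
        for j in range(i + 1, len(unit))
    )


def welch_bound(n_atoms, dim):
    """Lower bound on mutual coherence for n_atoms unit vectors in R^dim.

    Valid for n_atoms > dim; equiangular tight frames meet it with equality.
    """
    return math.sqrt((n_atoms - dim) / (dim * (n_atoms - 1)))


# Toy dictionary: three unit vectors 120 degrees apart in the plane.
# This frame is equiangular and tight, so it sits exactly on the bound.
toy_frame = [[1.0, 0.0], [-0.5, math.sqrt(3) / 2], [-0.5, -math.sqrt(3) / 2]]
mu = mutual_coherence(toy_frame)
bound = welch_bound(3, 2)
```

Reporting the gap between `mu` and `bound` over training would make the claimed departure from Grassmannian frames directly replicable, whichever statistic the authors actually used.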
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed comments, which help clarify the scope and limitations of our synthesis and mechanistic discussion. We respond to each major comment below.
- Referee: [Synthesis of observations] The synthesis section claims that partly dense representations, coherent dictionary evolution, and persistent low-dimensional locally connected token sets indicate convex-mixture organization of tokens. However, these signatures are also compatible with non-convex manifolds or unions of linear subspaces; no explicit controls, quantitative metrics, or ablation experiments are described that would rule out the alternatives and establish convexity specifically.
Authors: We agree that the reported signatures—partly dense representations, coherent dictionary evolution, and persistent low-dimensional locally connected token sets—are not unique to convex-mixture organization and could also arise from non-convex manifolds or unions of linear subspaces. The Minkowski Representation Hypothesis is presented as a synthesis that integrates these geometric observations with the task-specific functional results and the convex-combination properties of attention, motivated by Gärdenfors' conceptual spaces. The original manuscript does not include explicit ablations, quantitative convexity metrics, or controls to rule out the alternatives. In the revised version we will add a paragraph in the synthesis section that explicitly acknowledges these alternative interpretations and outlines potential future experiments, such as interpolation tests or hull-violation checks, to distinguish convexity more rigorously. revision: partial
- Referee: [Mechanism discussion] The mechanism paragraph states that multi-head attention produces sums of convex mixtures. Softmax weights are nonnegative and sum to 1, so each head's output is guaranteed to be a convex combination of that head's value vectors. However, the manuscript does not provide evidence that those value vectors, or the resulting sums, correspond to the archetype-bounded convex regions required by the MRH; the link from attention mechanics to Minkowski geometry therefore remains interpretive rather than demonstrated.
Authors: The referee is correct that the guarantee from softmax attention is limited: each head's output is a convex (and hence affine) combination of its value vectors, but the manuscript does not supply direct empirical evidence that the resulting points lie inside archetype-bounded convex regions. The discussion interprets the summation behavior of multi-head attention as generating points within such regions by appeal to the convex-hull structure implied by the archetypes and to the conceptual-spaces grounding. This link is therefore interpretive rather than demonstrated. We will revise the mechanism paragraph to distinguish clearly between the guaranteed per-head combination property and the additional assumption that the resulting regions are bounded by the learned archetypes, and we will add a brief note suggesting empirical tests for future work. revision: yes
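The guaranteed part of the mechanism can be checked on toy numbers. The sketch below is an illustration under assumptions, not the model's actual computation: the attention scores and value vectors are made up, and the value and output projections, residual stream, normalization, and MLP of a real transformer block are all omitted. It shows that softmax weights are nonnegative and sum to 1, so one head's output stays inside the convex hull of that head's value vectors, and that adding two heads gives a point in the Minkowski sum of the two hulls.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def head_output(scores, values):
    """One attention head, one query: softmax-weighted mean of value vectors."""
    w = softmax(scores)
    dim = len(values[0])
    out = [sum(w[i] * values[i][j] for i in range(len(values))) for j in range(dim)]
    return w, out


# Head A mixes the corners of a triangle; head B mixes two points on a segment.
values_a = [[0.0, 0.0], [2.0, 0.0], [0.0, 2.0]]
values_b = [[10.0, 10.0], [11.0, 10.0]]

w_a, out_a = head_output([0.3, -1.2, 2.0], values_a)
w_b, out_b = head_output([0.0, 1.0], values_b)

# Summing the heads lands in the Minkowski sum of the triangle and the segment.
token = [a + b for a, b in zip(out_a, out_b)]
```

What this does not show is the step the referee questions: that the value vectors, or the vertices of the resulting region, coincide with the archetypes recovered by the dictionary.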
Circularity Check
No significant circularity in derivation of Minkowski Representation Hypothesis
The paper first adopts the Linear Representation Hypothesis as a baseline and uses SAEs to extract a dictionary, then reports independent empirical observations on task recruitment, partial density of representations, dictionary coherence departing from orthogonality, and persistence of low-dimensional locally connected token sets after position removal. It synthesizes these into the MRH as an interpretive proposal that tokens form convex mixtures of archetypes, referencing Gärdenfors' conceptual spaces and the known summation property of multi-head attention. No equations, fitted parameters, or self-citations are shown that reduce the MRH to a redefinition or statistical forcing of its own inputs; the hypothesis remains an external interpretive lens applied to separate observations rather than a tautological restatement.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Linear Representation Hypothesis (LRH)
invented entities (2)
- Minkowski Representation Hypothesis (MRH): no independent evidence
- Archetypes: no independent evidence
Forward citations
Cited by 6 Pith papers
- Manifold Steering Reveals the Shared Geometry of Neural Network Representation and Behavior
  Manifold steering along activation geometry induces behavioral trajectories matching the natural manifold of outputs, while linear steering produces off-manifold unnatural behaviors.
- Arithmetic in the Wild: Llama uses Base-10 Addition to Reason About Cyclic Concepts
  Llama-3.1-8B computes sums for cyclic concepts using base-10 addition via task-agnostic Fourier features with periods 2, 5, and 10 rather than modular arithmetic in the concept period.
- From Measurement to Mitigation: Quantifying and Reducing Identity Leakage in Image Representation Encoders with Linear Subspace Removal
  Visual encoders leak identity information; a one-shot linear subspace removal method (ISP) reduces leakage to near-chance levels while retaining high non-biometric utility across datasets.
- Circuit Mechanisms for Spatial Relation Generation in Diffusion Transformers
  DiTs use either a two-stage cross-attention circuit or text-token fusion circuit for spatial relations depending on the text encoder, achieving near-perfect in-domain accuracy but differing out-of-domain robustness.
- A Distributional View for Visual Mechanistic Interpretability: KL-Minimal Soft-Constraint Principle
  The work introduces a distributional view of visual mechanistic interpretability that casts the task as KL-minimal optimization and realizes it through a soft-constraint principle implemented with energy-guided diffus...
- Bilinear autoencoders find interpretable manifolds
  Bilinear autoencoders decompose neural activations into low-rank quadratic forms to discover interpretable multi-dimensional manifolds, improving reconstruction in language models and challenging linear representation...