I Walk the Line: Examining the Role of Gestalt Continuity in Object Binding for Vision Transformers
Pith reviewed 2026-05-10 16:49 UTC · model grok-4.3
The pith
Vision transformers rely on specific attention heads sensitive to Gestalt continuity to perform object binding.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Using synthetic datasets, we demonstrate that binding probes are sensitive to continuity across a wide range of pretrained vision transformers. Next, we uncover particular attention heads that track continuity, and show that these heads generalize across datasets. Finally, we ablate these attention heads, and show that they often contribute to producing representations that encode object binding.
What carries the argument
Particular attention heads that track continuity cues, which the experiments show contribute causally to the formation of object-binding representations.
If this is right
- Binding probes register continuity effects in many different pretrained vision transformer architectures.
- The continuity-sensitive heads maintain their behavior when tested on new synthetic datasets.
- Removing the continuity heads degrades the models' ability to produce representations that support object binding.
Where Pith is reading between the lines
- Models that lose these heads might still bind objects when continuity is absent but other cues like proximity or similarity remain strong.
- The same synthetic-dataset method could be used to test whether vision models also rely on additional Gestalt principles beyond continuity.
- Interventions that strengthen or suppress the continuity heads could be applied to control binding behavior in downstream vision tasks.
Load-bearing premise
The synthetic datasets must isolate continuity from all other Gestalt grouping cues and from unrelated image statistics so that measured effects can be attributed to continuity alone.
What would settle it
If ablating the identified continuity-tracking heads leaves binding-probe performance unchanged on the same synthetic test sets, the claim that those heads contribute to object binding would be refuted.
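The proposed falsification turns on head ablation. As an illustration of what "ablating a head" typically means mechanically, here is a minimal numpy sketch; this is not the paper's code, and the shapes, random weights, and the choice to zero a head before the output projection are all assumptions.

```python
import numpy as np

def multi_head_attention(x, Wq, Wk, Wv, Wo, n_heads, ablate_heads=()):
    """One self-attention layer with optional head ablation.

    "Ablation" here means zeroing a head's output before the output
    projection -- one simple stand-in for this kind of intervention.
    """
    seq, d = x.shape
    d_head = d // n_heads

    def split(t):  # (seq, d) -> (n_heads, seq, d_head)
        return t.reshape(seq, n_heads, d_head).transpose(1, 0, 2)

    q, k, v = split(x @ Wq), split(x @ Wk), split(x @ Wv)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)
    attn = np.exp(scores - scores.max(axis=-1, keepdims=True))
    attn /= attn.sum(axis=-1, keepdims=True)          # softmax over keys
    out = attn @ v                                    # (n_heads, seq, d_head)
    for h in ablate_heads:
        out[h] = 0.0                                  # knock out head h
    return out.transpose(1, 0, 2).reshape(seq, d) @ Wo

rng = np.random.default_rng(0)
seq, d, n_heads = 4, 8, 2
x = rng.normal(size=(seq, d))
Wq, Wk, Wv, Wo = (rng.normal(size=(d, d)) for _ in range(4))
full = multi_head_attention(x, Wq, Wk, Wv, Wo, n_heads)
ablated = multi_head_attention(x, Wq, Wk, Wv, Wo, n_heads, ablate_heads=(0,))
```

In a real ViT this would more likely be implemented as a forward hook that zeroes the chosen head's slice of the attention output, after which the binding probe is re-evaluated on the modified representations.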
Figures
- Figure 4: Difference between object binding probe test accuracy on Curves Object stimuli vs. Scrambled stimuli. [figure not extracted]
- Figure 6: Randomly selected stimuli from the Curves dataset ("Object" version). [figure not extracted]
Original abstract
Object binding is a foundational process in visual cognition, during which low-level perceptual features are joined into object representations. Binding has been considered a fundamental challenge for neural networks, and a major milestone on the way to artificial models with flexible visual intelligence. Recently, several investigations have demonstrated evidence that binding mechanisms emerge in pretrained vision models, enabling them to associate portions of an image that contain an object. The question remains: how are these models binding objects together? In this work, we investigate whether vision models rely on the principle of Gestalt continuity to perform object binding, over and above other principles like similarity and proximity. Using synthetic datasets, we demonstrate that binding probes are sensitive to continuity across a wide range of pretrained vision transformers. Next, we uncover particular attention heads that track continuity, and show that these heads generalize across datasets. Finally, we ablate these attention heads, and show that they often contribute to producing representations that encode object binding.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper investigates whether pretrained vision transformers rely on the Gestalt principle of continuity for object binding, beyond similarity and proximity. Using synthetic datasets, it claims that binding probes are sensitive to continuity across a range of ViTs, identifies specific attention heads that track continuity and generalize across datasets, and shows via ablation that these heads contribute to representations encoding object binding.
Significance. If substantiated with rigorous controls, the work would offer mechanistic insight into object binding in ViTs by linking it to classical Gestalt principles, potentially guiding more interpretable vision architectures. The empirical probe-and-ablation approach on synthetic data is well suited to testing such specific mechanistic hypotheses.
Major comments (2)
- [§3] §3 (synthetic data pipeline): The construction varies line continuity while attempting to hold other factors fixed, yet local edge density, endpoint proximity, and collinearity statistics remain correlated with the continuity manipulation. Because binding probes and attention-head analyses are trained on these stimuli, any reported sensitivity or causal contribution could reflect those correlated statistics rather than continuity per se.
- [Abstract] Abstract and results (implied by claims of 'demonstrate', 'uncover', and 'show'): The central claims about probe sensitivity, head generalization, and ablation effects are stated without quantitative metrics, controls, statistical tests, or error analysis. This prevents evaluation of the strength of evidence supporting the binding-probe and causal-head conclusions.
Minor comments (1)
- [Methods] Clarify the precise training procedure and evaluation metrics for the 'binding probes' early in the methods, as this is foundational to interpreting sensitivity results.
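For context on the comment above: one common construction of a "binding probe" is a linear classifier over pairs of token embeddings that predicts whether two patches belong to the same object. The following is a self-contained toy sketch under that assumption, with synthetic embeddings and a logistic-regression probe trained by gradient descent; it is not the paper's actual procedure.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for patch embeddings: each "object" is a cluster
# of patch vectors around its own centroid. Purely illustrative data.
def make_pairs(n_pairs, d=16):
    X, y = [], []
    for _ in range(n_pairs):
        c1, c2 = rng.normal(size=(2, d))
        same = rng.random() < 0.5
        a = c1 + 0.1 * rng.normal(size=d)
        b = (c1 if same else c2) + 0.1 * rng.normal(size=d)
        # pairwise features plus a constant bias term
        X.append(np.concatenate([a, b, a * b, [1.0]]))
        y.append(float(same))
    return np.array(X), np.array(y)

def train_probe(X, y, lr=0.1, steps=1000):
    """Logistic-regression probe trained by plain gradient descent."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w)))
        w -= lr * X.T @ (p - y) / len(y)
    return w

X_train, y_train = make_pairs(400)
X_test, y_test = make_pairs(200)
w = train_probe(X_train, y_train)
acc = float(np.mean(((X_test @ w) > 0.0) == y_test))
```

Spelling out the probe's features, optimizer, and train/test split at this level of detail early in the methods would address the comment.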
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and positive assessment of the work's potential to provide mechanistic insight into object binding in ViTs. We address each major comment point-by-point below, proposing revisions that strengthen the manuscript without overstating our current results.
Point-by-point responses
Referee: [§3] §3 (synthetic data pipeline): The construction varies line continuity while attempting to hold other factors fixed, yet local edge density, endpoint proximity, and collinearity statistics remain correlated with the continuity manipulation. Because binding probes and attention-head analyses are trained on these stimuli, any reported sensitivity or causal contribution could reflect those correlated statistics rather than continuity per se.
Authors: We agree this is a valid concern and that residual correlations could affect interpretation. While §3 describes our efforts to vary continuity while minimizing other Gestalt factors, we acknowledge that complete isolation is challenging. In the revised manuscript we add explicit controls: (i) partial-correlation analyses between probe outputs and the listed statistics, and (ii) a new matched-stimulus subset in which edge density, endpoint proximity, and collinearity are equated across continuity conditions. These controls show that binding-probe sensitivity and the identified heads remain selective for continuity. We also expand the discussion in §3 to quantify the residual correlations and their impact. revision: yes
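The partial-correlation control in (i) can be illustrated on toy data: when a nuisance statistic drives both the continuity manipulation and the probe output, the raw correlation is large but collapses once the nuisance is regressed out of both variables. A sketch under those assumptions, with hypothetical variable names:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000

# Hypothetical toy model of the confound the referee worries about:
# a nuisance statistic (say, local edge density) drives both the
# continuity manipulation and the probe output.
edge_density = rng.normal(size=n)
continuity = edge_density + rng.normal(size=n)       # correlated, not causal
probe_out = 2.0 * edge_density + rng.normal(size=n)  # probe tracks the confound

def corr(a, b):
    return np.corrcoef(a, b)[0, 1]

def residualize(v, z):
    """Remove the best linear fit of z from v (both centred)."""
    v, z = v - v.mean(), z - z.mean()
    return v - (v @ z) / (z @ z) * z

raw = corr(continuity, probe_out)
partial = corr(residualize(continuity, edge_density),
               residualize(probe_out, edge_density))
```

On this construction `raw` is sizeable while `partial` sits near zero; a continuity effect that survives the analogous control on real stimuli is the signature the proposed revision would look for.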
Referee: [Abstract] Abstract and results (implied by claims of 'demonstrate', 'uncover', and 'show'): The central claims about probe sensitivity, head generalization, and ablation effects are stated without quantitative metrics, controls, statistical tests, or error analysis. This prevents evaluation of the strength of evidence supporting the binding-probe and causal-head conclusions.
Authors: We accept that the abstract would benefit from explicit quantitative grounding. We have revised the abstract to report key metrics drawn from our existing experiments (probe accuracies, head-generalization rates, ablation-induced drops) together with references to the statistical tests, error bars, and control conditions already present in the results sections. These additions allow readers to assess the strength of evidence for probe sensitivity, head generalization, and causal contribution without altering the underlying claims. revision: yes
Circularity Check
No circularity: purely empirical probe-and-ablation study
Full rationale
The paper reports an empirical investigation that constructs synthetic stimuli, trains binding probes on pretrained vision transformers, identifies continuity-tracking attention heads, and performs causal ablations. No mathematical derivations, parameter fits presented as predictions, or load-bearing self-citations appear in the central claims. All reported sensitivities and contributions are measured outcomes on held-out or external data rather than tautological restatements of inputs or prior author results. The work therefore contains no steps that reduce by construction to their own definitions or citations.
Axiom & Free-Parameter Ledger
Axioms (1)
- Domain assumption: Gestalt continuity is a relevant and isolable principle for object binding in neural networks.
Reference graph
Works this paper leans on
- [1] Rim Assouel, Declan Campbell, Yoshua Bengio, and Taylor Webb. Visual symbolic mechanisms: Emergent symbol processing in vision language models. arXiv preprint arXiv:2506.15871, 2025.
- [2] Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9650–9660, 2021.
- [3] Peter Dayan and Laurence F Abbott. Theoretical Neuroscience: Computational and Mathematical Modeling of Neural Systems. MIT Press, 2005.
- [4] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255. IEEE, 2009.
- [5] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
- [6] Jiahai Feng and Jacob Steinhardt. How do language models bind entities in context? In The Twelfth International Conference on Learning Representations, 2023.
- [7] Sheridan Feucht, David Atkinson, Byron C Wallace, and David Bau. Token erasure as a footprint of implicit vocabulary items in LLMs. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 9727–9739, 2024.
- [8] Klaus Greff, Sjoerd Van Steenkiste, and Jürgen Schmidhuber. On the binding problem in artificial neural networks. arXiv preprint arXiv:2012.05208, 2020.
- [9] Wes Gurnee, Neel Nanda, Matthew Pauly, Katherine Harvey, Dmitrii Troitskii, and Dimitris Bertsimas. Finding neurons in a haystack: Case studies with sparse probing. arXiv preprint arXiv:2305.01610, 2023.
- [10] Ju He, Shuo Yang, Shaokang Yang, Adam Kortylewski, Xiaoding Yuan, Jie-Neng Chen, Shuai Liu, Cheng Yang, Qihang Yu, and Alan Yuille. PartImageNet: A large, high-quality dataset of parts. In European Conference on Computer Vision, pages 128–145. Springer, 2022.
- [11] Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. Masked autoencoders are scalable vision learners. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16000–16009, 2022.
- [12] Junkyung Kim, Drew Linsley, Kalpit Thakkar, and Thomas Serre. Disentangling neural mechanisms for perceptual grouping. In International Conference on Learning Representations, 2019.
- [13] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
- [14] Kurt Koffka. Principles of Gestalt Psychology. Routledge.
- [15] Michael A Lepori, Alexa R Tartaglini, Wai K Vong, Thomas Serre, Brenden M Lake, and Ellie Pavlick. Beyond the doors of perception: Vision transformers represent relations between objects. Advances in Neural Information Processing Systems, 37:131503–131544, 2024.
- [16] Yihao Li, Saeed Salehi, Lyle Ungar, and Konrad P Kording. Does object binding naturally emerge in large pretrained vision transformers? arXiv preprint arXiv:2510.24709, 2025.
- [17] Drew Linsley, Junkyung Kim, Vijay Veerabadran, Charles Windolf, and Thomas Serre. Learning long-range spatial dependencies with horizontal gated recurrent units. Advances in Neural Information Processing Systems, 31, 2018.
- [18] Francesco Locatello, Dirk Weissenborn, Thomas Unterthiner, Aravindh Mahendran, Georg Heigold, Jakob Uszkoreit, Alexey Dosovitskiy, and Thomas Kipf. Object-centric learning with slot attention. Advances in Neural Information Processing Systems, 33:11525–11538, 2020.
- [19] Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. DINOv2: Learning robust visual features without supervision. Transactions on Machine Learning Research, 2024.
- [20] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021.
- [21] Yi Tay, Mostafa Dehghani, Samira Abnar, Yikang Shen, Dara Bahri, Philip Pham, Jinfeng Rao, Liu Yang, Sebastian Ruder, and Donald Metzler. Long Range Arena: A benchmark for efficient transformers. In International Conference on Learning Representations, 2020.
- [22] Anne Treisman. The binding problem. Current Opinion in Neurobiology, 6(2):171–178, 1996.
- [23] Anne Treisman and Hilary Schmidt. Illusory conjunctions in the perception of objects. Cognitive Psychology, 14(1):107–141, 1982.
- [24] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in Neural Information Processing Systems, 30, 2017.
- [25] Christoph von der Malsburg. The what and why of binding: the modeler's perspective. Neuron, 24(1):95–104, 1999.
- [26] Kevin Ro Wang, Alexandre Variengien, Arthur Conmy, Buck Shlegeris, and Jacob Steinhardt. Interpretability in the wild: a circuit for indirect object identification in GPT-2 small. In The Eleventh International Conference on Learning Representations, 2022.
- [27] Max Wertheimer. Untersuchungen zur Lehre von der Gestalt. Springer, 1923.