What-Where Transformer: A Slot-Centric Visual Backbone for Concurrent Representation and Localization
Recognition: 2 Lean theorem links
Pith reviewed 2026-05-13 06:55 UTC · model grok-4.3
The pith
What-Where Transformer separates object appearance from location in concurrent streams to produce emergent multi-object discovery from raw attention maps.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By processing tokens as what-representations and attention maps as where-representations in concurrent feed-forward modules of a multi-stream slot-based architecture, the What-Where Transformer achieves what-where separation throughout an attentive backbone. The final-layer tokens and attention maps are reused directly for downstream tasks and exposed to task-loss gradients, enabling effective localization learning. Even when trained only with single-label classification supervision on ImageNet, the model exhibits emergent multiple object discovery directly from its raw attention maps without token clustering or other post-processing.
What carries the argument
A multi-stream, slot-based architecture that processes tokens (what-representations) and attention maps (where-representations) in concurrent feed-forward modules.
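To make the mechanism concrete, below is a minimal sketch of what one such concurrent layer could look like, assuming PyTorch; the class name, tensor shapes, and update rules are illustrative guesses rather than the authors' implementation, since the paper's exact equations are not reproduced on this page.

```python
# A minimal sketch, assuming PyTorch and one plausible reading of the design;
# WhatWhereBlock, the shapes, and the update rules are assumptions, not the
# authors' released code.
import torch
import torch.nn as nn


class WhatWhereBlock(nn.Module):
    """Hypothetical concurrent layer: slot tokens carry 'what', slot-to-patch
    attention maps carry 'where', and each stream has its own feed-forward
    update instead of being folded into a single token stream."""

    def __init__(self, dim: int, num_patches: int):
        super().__init__()
        self.to_q = nn.Linear(dim, dim)   # queries from slot tokens
        self.to_k = nn.Linear(dim, dim)   # keys from patch tokens
        self.to_v = nn.Linear(dim, dim)   # values from patch tokens
        self.what_ffn = nn.Sequential(    # feed-forward on the 'what' stream
            nn.LayerNorm(dim), nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.where_ffn = nn.Sequential(   # feed-forward on the 'where' stream
            nn.LayerNorm(num_patches), nn.Linear(num_patches, num_patches))

    def forward(self, slots, patches, where):
        # slots: (B, S, D), patches: (B, N, D), where: (B, S, N)
        q, k, v = self.to_q(slots), self.to_k(patches), self.to_v(patches)
        attn = torch.softmax(q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5, dim=-1)
        slots = slots + self.what_ffn(attn @ v)   # 'what' update from attended values
        where = where + self.where_ffn(attn)      # 'where' update on the maps themselves
        return slots, where


# Both final-layer outputs would be reused downstream, so task-loss gradients
# reach the maps directly rather than only the tokens.
slots, where = torch.randn(2, 8, 64), torch.zeros(2, 8, 196)
patches = torch.randn(2, 196, 64)
slots, where = WhatWhereBlock(dim=64, num_patches=196)(slots, patches, where)
```

The point of the sketch is the separation itself: slot tokens only ever pass through the what path, while the attention maps accumulate through their own where path, so the final-layer maps can be handed directly to a task loss.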
If this is right
- Achieves higher performance than ViT-based methods on zero-shot object discovery.
- Outperforms prior approaches on weakly supervised semantic segmentation.
- Transfers to multiple localization setups with only minimal architectural changes.
- Produces multiple object discovery directly from raw attention maps without clustering or other post-processing.
Where Pith is reading between the lines
- The same separation could simplify end-to-end pipelines for dense prediction tasks by removing the need for separate localization heads or clustering stages.
- Because the maps are already exposed to gradients, the model might support fine-grained localization even when only coarse labels are available during training.
- The concurrent what-where streams might be combined with existing object-centric models to improve slot binding without changing the supervision regime.
Load-bearing premise
That treating tokens and attention maps as separate what and where streams in concurrent modules will keep the two kinds of information from entangling and will allow localization to be learned from task losses alone.
What would settle it
Train the model on standard single-label ImageNet classification and check whether the raw final-layer attention maps contain spatially distinct activations for multiple separate objects in the same image; failure to observe such activations would falsify the emergence claim.
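A rough sketch of how that check could be scored, assuming access to the raw final-layer slot-to-patch attention reshaped to a spatial grid; the function name, threshold, and connected-component criterion are illustrative choices, not the paper's evaluation protocol.

```python
# A minimal sketch of scoring the falsification test above; attn_maps is
# assumed to be raw final-layer attention reshaped to (num_slots, H, W).
import numpy as np
from scipy import ndimage


def count_active_regions(attn_maps: np.ndarray, rel_thresh: float = 0.6) -> int:
    """Count spatially separated high-activation regions across all slots,
    using only a relative threshold and connected components -- no clustering."""
    total = 0
    for m in attn_maps:
        mask = m > rel_thresh * m.max()
        _, n_components = ndimage.label(mask)   # connected regions in this slot's map
        total += n_components
    return total


# The emergence claim predicts that multi-object images yield several distinct
# regions; a single blob per image would falsify it. The array below is only a
# stand-in for real attention maps.
fake_maps = np.random.rand(4, 14, 14)
print(count_active_regions(fake_maps))
```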
Original abstract
Many image understanding tasks involve identifying what is present and where it appears. However, tasks that address where, such as object discovery, detection, and segmentation, are often considerably more complex than image classification, which primarily focuses on what. One possible reason is that classification-oriented backbones tend to emphasize semantic information about what, while implicitly entangling or suppressing information about where. In this work, we focus on an inductive bias termed what-where separation, which encourages models to represent object appearance and spatial location in a decomposed manner. To incorporate this bias throughout an attentive backbone in the style of Vision Transformer (ViT), we propose the What-Where Transformer (WWT). Our method introduces two key novel designs: (1) it treats tokens as representations of what and attention maps as representations of where, and processes them in concurrent feed-forward modules via a multi-stream, slot-based architecture; (2) it reuses both the final-layer tokens and attention maps for downstream tasks, and directly exposes them to gradients derived from task losses, thereby facilitating more effective and explicit learning of localization. We demonstrate that even under standard single-label classification-based supervision on ImageNet, WWT exhibits emergent multiple object discovery directly from raw attention maps, rather than via additional postprocessing such as token clustering. Furthermore, WWT achieves superior performance compared to ViT-based methods on zero-shot object discovery and weakly supervised semantic segmentation, and it is transferable to various localization setups with minimal modifications. Code will be published after acceptance.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes the What-Where Transformer (WWT), a slot-centric Vision Transformer variant that enforces an inductive bias for what-where separation. Tokens are treated as what-representations and attention maps as where-representations; these are processed in concurrent multi-stream feed-forward modules. The final-layer tokens and attention maps are reused for downstream tasks and directly optimized by task losses. The central empirical claim is that, under standard single-label ImageNet classification supervision, WWT produces emergent multiple-object discovery directly from raw final-layer attention maps without token clustering or other post-processing, while also improving zero-shot object discovery and weakly-supervised semantic segmentation relative to ViT baselines and transferring to other localization tasks.
Significance. If the empirical claims are substantiated with proper controls, the work would be moderately significant for vision backbones: it offers a concrete architectural mechanism to reduce entanglement between semantic and spatial information without requiring explicit localization supervision or auxiliary losses. The reported transferability to multiple localization setups and the avoidance of post-processing steps would be useful if the separation is shown to be robust rather than an artifact of the slot design.
Major comments (3)
- [Abstract, §4] Abstract and §4 (experimental results): The claim of 'emergent multiple object discovery directly from raw attention maps' without post-processing is load-bearing for the novelty argument, yet the manuscript does not specify the exact extraction procedure (e.g., whether per-head selection, averaging, or simple thresholding is applied before visualization or metric computation). If any such step is used, it must be shown to be strictly weaker than the token-clustering baselines it is contrasted against; otherwise the separation advantage is not cleanly demonstrated.
- [§3.2] §3.2 (architecture) and ablation studies: The concurrent what/where feed-forward modules are presented as the source of clean decomposition, but no direct ablation compares WWT against a standard ViT with identical slot count and attention-map reuse under the same ImageNet supervision. Without this control, it remains unclear whether the observed localization gains arise from the what-where split or simply from the multi-stream slot architecture.
- [Table 2, Table 3] Table 2 (zero-shot discovery) and Table 3 (weakly-supervised segmentation): Performance numbers are reported without standard deviations across multiple runs or seeds, and the baselines appear to use the same ViT backbone without the concurrent modules. This makes it difficult to assess whether the reported gains are statistically reliable or attributable to the proposed separation rather than hyper-parameter differences.
Minor comments (2)
- [§3] Notation for the slot streams and the reuse of attention maps for gradient flow should be introduced with a single diagram and consistent symbols in §3; current prose descriptions are occasionally ambiguous about which tensors receive task gradients.
- The manuscript states that code will be released after acceptance; adding a reproducibility checklist (data splits, exact hyper-parameters, and the precise attention-map extraction code) would strengthen the empirical claims.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address each major comment below, providing clarifications and committing to revisions where appropriate to strengthen the manuscript.
Point-by-point responses
Referee: [Abstract, §4] Abstract and §4 (experimental results): The claim of 'emergent multiple object discovery directly from raw attention maps' without post-processing is load-bearing for the novelty argument, yet the manuscript does not specify the exact extraction procedure (e.g., whether per-head selection, averaging, or simple thresholding is applied before visualization or metric computation). If any such step is used, it must be shown to be strictly weaker than the token-clustering baselines it is contrasted against; otherwise the separation advantage is not cleanly demonstrated.
Authors: We will revise the manuscript to explicitly detail the extraction procedure. The final-layer attention maps are used in their raw form for both visualization and quantitative metrics (e.g., object discovery evaluation), with only standard multi-head averaging applied, as is conventional in ViT attention analysis; no per-head selection, thresholding, clustering, or other post-processing steps are involved. This procedure is indeed minimal and weaker than the token-clustering baselines we compare against, directly supporting the emergent separation claim. An updated description and examples will be added to §4 and the appendix.
Revision: yes
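For reference, the "standard multi-head averaging" described in the response amounts to something like the following sketch; the tensor layout is an assumption, and this is not the authors' released code.

```python
# Sketch of averaging raw attention over heads only; nothing beyond the mean
# is applied (no selection, thresholding, or clustering). Layout assumed.
import torch


def average_heads(attn: torch.Tensor) -> torch.Tensor:
    """attn: (batch, heads, num_slots, num_patches) raw final-layer attention.
    Returns (batch, num_slots, num_patches), averaged over heads only."""
    return attn.mean(dim=1)


# e.g. reshape the averaged maps onto a 14x14 patch grid for visualization
maps = average_heads(torch.rand(1, 6, 8, 196)).reshape(1, 8, 14, 14)
```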
Referee: [§3.2] §3.2 (architecture) and ablation studies: The concurrent what/where feed-forward modules are presented as the source of clean decomposition, but no direct ablation compares WWT against a standard ViT with identical slot count and attention-map reuse under the same ImageNet supervision. Without this control, it remains unclear whether the observed localization gains arise from the what-where split or simply from the multi-stream slot architecture.
Authors: This is a fair point on isolating the contribution of the concurrent modules. A standard ViT lacks native slot-centric processing and direct attention-map reuse, so a perfect 1:1 control is not straightforward. However, we will add a new ablation in the revised §3.2 and experiments comparing WWT to a merged single-stream slot variant (same slot count, attention reuse, and supervision) to isolate the effect of the what/where split. This will clarify whether the gains stem from the concurrent design or simply from slots alone.
Revision: partial
Referee: [Table 2, Table 3] Table 2 (zero-shot discovery) and Table 3 (weakly-supervised segmentation): Performance numbers are reported without standard deviations across multiple runs or seeds, and the baselines appear to use the same ViT backbone without the concurrent modules. This makes it difficult to assess whether the reported gains are statistically reliable or attributable to the proposed separation rather than hyper-parameter differences.
Authors: We agree that standard deviations would enhance statistical reliability. Due to compute limits in the original runs, we reported single-run results, but we will re-execute the key experiments across 3 seeds and update Tables 2 and 3 with means ± std. Baselines were reimplemented under matched hyperparameters and training protocols where feasible; we will add explicit notes on any minor differences in the text and appendix to rule out confounds.
Revision: yes
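The promised reporting reduces to a mean and sample standard deviation over per-seed scores, roughly as sketched below; the array values are placeholders, not numbers from the paper.

```python
# Minimal sketch of mean ± std reporting over seeds; the values are
# placeholders to be filled with the per-seed metric, not reported results.
import numpy as np

seed_scores = np.array([0.0, 0.0, 0.0])
mean, std = seed_scores.mean(), seed_scores.std(ddof=1)   # sample std over 3 seeds
print(f"{mean:.2f} ± {std:.2f}")
```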
Circularity Check
No significant circularity; claims rest on explicit architectural inductive bias rather than self-referential fits or citations.
Full rationale
The paper defines WWT via two explicit design choices—treating tokens as what-representations and attention maps as where-representations in concurrent slot-based feed-forward modules, plus direct exposure of both to task-loss gradients—without any equation that reduces the claimed what-where separation or emergent discovery to a quantity fitted from the same data or imported via self-citation. The abstract presents the multiple-object discovery result as an empirical outcome under standard ImageNet supervision, not as a prediction derived from the architecture's own fitted parameters. No self-citation load-bearing step, uniqueness theorem, or ansatz smuggling appears in the derivation chain; the central separation is an imposed inductive bias whose effectiveness is evaluated externally.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · reality_from_one_distinction (tagged: unclear)
  Unclear: relation between the paper passage and the cited Recognition theorem.
  Paper passage: "treats tokens as representations of what and attention maps as representations of where, and processes them in concurrent feed-forward modules via a multi-stream, slot-based architecture"
- IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking (tagged: unclear)
  Unclear: relation between the paper passage and the cited Recognition theorem.
  Paper passage: "WWT exhibits emergent multiple object discovery directly from raw attention maps"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.