Object-centric slot diffusion. Advances in Neural Information Processing Systems.
2 Pith papers cite this work. Polarity classification is still indexing.
Fields: cs.CV. Year: 2026. Verdicts: unverdicted.
2 representative citing papers.
Citing papers
-
What-Where Transformer: A Slot-Centric Visual Backbone for Concurrent Representation and Localization
The What-Where Transformer achieves explicit what-where separation in a ViT-style backbone via concurrent token and attention-map streams, yielding emergent object discovery from attention maps and better weakly-supervised localization.
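The "concurrent token and attention-map streams" idea can be illustrated with a minimal single-head self-attention step that exposes both outputs: the updated tokens (the "what" stream) and the attention maps (the "where" stream, from which localization can be read off). This is a NumPy sketch under assumed names, not the paper's actual architecture:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def what_where_attention(tokens, Wq, Wk, Wv):
    """One self-attention step returning both streams:
    updated tokens ("what") and the attention maps ("where")."""
    q, k, v = tokens @ Wq, tokens @ Wk, tokens @ Wv
    where = softmax(q @ k.T / np.sqrt(q.shape[-1]))  # (N, N) attention maps
    what = where @ v                                 # (N, d) token updates
    return what, where

rng = np.random.default_rng(0)
N, d = 16, 32                     # 16 patch tokens, 32-dim embeddings
tokens = rng.normal(size=(N, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) * d**-0.5 for _ in range(3))
what, where = what_where_attention(tokens, Wq, Wk, Wv)
print(what.shape, where.shape)  # → (16, 32) (16, 16)
```

In a full backbone the "where" maps would be carried forward as their own stream rather than discarded after each layer, which is what makes emergent object discovery readable from them.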
-
Deep sprite-based image models: An analysis
A deep sprite-based image decomposition method matches SOTA unsupervised class-aware segmentation on CLEVR, scales linearly with objects, explicitly identifies categories, and fully models images interpretably.
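The claimed linear scaling with the number of objects is characteristic of sprite compositing: each object contributes one layer, blended over the background with a per-pixel alpha mask. A minimal sketch of that generative side (NumPy; names hypothetical, not the paper's model):

```python
import numpy as np

def composite(background, sprites, masks):
    """Alpha-blend K sprites over a background, back to front.
    Cost is one blend per sprite, i.e. linear in the object count."""
    img = background.copy()
    for sprite, alpha in zip(sprites, masks):
        img = alpha[..., None] * sprite + (1 - alpha[..., None]) * img
    return img

rng = np.random.default_rng(0)
bg = np.zeros((8, 8, 3))                  # black background
sprites = rng.uniform(size=(2, 8, 8, 3))  # 2 object appearance layers
masks = rng.uniform(size=(2, 8, 8))       # soft per-object masks
img = composite(bg, sprites, masks)
print(img.shape)  # → (8, 8, 3)
```

Because every pixel of the output is an explicit convex combination of sprite layers, the model is fully interpretable: each object's category and spatial extent are read directly from its layer and mask.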