pith. machine review for the scientific record.

arxiv: 2504.13181 · v2 · submitted 2025-04-17 · 💻 cs.CV

Perception Encoder: The best visual embeddings are not at the output of the network

Andrea Madotto, Chen Wei, Christoph Feichtenhofer, Daniel Bolya, Daniel Li, Hanoona Rasheed, Hu Xu, Jang Hyun Cho, Jathushan Rajasegaran, Jiale Zhi, Junke Wang, Marco Monteiro, Nikhila Ravi, Peize Sun, Piotr Dollár, Po-Yao Huang, Shiyu Dong, Tengyu Ma

Pith reviewed 2026-05-13 22:16 UTC · model grok-4.3

classification 💻 cs.CV
keywords vision encoder, contrastive learning, vision-language models, intermediate embeddings, multimodal alignment, zero-shot classification, dense prediction, video understanding

The pith

The best visual embeddings for images and videos come from intermediate layers of a contrastively trained network rather than its final output.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that scaling a tuned contrastive vision-language pretraining recipe first on images and then on videos produces strong general-purpose embeddings that work across classification, retrieval, question answering, and spatial tasks. The key observation is that these embeddings sit inside the network at intermediate layers instead of appearing at the final output. Two alignment procedures then make the hidden features usable: one for language modeling and one for dense spatial prediction. A sympathetic reader would care because the result suggests a single pretraining approach can replace the usual collection of task-specific objectives in vision.

Core claim

Perception Encoder models are trained solely with contrastive vision-language learning. After scaling the image recipe and refining it with a video data engine, the strongest embeddings for downstream tasks reside in intermediate layers rather than the network output. Language alignment extracts features suited to multimodal language modeling while spatial alignment adapts them for dense prediction, yielding state-of-the-art numbers on zero-shot image and video classification, document and video QA, and tasks such as detection and depth estimation.
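
As a rough illustration of what "contrastive vision-language learning" refers to, the sketch below computes a symmetric InfoNCE-style loss (the form popularized by CLIP) over a small batch of paired image and text embeddings. It is a generic textbook formulation, not the paper's exact objective. The function names, the temperature of 0.07, and the toy vectors are all assumptions made for illustration.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cross_entropy_diag(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the row max for numerical stability
        log_sum = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum - row[i]
    return total / len(logits)

def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric InfoNCE: average of image-to-text and text-to-image terms.

    temperature=0.07 is an illustrative default, not a value from the paper.
    """
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    # logits[i][j] = scaled cosine similarity between image i and text j
    logits = [[sum(a * b for a, b in zip(img, txt)) / temperature for txt in txts]
              for img in imgs]
    logits_t = [list(col) for col in zip(*logits)]  # text-to-image view
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(logits_t))

# Toy batch (invented vectors): pair i of images matches pair i of texts.
images = [[1.0, 0.0], [0.0, 1.0]]
matched = contrastive_loss(images, [[1.0, 0.1], [0.1, 1.0]])
shuffled = contrastive_loss(images, [[0.1, 1.0], [1.0, 0.1]])
```

With correctly paired captions the loss is close to zero, and swapping the captions makes it much larger. That gap is the only training signal a purely contrastive recipe uses.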

What carries the argument

Perception Encoder (PE) family that extracts and aligns intermediate-layer embeddings using language alignment for multimodal tasks and spatial alignment for dense prediction.
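
To make "intermediate-layer embeddings" concrete, here is a minimal sketch of reading features from every depth of a layer stack instead of only the last one. The three lambda "layers" are a hypothetical stand-in for an encoder, and the clamping last layer is contrived purely so that the two readouts differ. It says nothing about why PE behaves as the paper reports.

```python
def forward_with_taps(layers, x):
    """Run x through each layer in order and keep every intermediate output."""
    taps = []
    for layer in layers:
        x = layer(x)
        taps.append(x)
    return taps  # taps[-1] is the usual network output

# Hypothetical three-layer stand-in for an encoder (not the PE architecture).
toy_layers = [
    lambda v: [2.0 * a for a in v],
    lambda v: [a + 1.0 for a in v],
    lambda v: [max(a, 4.0) for a in v],  # contrived late layer that collapses inputs
]

taps = forward_with_taps(toy_layers, [0.5, 1.0])
intermediate = taps[1]  # features read from the middle of the stack
final = taps[-1]        # features read from the output
```

In this toy case the middle tap still distinguishes the two input coordinates while the output does not. In a real framework the same readout is usually done with forward hooks or by returning all hidden states.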

If this is right

  • A single contrastive pretraining recipe produces competitive zero-shot image and video classification and retrieval results.
  • The same models reach leading numbers on document, image, and video question answering when paired with an 8B language model.
  • Spatial alignment of intermediate features sets a new state-of-the-art on COCO detection at 66.0 box mAP and supports strong tracking and depth estimation.
  • One pretraining run covers image, video, and spatial understanding without separate objectives for each.
  • Releasing the models and a new annotated video dataset enables direct reuse and further scaling experiments.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Architectures could be redesigned to expose or strengthen intermediate representations rather than optimizing only the final layer.
  • Contrastive objectives may naturally create a feature hierarchy in which general-purpose information appears before task-specific specialization.
  • Scaling studies for vision models might need to measure performance at multiple depths instead of only the output.
  • The same intermediate-layer advantage could appear in other modalities if similar contrastive training is applied.

Load-bearing premise

That the intermediate embeddings remain superior to final-layer ones after alignment without any additional task-specific tuning or data selection.

What would settle it

A controlled comparison in which final-output embeddings, after identical language and spatial alignment, match or exceed the reported performance of the intermediate-layer versions on the same zero-shot classification, QA, and detection benchmarks.
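
The comparison described above can be sketched as a small bookkeeping routine: given per-layer scores measured after the same alignment procedure, check for each benchmark whether the final layer matches or beats the best intermediate layer. The function name, the `tolerance` parameter, the benchmark labels, and every number below are placeholders invented for illustration. None of them are results from the paper.

```python
def final_layer_verdicts(scores, final_layer, tolerance=0.0):
    """For each benchmark, report whether the final layer matches or exceeds
    the best intermediate layer.

    scores maps benchmark name -> {layer index: metric}, higher is better,
    with every entry assumed to come from the same alignment procedure.
    """
    verdicts = {}
    for bench, per_layer in scores.items():
        best_intermediate = max(
            value for layer, value in per_layer.items() if layer != final_layer
        )
        verdicts[bench] = per_layer[final_layer] + tolerance >= best_intermediate
    return verdicts

# Placeholder scores (invented): layer 30 plays the role of the network output.
placeholder_scores = {
    "benchmark_a": {10: 0.61, 20: 0.70, 30: 0.66},
    "benchmark_b": {10: 0.40, 20: 0.45, 30: 0.47},
}
verdicts = final_layer_verdicts(placeholder_scores, final_layer=30)
```

A result of `True` on every benchmark would count against the load-bearing premise. Mixed verdicts, as in this made-up example, would call for a closer look at which task families favor which depth.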

Original abstract

We introduce Perception Encoder (PE), a state-of-the-art vision encoder for image and video understanding trained via simple vision-language learning. Traditionally, vision encoders have relied on a variety of pretraining objectives, each tailored to specific downstream tasks such as classification, captioning, or localization. Surprisingly, after scaling our carefully tuned image pretraining recipe and refining with our robust video data engine, we find that contrastive vision-language training alone can produce strong, general embeddings for all of these downstream tasks. There is only one caveat: these embeddings are hidden within the intermediate layers of the network. To draw them out, we introduce two alignment methods: language alignment for multimodal language modeling, and spatial alignment for dense prediction. Together, our PE family of models achieves best-in-class results on a wide variety of tasks, including (1) zero-shot image and video classification and retrieval, simultaneously obtaining 86.6 average zero-shot ImageNet robustness and 76.9 zero-shot Kinetics-400 video classification; (2) document, image, and video Q&A, enabling 94.6 DocVQA, 80.9 InfographicVQA, and 82.7 PerceptionTest with an 8B LLM; and (3) spatial tasks such as detection, tracking, and depth estimation, setting a new COCO state-of-the-art of 66.0 box mAP. To foster further research, we release our models, code, and novel dataset of synthetically and human-annotated videos: https://github.com/facebookresearch/perception_models

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces the Perception Encoder (PE), a vision encoder trained solely via contrastive vision-language learning on scaled image and video data. It claims that the strongest general-purpose embeddings for downstream tasks (zero-shot classification/retrieval, VQA, detection, tracking, depth) reside in intermediate layers rather than the final output; two post-hoc alignment procedures (language alignment for MLLMs, spatial alignment for dense tasks) are introduced to extract them, yielding SOTA numbers such as 86.6 average zero-shot ImageNet robustness, 76.9 zero-shot Kinetics-400, 94.6 DocVQA, 80.9 InfographicVQA, 82.7 PerceptionTest (with 8B LLM), and 66.0 COCO box mAP. Models, code, and a new synthetically/human-annotated video dataset are released.

Significance. If the central empirical claim is substantiated, the work shows that a single, carefully tuned contrastive vision-language recipe can produce versatile embeddings usable across classification, multimodal, and dense tasks after simple alignment steps, reducing the need for task-specific pretraining objectives. The breadth of reported results and public release of models plus data would make this a useful reference point for the community.

major comments (3)
  1. [Abstract, §4] Abstract and §4 (Experiments): The reported SOTA numbers (e.g., 66.0 COCO mAP, 86.6 ImageNet robustness) are presented without ablations, baseline comparisons, error bars, or details on the video data engine construction. This leaves the generality claim resting on unshown controls for data curation, layer selection, and alignment hyperparameters.
  2. [§3] §3 (Alignment methods): It is unclear whether the language and spatial alignment procedures use a single fixed recipe or involve per-task data filtering, learning-rate schedules, or validation-based layer selection. If the best intermediate layer is chosen via downstream performance, the 'single general pretraining recipe' claim is undermined.
  3. [§4.2–4.3] §4.2–4.3 (Downstream results): No comparison is shown to the final-layer embeddings of the same model or to standard contrastive baselines (e.g., CLIP, SigLIP) under identical alignment procedures, making it impossible to isolate the contribution of the intermediate-layer observation versus the alignment steps themselves.
minor comments (2)
  1. [§3] Notation for the two alignment losses is introduced without explicit equations; adding them would clarify how they differ from the original contrastive objective.
  2. [§4] Figure captions and tables lack details on the exact layer indices used for each task; a single table summarizing best-layer indices across benchmarks would improve reproducibility.
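
Major comment 2 and minor comment 2 both turn on how the layer is chosen. The sketch below contrasts two hypothetical protocols: picking the best layer separately for each task, and picking one shared layer by mean validation score. Task names, layer indices, and scores are invented placeholders, and neither function is claimed to be the paper's actual selection procedure.

```python
def best_layer_per_task(val_scores):
    """Per-task selection: each task gets whichever layer scores highest."""
    return {task: max(per_layer, key=per_layer.get)
            for task, per_layer in val_scores.items()}

def best_shared_layer(val_scores):
    """Shared selection: one layer for all tasks, chosen by mean score."""
    common = set.intersection(*(set(p) for p in val_scores.values()))
    n_tasks = len(val_scores)
    return max(
        sorted(common),  # sorted so ties resolve to the shallowest layer
        key=lambda layer: sum(p[layer] for p in val_scores.values()) / n_tasks,
    )

# Placeholder validation scores (invented), higher is better.
val_scores = {
    "task_a": {12: 0.50, 24: 0.62, 36: 0.58},
    "task_b": {12: 0.30, 24: 0.41, 36: 0.44},
}
per_task = best_layer_per_task(val_scores)
shared = best_shared_layer(val_scores)
```

The dictionary returned by `best_layer_per_task` is the kind of best-layer summary table the referee asks for. If its entries differ from the shared choice, as they do in this made-up example, the gap between the two protocols is what a reader would want reported.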

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major point below and have revised the manuscript to add requested ablations, clarifications, and comparisons.

Point-by-point responses
  1. Referee: [Abstract, §4] Abstract and §4 (Experiments): The reported SOTA numbers (e.g., 66.0 COCO mAP, 86.6 ImageNet robustness) are presented without ablations, baseline comparisons, error bars, or details on the video data engine construction. This leaves the generality claim resting on unshown controls for data curation, layer selection, and alignment hyperparameters.

    Authors: We agree that additional controls strengthen the claims. The revised §4 now includes ablations on data curation, layer selection, and alignment hyperparameters, plus expanded details on the video data engine. Error bars are reported for key results; where omitted, this is due to compute limits and is noted. Baseline comparisons have been made more explicit.
    Revision: yes

  2. Referee: [§3] §3 (Alignment methods): It is unclear whether the language and spatial alignment procedures use a single fixed recipe or involve per-task data filtering, learning-rate schedules, or validation-based layer selection. If the best intermediate layer is chosen via downstream performance, the 'single general pretraining recipe' claim is undermined.

    Authors: The procedures use a single fixed recipe with shared hyperparameters. Layer selection follows a general validation protocol (not per-task optimization or custom filtering/schedules). We have revised §3 to state this explicitly, confirming the pretraining remains general.
    Revision: partial

  3. Referee: [§4.2–4.3] §4.2–4.3 (Downstream results): No comparison is shown to the final-layer embeddings of the same model or to standard contrastive baselines (e.g., CLIP, SigLIP) under identical alignment procedures, making it impossible to isolate the contribution of the intermediate-layer observation versus the alignment steps themselves.

    Authors: We have added comparisons to final-layer embeddings under identical alignment in revised §4.2–4.3. We also report CLIP and SigLIP baselines aligned identically, isolating the intermediate-layer contribution.
    Revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain

Full rationale

The paper's central claims rest on empirical scaling results from standard contrastive vision-language pretraining, followed by two explicit alignment procedures (language and spatial) whose outputs are evaluated on downstream benchmarks. No equations, fitted parameters, or self-citations are presented as load-bearing derivations that reduce reported performance to inputs by construction. The intermediate-layer superiority is stated as an observed outcome after training, not a definitional or self-referential result. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the existence of a carefully tuned image pretraining recipe and a robust video data engine whose construction details are not provided in the abstract; no explicit free parameters or invented entities are named.

axioms (1)
  • domain assumption Contrastive vision-language loss produces transferable embeddings when scaled
    Invoked in the abstract as the basis for the single-recipe claim

Forward citations

Cited by 22 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. What Cohort INRs Encode and Where to Freeze Them

    cs.LG 2026-05 unverdicted novelty 7.0

    Optimal INR freeze depth matches highest weight stable rank layer; SAEs reveal SIREN atoms are localized while FFMLP atoms trace cohort contours with causal impact on PSNR.

  2. VAnim: Rendering-Aware Sparse State Modeling for Structure-Preserving Vector Animation

    cs.CV 2026-05 unverdicted novelty 7.0

    VAnim creates open-domain text-to-SVG animations via sparse state updates on a persistent DOM tree, identification-first planning, and rendering-aware RL with a new 134k-example benchmark.

  3. Large-scale Codec Avatars: The Unreasonable Effectiveness of Large-scale Avatar Pretraining

    cs.CV 2026-04 unverdicted novelty 7.0

    Pretraining on 1M wild videos followed by post-training on curated data yields high-fidelity feedforward 3D avatars that generalize across identities, clothing, and lighting with emergent relightability and loose-garm...

  4. Qwen-Image-VAE-2.0 Technical Report

    cs.CV 2026-05 unverdicted novelty 6.0

    Qwen-Image-VAE-2.0 achieves state-of-the-art high-compression image reconstruction and superior diffusability for diffusion models, with a new text-rich document benchmark.

  5. Cross-Attentive Multiview Fusion of Vision-Language Embeddings

    cs.CV 2026-04 unverdicted novelty 6.0

    CAMFusion fuses multiview 2D vision-language embeddings via cross-attention and multiview consistency self-supervision to produce better 3D semantic and instance representations, outperforming averaging and reaching S...

  6. Why Your Tokenizer Fails in Information Fusion: A Timing-Aware Pre-Quantization Fusion for Video-Enhanced Audio Tokenization

    eess.AS 2026-04 unverdicted novelty 6.0

    A timing-aware pre-quantization fusion approach integrates visual cues into audio tokenizers along the temporal axis, maintaining reconstruction quality while outperforming audio-only and prior multimodal baselines on...

  7. TIPSv2: Advancing Vision-Language Pretraining with Enhanced Patch-Text Alignment

    cs.CV 2026-04 unverdicted novelty 6.0

    TIPSv2 improves dense patch-text alignment in vision-language pretraining through distillation and iBOT++ modifications, yielding models on par with or better than recent baselines on 9 tasks across 20 datasets.

  8. Grounded World Model for Semantically Generalizable Planning

    cs.RO 2026-04 conditional novelty 6.0

    A vision-language-aligned world model turns visuomotor MPC into a language-following planner that reaches 87% success on 288 unseen semantic tasks where standard VLAs drop to 22%.

  9. UniversalVTG: A Universal and Lightweight Foundation Model for Video Temporal Grounding

    cs.CV 2026-04 unverdicted novelty 6.0

    UniversalVTG is a lightweight foundation model for video temporal grounding that achieves state-of-the-art results across five benchmarks while being over 100 times smaller than recent MLLM-based methods.

  10. Q-Zoom: Query-Aware Adaptive Perception for Efficient Multimodal Large Language Models

    cs.CV 2026-04 unverdicted novelty 6.0

    Q-Zoom achieves up to 4.39x inference speedup in high-resolution MLLM scenarios via query-aware gating and region localization, matching or exceeding baseline accuracy on document and high-res benchmarks.

  11. Telescope: Learnable Hyperbolic Foveation for Ultra-Long-Range Object Detection

    cs.CV 2026-04 unverdicted novelty 6.0

    Telescope uses learnable hyperbolic foveation to deliver a 76% relative mAP gain (0.185 to 0.326) for objects beyond 250 meters while keeping overhead low.

  12. VideoStir: Understanding Long Videos via Spatio-Temporally Structured and Intent-Aware RAG

    cs.CV 2026-04 unverdicted novelty 6.0

    VideoStir introduces a spatio-temporal graph-based structure and intent-aware retrieval for long-video RAG, achieving competitive performance with SOTA methods via a new IR-600K dataset.

  13. Learning Robust Visual Features in Computed Tomography Enables Efficient Transfer Learning for Clinical Tasks

    cs.CV 2026-04 conditional novelty 6.0

    VoxelFM learns robust 3D CT visual features via DINO self-distillation that transfer effectively to seven clinical task categories using frozen backbones and lightweight heads, outperforming prior CT foundation models...

  14. V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning

    cs.AI 2025-06 unverdicted novelty 6.0

    V-JEPA 2 pre-trained on massive unlabeled video achieves strong results on motion understanding and action anticipation, SOTA video QA at 8B scale, and enables zero-shot robotic planning on Franka arms using only 62 h...

  15. SmolVLA: A Vision-Language-Action Model for Affordable and Efficient Robotics

    cs.LG 2025-06 unverdicted novelty 6.0

    SmolVLA is a small efficient VLA model that achieves performance comparable to 10x larger models while training on one GPU and deploying on consumer hardware via community data and chunked asynchronous action prediction.

  16. Enabling High Error Tolerance in Satellite Video Transmissions by Generative Semantic Communication

    eess.SP 2026-04 unverdicted novelty 5.0

    A generative semantic communication method for satellite video achieves 2.5 dB higher PSNR than conventional semantic comms at 45% error rate and remains functional above 80% error by combining semantic encoding with ...

  17. Sapiens2

    cs.CV 2026-04 unverdicted novelty 5.0

    Sapiens2 improves pretraining, data scale, and architecture over its predecessor to set new state-of-the-art results on human pose estimation, body-part segmentation, normal estimation, and new tasks like pointmap and...

  18. Concrete Jungle: Towards Concreteness Paved Contrastive Negative Mining for Compositional Understanding

    cs.LG 2026-04 unverdicted novelty 5.0

    Using lexical concreteness to guide contrastive negative mining and a new margin-based Cement loss, the Slipform framework reaches state-of-the-art on compositional benchmarks for vision-language models.

  19. Revisiting Compositionality in Dual-Encoder Vision-Language Models: The Role of Inference

    cs.CV 2026-04 unverdicted novelty 5.0

    Dual-encoder VLMs gain robust compositional generalization by learning localized alignments from frozen patch and token embeddings instead of using global similarity.

  20. Position: Life-Logging Video Streams Make the Privacy-Utility Trade-off Inevitable

    cs.CV 2026-05 unverdicted novelty 4.0

    Life-logging video streams create an inevitable privacy-utility trade-off that is a foundational challenge for always-on AI systems.

  21. CoCo-SAM3: Harnessing Concept Conflict in Open-Vocabulary Semantic Segmentation

    cs.CV 2026-04 unverdicted novelty 4.0

    CoCo-SAM3 improves SAM3 by aligning evidence from synonymous prompts for concept consistency and then running inter-class competition on a unified scale to reduce mask overlaps.

  22. Adaptive Forensic Feature Refinement via Intrinsic Importance Perception

    cs.CV 2026-04 unverdicted novelty 4.0

    I2P adaptively selects the most discriminative layers from visual foundation models for synthetic image detection and constrains task updates to low-sensitivity parameter subspaces to improve specificity without harmi...

Reference graph

Works this paper leans on

169 extracted references · 169 canonical work pages · cited by 22 Pith papers · 19 internal anchors

  1. [1]

    Nocaps: Novel object captioning at scale

    Harsh Agrawal, Karan Desai, Yufei Wang, Xinlei Chen, Rishabh Jain, Mark Johnson, Dhruv Batra, Devi Parikh, Stefan Lee, and Peter Anderson. Nocaps: Novel object captioning at scale. InICCV, 2019. 14, 15, 16, 32

  2. [2]

    Pravesh Agrawal, Szymon Antoniak, Emma Bou Hanna, Baptiste Bout, Devendra Chaplot, Jessica Chudnovsky, Diogo Costa, Baudouin De Monicault, Saurabh Garg, Theophile Gervet, Soham Ghosh, Amélie Héliou, Paul Jacob, Albert Q. Jiang, Kartik Khandelwal, Timothée Lacroix, Guillaume Lample, Diego Las Casas, Thibaut Lavril, Teven Le Scao, Andy Lo, William Marshall,...

  3. [3]

    Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond

    Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-VL: A versatile vision-language model for understanding, localization, text reading, and beyond. arXiv:2308.12966, 2023. 20

  4. [4]

    ObjectNet: A large-scale bias-controlled dataset for pushing the limits of object recognition models

    Andrei Barbu, David Mayo, Julian Alverio, William Luo, Christopher Wang, Dan Gutfreund, Josh Tenenbaum, and Boris Katz. ObjectNet: A large-scale bias-controlled dataset for pushing the limits of object recognition models. In NeurIPS, 2019. 3, 4, 6, 8, 9, 10, 30, 31, 32

  5. [5]

    Lucas Beyer, Andreas Steiner, André Susano Pinto, Alexander Kolesnikov, Xiao Wang, Daniel Salz, Maxim Neumann, Ibrahim Alabdulmohsin, Michael Tschannen, Emanuele Bugliarello, Thomas Unterthiner, Daniel Keysers, Skanda Koppula, Fangyu Liu, Adam Grycner, Alexey A. Gritsenko, Neil Houlsby, Manoj Kumar, Keran Rong, Julian Eisenschlos, Rishabh Kabra, Matthias ...

  6. [6]

    Soft-NMS–Improving object detection with one line of code

    Navaneeth Bodla, Bharat Singh, Rama Chellappa, and Larry S Davis. Soft-NMS–Improving object detection with one line of code. InICCV, 2017. 30

  7. [7]

    Window attention is bugged: how not to interpolate position embeddings

    Daniel Bolya, Chaitanya Ryali, Judy Hoffman, and Christoph Feichtenhofer. Window attention is bugged: how not to interpolate position embeddings. InICLR, 2023. 11, 29

  8. [8]

    Guillotine regulariza- tion: Why removing layers is needed to improve generalization in self-supervised learning.arXiv:2206.13378,

    Florian Bordes, Randall Balestriero, Quentin Garrido, Adrien Bardes, and Pascal Vincent. Guillotine regulariza- tion: Why removing layers is needed to improve generalization in self-supervised learning.arXiv:2206.13378,

  9. [9]

    Food-101 – Mining discriminative components with random forests

    Lukas Bossard, Matthieu Guillaumin, and Luc Van Gool. Food-101 – Mining discriminative components with random forests. In ECCV, 2014. 9

  10. [10]

    The OpenCV library.Dr

    Gary Bradski. The OpenCV library.Dr. Dobb’s Journal: Software Tools for the Professional Programmer , 2000. 22

  11. [11]

    Cascade R-CNN: Delving into high quality object detection

    Zhaowei Cai and Nuno Vasconcelos. Cascade R-CNN: Delving into high quality object detection. InCVPR,

  12. [12]

    End-to-end object detection with transformers

    Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-end object detection with transformers. InECCV, 2020. 19

  13. [13]

    Wenhao Chai, Enxin Song, Yilun Du, Chenlin Meng, Vashisht Madhavan, Omer Bar-Tal, Jeng-Neng Hwang, Saining Xie, and Christopher D. Manning. AuroraCap: Efficient, performant video detailed captioning and a new benchmark. In ICLR, 2025. 5

  14. [14]

    Hybrid task cascade for instance segmentation

    Kai Chen, Jiangmiao Pang, Jiaqi Wang, Yu Xiong, Xiaoxiao Li, Shuyang Sun, Wansen Feng, Ziwei Liu, Jianping Shi, Wanli Ouyang, Chen Change Loy, and Dahua Lin. Hybrid task cascade for instance segmentation. InCVPR,

  15. [15]

    Generative pretraining from pixels

    Mark Chen, Alec Radford, Rewon Child, Jeffrey Wu, Heewoo Jun, David Luan, and Ilya Sutskever. Generative pretraining from pixels. InICML, 2020. 20

  16. [16]

    A simple framework for contrastive learning of visual representations

    Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. InICML, 2020. 20 35

  17. [17]

    Pali: A jointly-scaled multilingual language-image model

    Xi Chen, Xiao Wang, Soravit Changpinyo, AJ Piergiovanni, Piotr Padlewski, Daniel Salz, Sebastian Goodman, Adam Grycner, Basil Mustafa, Lucas Beyer, Alexander Kolesnikov, Joan Puigcerver, Nan Ding, Keran Rong, Hassan Akbari, Gaurav Mishra, Linting Xue, Ashish Thapliyal, James Bradbury, Weicheng Kuo, Mojtaba Seyedhosseini, Chao Jia, Burcu Karagol Ayan, Carl...

  18. [18]

    Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling

    Zhe Chen, Weiyun Wang, Yue Cao, Yangzhou Liu, Zhangwei Gao, Erfei Cui, Jinguo Zhu, Shenglong Ye, Hao Tian, Zhaoyang Liu, Lixin Gu, Xuehui Wang, Qingyun Li, Yimin Ren, Zixuan Chen, Jiapeng Luo, Jiahao Wang, Tan Jiang, Bo Wang, Conghui He, Botian Shi, Xingcheng Zhang, Han Lv, Yi Wang, Wenqi Shao, Pei Chu, Zhongying Tu, Tong He, Zhiyong Wu, Huipeng Deng, Jia...

  19. [19]

    InternVL: Scaling up vision foundation models and aligning for generic visual-linguistic tasks

    Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, Bin Li, Ping Luo, Tong Lu, Yu Qiao, and Jifeng Dai. InternVL: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. InCVPR, 2024. 1, 6, 7, 9, 10, 20, 26

  20. [20]

    Remote sensing image scene classification: Benchmark and state of the art

    Gong Cheng, Junwei Han, and Xiaoqiang Lu. Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE, 2017. 9

  21. [21]

    arXiv preprint arXiv:2504.13180 , year=

    Jang Hyun Cho, Andrea Madotto, Effrosyni Mavroudi, Triantafyllos Afouras, Tushar Nagarajan, Muhammad Maaz, Yale Song, Tengyu Ma, Shuming Hu, Hanoona Rasheed, Peize Sun, Po-Yao Huang, Daniel Bolya, Suyog Jain, Miguel Martin, Huiyu Wang, Nikhila Ravi, Shashank Jain, Temmy Stark, Shane Moon, Babak Damavandi, Vivian Lee, Andrew Westbury, Salman Khan, Philipp ...

  22. [22]

    CAT-Seg: Cost aggregation for open-vocabulary semantic segmentation

    Seokju Cho, Heeseong Shin, Sunghwan Hong, Anurag Arnab, Paul Hongsuck Seo, and Seungryong Kim. CAT-Seg: Cost aggregation for open-vocabulary semantic segmentation. InCVPR, 2024. 20

  23. [23]

    Vision transformers need registers

    Timothée Darcet, Maxime Oquab, Julien Mairal, and Piotr Bojanowski. Vision transformers need registers. In ICLR, 2024. 12, 17

  24. [24]

    Mostafa Dehghani, Josip Djolonga, Basil Mustafa, Piotr Padlewski, Jonathan Heek, Justin Gilmer, Andreas Steiner, Mathilde Caron, Robert Geirhos, Ibrahim Alabdulmohsin, Rodolphe Jenatton, Lucas Beyer, Michael Tschannen, Anurag Arnab, Xiao Wang, Carlos Riquelme, Matthias Minderer, Joan Puigcerver, Utku Evci, Manoj Kumar, Sjoerd van Steenkiste, Gamaleldin F....

  25. [25]

    S., Salehi, M., Muennighoff, N., Lo, K., Soldaini, L., et al

    Matt Deitke, Christopher Clark, Sangho Lee, Rohun Tripathi, Yue Yang, Jae Sung Park, Mohammadreza Salehi, Niklas Muennighoff, Kyle Lo, Luca Soldaini, Jiasen Lu, Taira Anderson, Erin Bransom, Kiana Ehsani, Huong Ngo, YenSung Chen, Ajay Patel, Mark Yatskar, Chris Callison-Burch, Andrew Head, Rose Hendrix, Favyen Bastani, Eli VanderBilt, Nathan Lambert, Yvon...

  26. [26]

    ImageNet: A large-scale hierarchical image database

    Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, 2009. 1, 3, 6, 8, 9, 10, 30, 31, 32

  27. [27]

    VirTex: Learning visual representations from textual annotations

    Karan Desai and Justin Johnson. VirTex: Learning visual representations from textual annotations. InCVPR,

  28. [28]

    Decoupling zero-shot semantic segmentation

    Jian Ding, Nan Xue, Gui-Song Xia, and Dengxin Dai. Decoupling zero-shot semantic segmentation. InCVPR,

  29. [29]

    An image is worth 16x16 words: Transformers for image recognition at scale

    Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. InICLR, 2020. 1, 8, 9 36

  30. [30]

    Scalable pre-training of large autoregressive image models

    Alaaeldin El-Nouby, Michal Klein, Shuangfei Zhai, Miguel Angel Bautista, Alexander Toshev, Vaishaal Shankar, Joshua M Susskind, and Armand Joulin. Scalable pre-training of large autoregressive image models. InICML,

  31. [31]

    Scaling language-free visual representation learning

    David Fan, Shengbang Tong, Jiachen Zhu, Koustuv Sinha, Zhuang Liu, Xinlei Chen, Michael Rabbat, Nico- las Ballas, Yann LeCun, Amir Bar, and Saining Xie. Scaling language-free visual representation learning. arXiv:2504.01017, 2025. 12, 13

  32. [32]

    Improving CLIP training with language rewrites

    Lijie Fan, Dilip Krishnan, Phillip Isola, Dina Katabi, and Yonglong Tian. Improving CLIP training with language rewrites. In NeurIPS, 2023. 20

  33. [33]

    Data filtering networks

    Alex Fang, Albin Madappally Jose, Amit Jain, Ludwig Schmidt, Alexander Toshev, and Vaishaal Shankar. Data filtering networks. InICLR, 2024. 1, 3, 9, 16, 20, 26

  34. [34]

    EVA: Exploring the limits of masked visual representation learning at scale

    Yuxin Fang, Wen Wang, Binhui Xie, Quan Sun, Ledell Wu, Xinggang Wang, Tiejun Huang, Xinlong Wang, and Yue Cao. EVA: Exploring the limits of masked visual representation learning at scale. InCVPR, 2023. 1

  35. [35]

    EVA-02: A visual representation for neon genesis.Image and Vision Computing , 2024

    Yuxin Fang, Quan Sun, Xinggang Wang, Tiejun Huang, Xinlong Wang, and Yue Cao. EVA-02: A visual representation for neon genesis.Image and Vision Computing , 2024. 1, 19

  36. [36]

    X3D: Expanding architectures for efficient video recognition

    Christoph Feichtenhofer. X3D: Expanding architectures for efficient video recognition. InCVPR, 2020. 4

  37. [37]

    Toshev, Marcin Eichner, Moin Nabi, Yinfei Yang, Joshua M

    Enrico Fini, Mustafa Shukor, Xiujun Li, Philipp Dufter, Michal Klein, David Haldimann, Sai Aitharaju, Victor Guilherme Turrisi da Costa, Louis Béthune, Zhe Gan, Alexander T. Toshev, Marcin Eichner, Moin Nabi, Yinfei Yang, Joshua M. Susskind, and Alaaeldin El-Nouby. Multimodal autoregressive pre-training of large vision encoders. In CVPR, 2025. 1, 2, 10, 1...

  38. [38]

    Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis

    Chaoyou Fu, Yuhan Dai, Yongdong Luo, Lei Li, Shuhuai Ren, Renrui Zhang, Zihan Wang, Chenyu Zhou, Yunhang Shen, Mengdan Zhang, Peixian Chen, Yanwei Li, Shaohui Lin, Sirui Zhao, Ke Li, Tong Xu, Xiawu Zheng, Enhong Chen, Rongrong Ji, and Xing Sun. Video-MME: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis.arXiv:2405.21...

  39. [39]

    DataComp: In search of the next generation of multimodal datasets

    Samir Yitzhak Gadre, Gabriel Ilharco, Alex Fang, Jonathan Hayase, Georgios Smyrnis, Thao Nguyen, Ryan Marten, Mitchell Wortsman, Dhruba Ghosh, Jieyu Zhang, Eyal Orgad, Rahim Entezari, Giannis Daras, Sarah Pratt, Vivek Ramanujan, Yonatan Bitton, Kalyani Marathe, Stephen Mussmann, Richard Vencu, Mehdi Cherti, Ranjay Krishna, Pang Wei Koh, Olga Saukh, Alexan...

  40. [40]

    Making the v in VQA matter: Elevating the role of image understanding in visual question answering

    Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. Making the v in VQA matter: Elevating the role of image understanding in visual question answering. In CVPR, 2017. 14, 15, 16, 32

  41. [41]

    LVIS: A dataset for large vocabulary instance segmentation

    Agrim Gupta, Piotr Dollár, and Ross Girshick. LVIS: A dataset for large vocabulary instance segmentation. In CVPR, 2019. 19

  42. [42]

    Deep residual learning for image recognition

    Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016. 1

  43. [43]

    Mask R-CNN

    Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask R-CNN. In ICCV, 2017. 11, 12, 19, 29

  44. [44]

    Masked autoencoders are scalable vision learners

    Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. Masked autoencoders are scalable vision learners. In CVPR, 2022. 1, 19

  45. [45]

    RADIOv2.5: Improved baselines for agglomerative vision foundation models

    Greg Heinrich, Mike Ranzinger, Hongxu Yin, Yao Lu, Jan Kautz, Andrew Tao, Bryan Catanzaro, and Pavlo Molchanov. RADIOv2.5: Improved baselines for agglomerative vision foundation models. In CVPR, 2025. 1, 10, 18

  46. [46]

    The many faces of robustness: A critical analysis of out-of-distribution generalization

    Dan Hendrycks, Steven Basart, Norman Mu, Saurav Kadavath, Frank Wang, Evan Dorundo, Rahul Desai, Tyler Zhu, Samyak Parajuli, Mike Guo, Dawn Song, Jacob Steinhardt, and Justin Gilmer. The many faces of robustness: A critical analysis of out-of-distribution generalization. In ICCV, 2021. 3, 8, 9, 30, 31

  47. [47]

    Natural adversarial examples

    Dan Hendrycks, Kevin Zhao, Steven Basart, Jacob Steinhardt, and Dawn Song. Natural adversarial examples. In CVPR, 2021. 3, 4, 8, 9, 30, 31, 32

  48. [48]

    Rotary position embedding for vision transformer

    Byeongho Heo, Song Park, Dongyoon Han, and Sangdoo Yun. Rotary position embedding for vision transformer. In ECCV, 2024. 20

  49. [49]

    Distilling the knowledge in a neural network

    Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. In NeurIPS Deep Learning Workshop, 2015. 8

  50. [50]

    Deep networks with stochastic depth

    Gao Huang, Yu Sun, Zhuang Liu, Daniel Sedra, and Kilian Q Weinberger. Deep networks with stochastic depth. In ECCV, 2016. 14, 17

  51. [51]

    OpenCLIP

    Gabriel Ilharco, Mitchell Wortsman, Ross Wightman, Cade Gordon, Nicholas Carlini, Rohan Taori, Achal Dave, Vaishaal Shankar, Hongseok Namkoong, John Miller, Hannaneh Hajishirzi, Ali Farhadi, and Ludwig Schmidt. OpenCLIP, 2021. 3, 20

  52. [52]

    Space-time correspondence as a contrastive random walk

    Allan Jabri, Andrew Owens, and Alexei Efros. Space-time correspondence as a contrastive random walk. In NeurIPS, 2020. 11, 19, 29

  53. [53]

    TGIF-QA: Toward spatio-temporal reasoning in visual question answering

    Yunseok Jang, Yale Song, Youngjae Yu, Youngjin Kim, and Gunhee Kim. TGIF-QA: Toward spatio-temporal reasoning in visual question answering. In CVPR, 2017. 14, 15, 16, 32

  54. [54]

    Scaling up visual and vision-language representation learning with noisy text supervision

    Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc Le, Yun-Hsuan Sung, Zhen Li, and Tom Duerig. Scaling up visual and vision-language representation learning with noisy text supervision. In ICML, 2021. 1, 20

  55. [55]

    The Kinetics Human Action Video Dataset

    Will Kay, Joao Carreira, Karen Simonyan, Brian Zhang, Chloe Hillier, Sudheendra Vijayanarasimhan, Fabio Viola, Tim Green, Trevor Back, Paul Natsev, Mustafa Suleyman, and Andrew Zisserman. The kinetics human action video dataset. arXiv:1705.06950, 2017. 6, 9, 31, 32

  56. [56]

    Referitgame: Referring to objects in photographs of natural scenes

    Sahar Kazemzadeh, Vicente Ordonez, Mark Matten, and Tamara Berg. Referitgame: Referring to objects in photographs of natural scenes. In EMNLP, 2014. 14, 15, 16, 32, 33

  57. [57]

    A diagram is worth a dozen images

    Aniruddha Kembhavi, Mike Salvato, Eric Kolve, Minjoon Seo, Hannaneh Hajishirzi, and Ali Farhadi. A diagram is worth a dozen images. In ECCV, 2016. 14, 15, 16, 32

  58. [58]

    Segment anything

    Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C. Berg, Wan-Yen Lo, Piotr Dollár, and Ross Girshick. Segment anything. In ICCV,

  59. [59]

    3d object representations for fine-grained categorization

    Jonathan Krause, Michael Stark, Jia Deng, and Li Fei-Fei. 3d object representations for fine-grained categorization. In ICCV Workshop, 2013. 9

  60. [60]

    Visual genome: Connecting language and vision using crowdsourced dense image annotations

    Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A. Shamma, Michael S. Bernstein, and Fei-Fei Li. Visual genome: Connecting language and vision using crowdsourced dense image annotations. IJCV, 2017. 27, 32

  61. [61]

    Imagenet classification with deep convolutional neural networks

    Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In NeurIPS, 2012. 1

  62. [62]

    HMDB: a large video database for human motion recognition

    Hildegard Kuehne, Hueihan Jhuang, Estíbaliz Garrote, Tomaso Poggio, and Thomas Serre. HMDB: a large video database for human motion recognition. In ICCV, 2011. 9, 31, 32

  63. [63]

    F-VLM: open-vocabulary object detection upon frozen vision and language models

    Weicheng Kuo, Yin Cui, Xiuye Gu, A. J. Piergiovanni, and Anelia Angelova. F-VLM: open-vocabulary object detection upon frozen vision and language models. In ICLR, 2023. 20

  64. [64]

    VeCLIP: Improving CLIP training via visual-enriched captions

    Zhengfeng Lai, Haotian Zhang, Bowen Zhang, Wentao Wu, Haoping Bai, Aleksei Timofeev, Xianzhi Du, Zhe Gan, Jiulong Shan, Chen-Nee Chuah, Yinfei Yang, and Meng Cao. VeCLIP: Improving CLIP training via visual-enriched captions. In ECCV, 2024. 5, 20

  65. [65]

    What matters when building vision-language models?

    Hugo Laurençon, Léo Tronchon, Matthieu Cord, and Victor Sanh. What matters when building vision-language models? In NeurIPS, 2024. 27

  66. [66]

    LLaVA-OneVision: Easy visual task transfer

    Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Yanwei Li, Ziwei Liu, and Chunyuan Li. LLaVA-OneVision: Easy visual task transfer. TMLR, 2025. 16, 20, 22

  67. [67]

    Unmasked teacher: Towards training-efficient video foundation models

    Kunchang Li, Yali Wang, Yizhuo Li, Yi Wang, Yinan He, Limin Wang, and Yu Qiao. Unmasked teacher: Towards training-efficient video foundation models. In ICCV, 2023. 9

  68. [68]

    MVBench: A comprehensive multi-modal video understanding benchmark

    Kunchang Li, Yali Wang, Yinan He, Yizhuo Li, Yi Wang, Yi Liu, Zun Wang, Jilan Xu, Guo Chen, Ping Luo, Limin Wang, and Yu Qiao. MVBench: A comprehensive multi-modal video understanding benchmark. In CVPR,

  69. [69]

    Grounded language-image pre-training

    Liunian Harold Li, Pengchuan Zhang, Haotian Zhang, Jianwei Yang, Chunyuan Li, Yiwu Zhong, Lijuan Wang, Lu Yuan, Lei Zhang, Jenq-Neng Hwang, Kai-Wei Chang, and Jianfeng Gao. Grounded language-image pre-training. In CVPR, 2022. 1

  70. [70]

    An inverse scaling law for CLIP training

    Xianhang Li, Zeyu Wang, and Cihang Xie. An inverse scaling law for CLIP training. In NeurIPS, 2023. 3

  71. [71]

    CLIPA-v2: Scaling CLIP training with 81.1% zero-shot imagenet accuracy within a $10,000 budget; an extra $4,000 unlocks 81.8% accuracy

    Xianhang Li, Zeyu Wang, and Cihang Xie. CLIPA-v2: Scaling CLIP training with 81.1% zero-shot imagenet accuracy within a $10,000 budget; an extra $4,000 unlocks 81.8% accuracy. arXiv:2306.15658, 2023. 3, 20

  72. [72]

    Exploring plain vision transformer backbones for object detection

    Yanghao Li, Hanzi Mao, Ross Girshick, and Kaiming He. Exploring plain vision transformer backbones for object detection. In ECCV, 2022. 11, 19, 29

  73. [73]

    Evaluating object hallucination in large vision-language models

    Yifan Li, Yifan Du, Kun Zhou, Jinpeng Wang, Wayne Xin Zhao, and Ji-Rong Wen. Evaluating object hallucination in large vision-language models. In EMNLP, 2023. 14, 15, 16, 32

  74. [74]

    Scaling language-image pre-training via masking

    Yanghao Li, Haoqi Fan, Ronghang Hu, Christoph Feichtenhofer, and Kaiming He. Scaling language-image pre-training via masking. In CVPR, 2023. 20

  75. [75]

    Binsformer: Revisiting adaptive bins for monocular depth estimation

    Zhenyu Li, Xuyang Wang, Xianming Liu, and Junjun Jiang. Binsformer: Revisiting adaptive bins for monocular depth estimation. TIP, 2024. 29

  76. [76]

    Microsoft COCO: Common objects in context

    Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft COCO: Common objects in context. In ECCV, 2014. 2, 6, 9, 12, 14, 15, 16, 19, 27, 31, 32

  77. [77]

    LLaVA-NeXT: Improved reasoning, ocr, and world knowledge

    Haotian Liu, Chunyuan Li, Yuheng Li, Bo Li, Yuanhan Zhang, Sheng Shen, and Yong Jae Lee. LLaVA-NeXT: Improved reasoning, ocr, and world knowledge, 2024. 32, 33

  78. [78]

    Visual instruction tuning

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. NeurIPS, 2024. 20, 23

  79. [79]

    Swin transformer: Hierarchical vision transformer using shifted windows

    Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. In ICCV, 2021. 3, 19

  80. [80]

    Swin transformer v2: Scaling up capacity and resolution

    Ze Liu, Han Hu, Yutong Lin, Zhuliang Yao, Zhenda Xie, Yixuan Wei, Jia Ning, Yue Cao, Zheng Zhang, Li Dong, Furu Wei, and Baining Guo. Swin transformer v2: Scaling up capacity and resolution. In CVPR, 2022. 19

Showing first 80 references.