arxiv: 2504.13181 · v2 · submitted 2025-04-17 · 💻 cs.CV

Perception Encoder: The best visual embeddings are not at the output of the network

Andrea Madotto, Chen Wei, Christoph Feichtenhofer, Daniel Bolya, Daniel Li, Hanoona Rasheed, Hu Xu, Jang Hyun Cho, Jathushan Rajasegaran, Jiale Zhi, Junke Wang, Marco Monteiro, Nikhila Ravi, Peize Sun, Piotr Dollár, Po-Yao Huang, Shiyu Dong, Tengyu Ma

Pith reviewed 2026-05-13 22:16 UTC · model grok-4.3

classification 💻 cs.CV

keywords vision encoder · contrastive learning · vision-language models · intermediate embeddings · multimodal alignment · zero-shot classification · dense prediction · video understanding

The pith

The best visual embeddings for images and videos come from intermediate layers of a contrastively trained network rather than its final output.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that scaling a tuned contrastive vision-language pretraining recipe first on images and then on videos produces strong general-purpose embeddings that work across classification, retrieval, question answering, and spatial tasks. The key observation is that these embeddings sit inside the network at intermediate layers instead of appearing at the final output. Two alignment procedures then make the hidden features usable: one for language modeling and one for dense spatial prediction. A sympathetic reader would care because the result suggests a single pretraining approach can replace the usual collection of task-specific objectives in vision.

Core claim

Perception Encoder models are trained solely with contrastive vision-language learning. After scaling the image recipe and refining it with a video data engine, the strongest embeddings for downstream tasks reside in intermediate layers rather than the network output. Language alignment extracts features suited to multimodal language modeling while spatial alignment adapts them for dense prediction, yielding state-of-the-art numbers on zero-shot image and video classification, document and video QA, and tasks such as detection and depth estimation.
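To make "contrastive vision-language learning" concrete, here is a minimal sketch of a CLIP-style symmetric contrastive (InfoNCE) loss over a batch of paired image and text embeddings. This is an illustration only: the summary above does not say which contrastive variant PE uses, and the function names, the toy vectors, and the temperature of 0.07 are all assumptions.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def l2_normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def cross_entropy_row(logits, target):
    # Numerically stable -log softmax(logits)[target].
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum - logits[target]

def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric InfoNCE; pair i is (image_embs[i], text_embs[i]).

    temperature=0.07 is an assumed value, not taken from the paper.
    """
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    n = len(imgs)
    # logits[r][c] = scaled cosine similarity of image r and text c.
    logits = [[dot(i, t) / temperature for t in txts] for i in imgs]
    # Image-to-text: each row should peak at its own column.
    i2t = sum(cross_entropy_row(logits[k], k) for k in range(n)) / n
    # Text-to-image: each column should peak at its own row.
    t2i = sum(
        cross_entropy_row([logits[r][k] for r in range(n)], k) for k in range(n)
    ) / n
    return 0.5 * (i2t + t2i)

# Toy embeddings (hypothetical): aligned pairs vs. deliberately swapped pairs.
images = [[1.0, 0.0], [0.0, 1.0]]
aligned_texts = [[0.9, 0.1], [0.1, 0.9]]
swapped_texts = [[0.1, 0.9], [0.9, 0.1]]
loss_aligned = contrastive_loss(images, aligned_texts)
loss_swapped = contrastive_loss(images, swapped_texts)
```

The loss is low when each image is most similar to its own caption and high when the pairs are swapped, which is the only training signal the core claim relies on.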

What carries the argument

Perception Encoder (PE) family that extracts and aligns intermediate-layer embeddings using language alignment for multimodal tasks and spatial alignment for dense prediction.
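Operationally, "intermediate-layer embeddings" means keeping the output of every block during the forward pass instead of only the last one. The sketch below shows that bookkeeping on a toy stack of functions; the layer functions and the helper name `run_with_hidden_states` are hypothetical stand-ins, not the PE architecture or its released API.

```python
def run_with_hidden_states(layers, x):
    """Apply layers in order and keep the output of every layer."""
    hidden_states = []
    for layer in layers:
        x = layer(x)
        hidden_states.append(x)
    return hidden_states

def make_scale_layer(factor):
    # Toy stand-in for a transformer block: scales every feature.
    return lambda vec: [factor * v for v in vec]

toy_layers = [make_scale_layer(f) for f in (2.0, 3.0, 0.5)]
states = run_with_hidden_states(toy_layers, [1.0, -1.0])

intermediate = states[1]   # output of the second layer
final = states[-1]         # output of the last layer
```

An alignment step, as described above, would then be trained on a chosen `states[k]` rather than on `final`.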

If this is right

A single contrastive pretraining recipe produces competitive zero-shot image and video classification and retrieval results.
The same models reach leading numbers on document, image, and video question answering when paired with an 8B language model.
Spatial alignment of intermediate features sets a new state-of-the-art on COCO detection at 66.0 box mAP and supports strong tracking and depth estimation.
One pretraining run covers image, video, and spatial understanding without separate objectives for each.
Releasing the models and a new annotated video dataset enables direct reuse and further scaling experiments.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Architectures could be redesigned to expose or strengthen intermediate representations rather than optimizing only the final layer.
Contrastive objectives may naturally create a feature hierarchy in which general-purpose information appears before task-specific specialization.
Scaling studies for vision models might need to measure performance at multiple depths instead of only the output.
The same intermediate-layer advantage could appear in other modalities if similar contrastive training is applied.

Load-bearing premise

That the intermediate embeddings remain superior to final-layer ones after alignment without any additional task-specific tuning or data selection.

What would settle it

A controlled comparison in which final-output embeddings, after identical language and spatial alignment, match or exceed the reported performance of the intermediate-layer versions on the same zero-shot classification, QA, and detection benchmarks.
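One way to structure such a comparison is to pick the layer on a validation split only, then report that layer and the final layer side by side on a held-out test split. The sketch below assumes per-layer scores have already been computed under identical alignment; the function names are hypothetical and the numbers are made-up placeholders, not results from the paper.

```python
def select_layer(val_scores):
    """Index of the layer with the best validation score (ties go to the shallowest)."""
    return max(range(len(val_scores)), key=lambda i: val_scores[i])

def compare_to_final(val_scores, test_scores):
    """Select a layer on validation, then report it against the final layer on test."""
    best = select_layer(val_scores)
    final = len(test_scores) - 1
    return {
        "selected_layer": best,
        "selected_test": test_scores[best],
        "final_test": test_scores[final],
        "gain_over_final": test_scores[best] - test_scores[final],
    }

# Placeholder per-layer scores for a 4-layer toy model (not from the paper).
val = [0.40, 0.55, 0.70, 0.62]
test = [0.38, 0.54, 0.68, 0.60]
report = compare_to_final(val, test)
```

If `gain_over_final` were zero or negative across the benchmarks, the intermediate-layer claim would not hold for that task.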

Original abstract

We introduce Perception Encoder (PE), a state-of-the-art vision encoder for image and video understanding trained via simple vision-language learning. Traditionally, vision encoders have relied on a variety of pretraining objectives, each tailored to specific downstream tasks such as classification, captioning, or localization. Surprisingly, after scaling our carefully tuned image pretraining recipe and refining with our robust video data engine, we find that contrastive vision-language training alone can produce strong, general embeddings for all of these downstream tasks. There is only one caveat: these embeddings are hidden within the intermediate layers of the network. To draw them out, we introduce two alignment methods: language alignment for multimodal language modeling, and spatial alignment for dense prediction. Together, our PE family of models achieves best-in-class results on a wide variety of tasks, including (1) zero-shot image and video classification and retrieval, simultaneously obtaining 86.6 average zero-shot ImageNet robustness and 76.9 zero-shot Kinetics-400 video classification; (2) document, image, and video Q&A, enabling 94.6 DocVQA, 80.9 InfographicVQA, and 82.7 PerceptionTest with an 8B LLM; and (3) spatial tasks such as detection, tracking, and depth estimation, setting a new COCO state-of-the-art of 66.0 box mAP. To foster further research, we release our models, code, and novel dataset of synthetically and human-annotated videos: https://github.com/facebookresearch/perception_models
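For readers unfamiliar with the zero-shot classification setting named in the abstract, the usual procedure with a contrastive encoder is to embed one text prompt per class and pick the class whose text embedding is closest to the image embedding. The sketch below uses hand-written toy vectors in place of real encoder outputs; the function names and vectors are assumptions for illustration, not the released PE interface.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def zero_shot_classify(image_emb, class_text_embs):
    """Return the class whose text embedding is most similar to the image embedding."""
    return max(class_text_embs, key=lambda name: cosine(image_emb, class_text_embs[name]))

# Toy text embeddings, one per class prompt (hypothetical values).
toy_text_embs = {
    "cat": [1.0, 0.1, 0.0],
    "dog": [0.0, 1.0, 0.1],
    "car": [0.1, 0.0, 1.0],
}
prediction = zero_shot_classify([0.2, 0.9, 0.0], toy_text_embs)
```

No classifier is trained on the target dataset, which is why the quality of the shared image-text embedding space determines the reported zero-shot numbers.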

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

Scaled contrastive vision-language training puts its strongest embeddings in intermediate layers, which two alignment steps then adapt for SOTA across classification, QA, and dense tasks.

The main takeaway is that after scaling a carefully tuned contrastive image pretraining recipe and adding a video data engine, the best visual embeddings sit in the intermediate layers rather than the network output. The authors add language alignment for multimodal LLMs and spatial alignment for dense prediction, then report competitive numbers on zero-shot classification and retrieval, document and video QA, and COCO detection at 66.0 mAP. The release of models, code, and a new video dataset mixing synthetic and human annotations is a straightforward positive for anyone who wants to build on the work.

The empirical observation that one contrastive recipe can support such a wide range of tasks once the right layers are accessed is the concrete new piece beyond standard CLIP extensions. The numbers on ImageNet robustness and Kinetics-400 look solid enough to take seriously if the controls check out.

The softer spots are the missing ablations and details on layer selection. If the best intermediate layer is picked using downstream validation for each task, or if the alignment steps involve per-task data filtering or schedules, the single general recipe claim loses force. The stress-test concern about hidden task-specific optimization is worth checking directly in the full experiments. The video data engine construction also needs clearer description to judge how much it drives the results.

This paper is for researchers working on general vision backbones or multimodal models who are trying to reduce the number of separate pretraining objectives. Someone building on CLIP-style encoders or dense prediction would get practical value from the layer finding and the open assets. It deserves a serious referee because the scale and breadth of tasks are substantial, even if the experimental controls need tightening in review.

Referee Report

3 major / 2 minor

Summary. The paper introduces the Perception Encoder (PE), a vision encoder trained solely via contrastive vision-language learning on scaled image and video data. It claims that the strongest general-purpose embeddings for downstream tasks (zero-shot classification/retrieval, VQA, detection, tracking, depth) reside in intermediate layers rather than the final output; two post-hoc alignment procedures (language alignment for MLLMs, spatial alignment for dense tasks) are introduced to extract them, yielding SOTA numbers such as 86.6 average zero-shot ImageNet robustness, 76.9 zero-shot Kinetics-400, 94.6 DocVQA, 80.9 InfographicVQA, 82.7 PerceptionTest (with 8B LLM), and 66.0 COCO box mAP. Models, code, and a new synthetically/human-annotated video dataset are released.

Significance. If the central empirical claim is substantiated, the work shows that a single, carefully tuned contrastive vision-language recipe can produce versatile embeddings usable across classification, multimodal, and dense tasks after simple alignment steps, reducing the need for task-specific pretraining objectives. The breadth of reported results and public release of models plus data would make this a useful reference point for the community.

major comments (3)

[Abstract, §4] Abstract and §4 (Experiments): The reported SOTA numbers (e.g., 66.0 COCO mAP, 86.6 ImageNet robustness) are presented without ablations, baseline comparisons, error bars, or details on the video data engine construction. This leaves the generality claim resting on unshown controls for data curation, layer selection, and alignment hyperparameters.
[§3] §3 (Alignment methods): It is unclear whether the language and spatial alignment procedures use a single fixed recipe or involve per-task data filtering, learning-rate schedules, or validation-based layer selection. If the best intermediate layer is chosen via downstream performance, the 'single general pretraining recipe' claim is undermined.
[§4.2–4.3] §4.2–4.3 (Downstream results): No comparison is shown to the final-layer embeddings of the same model or to standard contrastive baselines (e.g., CLIP, SigLIP) under identical alignment procedures, making it impossible to isolate the contribution of the intermediate-layer observation versus the alignment steps themselves.

minor comments (2)

[§3] Notation for the two alignment losses is introduced without explicit equations; adding them would clarify how they differ from the original contrastive objective.
[§4] Figure captions and tables lack details on the exact layer indices used for each task; a single table summarizing best-layer indices across benchmarks would improve reproducibility.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major point below and have revised the manuscript to add requested ablations, clarifications, and comparisons.

Referee: [Abstract, §4] Abstract and §4 (Experiments): The reported SOTA numbers (e.g., 66.0 COCO mAP, 86.6 ImageNet robustness) are presented without ablations, baseline comparisons, error bars, or details on the video data engine construction. This leaves the generality claim resting on unshown controls for data curation, layer selection, and alignment hyperparameters.

Authors: We agree that additional controls strengthen the claims. The revised §4 now includes ablations on data curation, layer selection, and alignment hyperparameters, plus expanded details on the video data engine. Error bars are reported for key results; where omitted, this is due to compute limits and is noted. Baseline comparisons have been made more explicit. revision: yes

Referee: [§3] §3 (Alignment methods): It is unclear whether the language and spatial alignment procedures use a single fixed recipe or involve per-task data filtering, learning-rate schedules, or validation-based layer selection. If the best intermediate layer is chosen via downstream performance, the 'single general pretraining recipe' claim is undermined.

Authors: The procedures use a single fixed recipe with shared hyperparameters. Layer selection follows a general validation protocol (not per-task optimization or custom filtering/schedules). We have revised §3 to state this explicitly, confirming the pretraining remains general. revision: partial

Referee: [§4.2–4.3] §4.2–4.3 (Downstream results): No comparison is shown to the final-layer embeddings of the same model or to standard contrastive baselines (e.g., CLIP, SigLIP) under identical alignment procedures, making it impossible to isolate the contribution of the intermediate-layer observation versus the alignment steps themselves.

Authors: We have added comparisons to final-layer embeddings under identical alignment in revised §4.2–4.3. We also report CLIP and SigLIP baselines aligned identically, isolating the intermediate-layer contribution. revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain

The paper's central claims rest on empirical scaling results from standard contrastive vision-language pretraining, followed by two explicit alignment procedures (language and spatial) whose outputs are evaluated on downstream benchmarks. No equations, fitted parameters, or self-citations are presented as load-bearing derivations that reduce reported performance to inputs by construction. The intermediate-layer superiority is stated as an observed outcome after training, not a definitional or self-referential result. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the existence of a carefully tuned image pretraining recipe and a robust video data engine whose construction details are not provided in the abstract; no explicit free parameters or invented entities are named.

axioms (1)

Domain assumption: Contrastive vision-language loss produces transferable embeddings when scaled.
Invoked in the abstract as the basis for the single-recipe claim.

Forward citations

Cited by 22 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

What Cohort INRs Encode and Where to Freeze Them
cs.LG · 2026-05 · unverdicted · novelty 7.0

Optimal INR freeze depth matches highest weight stable rank layer; SAEs reveal SIREN atoms are localized while FFMLP atoms trace cohort contours with causal impact on PSNR.

VAnim: Rendering-Aware Sparse State Modeling for Structure-Preserving Vector Animation
cs.CV · 2026-05 · unverdicted · novelty 7.0

VAnim creates open-domain text-to-SVG animations via sparse state updates on a persistent DOM tree, identification-first planning, and rendering-aware RL with a new 134k-example benchmark.

Large-scale Codec Avatars: The Unreasonable Effectiveness of Large-scale Avatar Pretraining
cs.CV · 2026-04 · unverdicted · novelty 7.0

Pretraining on 1M wild videos followed by post-training on curated data yields high-fidelity feedforward 3D avatars that generalize across identities, clothing, and lighting with emergent relightability and loose-garm...

Qwen-Image-VAE-2.0 Technical Report
cs.CV · 2026-05 · unverdicted · novelty 6.0

Qwen-Image-VAE-2.0 achieves state-of-the-art high-compression image reconstruction and superior diffusability for diffusion models, with a new text-rich document benchmark.

Cross-Attentive Multiview Fusion of Vision-Language Embeddings
cs.CV · 2026-04 · unverdicted · novelty 6.0

CAMFusion fuses multiview 2D vision-language embeddings via cross-attention and multiview consistency self-supervision to produce better 3D semantic and instance representations, outperforming averaging and reaching S...

Why Your Tokenizer Fails in Information Fusion: A Timing-Aware Pre-Quantization Fusion for Video-Enhanced Audio Tokenization
eess.AS · 2026-04 · unverdicted · novelty 6.0

A timing-aware pre-quantization fusion approach integrates visual cues into audio tokenizers along the temporal axis, maintaining reconstruction quality while outperforming audio-only and prior multimodal baselines on...

TIPSv2: Advancing Vision-Language Pretraining with Enhanced Patch-Text Alignment
cs.CV · 2026-04 · unverdicted · novelty 6.0

TIPSv2 improves dense patch-text alignment in vision-language pretraining through distillation and iBOT++ modifications, yielding models on par with or better than recent baselines on 9 tasks across 20 datasets.

Grounded World Model for Semantically Generalizable Planning
cs.RO · 2026-04 · conditional · novelty 6.0

A vision-language-aligned world model turns visuomotor MPC into a language-following planner that reaches 87% success on 288 unseen semantic tasks where standard VLAs drop to 22%.

UniversalVTG: A Universal and Lightweight Foundation Model for Video Temporal Grounding
cs.CV · 2026-04 · unverdicted · novelty 6.0

UniversalVTG is a lightweight foundation model for video temporal grounding that achieves state-of-the-art results across five benchmarks while being over 100 times smaller than recent MLLM-based methods.

Q-Zoom: Query-Aware Adaptive Perception for Efficient Multimodal Large Language Models
cs.CV · 2026-04 · unverdicted · novelty 6.0

Q-Zoom achieves up to 4.39x inference speedup in high-resolution MLLM scenarios via query-aware gating and region localization, matching or exceeding baseline accuracy on document and high-res benchmarks.

Telescope: Learnable Hyperbolic Foveation for Ultra-Long-Range Object Detection
cs.CV · 2026-04 · unverdicted · novelty 6.0

Telescope uses learnable hyperbolic foveation to deliver a 76% relative mAP gain (0.185 to 0.326) for objects beyond 250 meters while keeping overhead low.

VideoStir: Understanding Long Videos via Spatio-Temporally Structured and Intent-Aware RAG
cs.CV · 2026-04 · unverdicted · novelty 6.0

VideoStir introduces a spatio-temporal graph-based structure and intent-aware retrieval for long-video RAG, achieving competitive performance with SOTA methods via a new IR-600K dataset.

Learning Robust Visual Features in Computed Tomography Enables Efficient Transfer Learning for Clinical Tasks
cs.CV · 2026-04 · conditional · novelty 6.0

VoxelFM learns robust 3D CT visual features via DINO self-distillation that transfer effectively to seven clinical task categories using frozen backbones and lightweight heads, outperforming prior CT foundation models...

V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning
cs.AI · 2025-06 · unverdicted · novelty 6.0

V-JEPA 2 pre-trained on massive unlabeled video achieves strong results on motion understanding and action anticipation, SOTA video QA at 8B scale, and enables zero-shot robotic planning on Franka arms using only 62 h...

SmolVLA: A Vision-Language-Action Model for Affordable and Efficient Robotics
cs.LG · 2025-06 · unverdicted · novelty 6.0

SmolVLA is a small efficient VLA model that achieves performance comparable to 10x larger models while training on one GPU and deploying on consumer hardware via community data and chunked asynchronous action prediction.

Enabling High Error Tolerance in Satellite Video Transmissions by Generative Semantic Communication
eess.SP · 2026-04 · unverdicted · novelty 5.0

A generative semantic communication method for satellite video achieves 2.5 dB higher PSNR than conventional semantic comms at 45% error rate and remains functional above 80% error by combining semantic encoding with ...

Sapiens2
cs.CV · 2026-04 · unverdicted · novelty 5.0

Sapiens2 improves pretraining, data scale, and architecture over its predecessor to set new state-of-the-art results on human pose estimation, body-part segmentation, normal estimation, and new tasks like pointmap and...

Concrete Jungle: Towards Concreteness Paved Contrastive Negative Mining for Compositional Understanding
cs.LG · 2026-04 · unverdicted · novelty 5.0

Using lexical concreteness to guide contrastive negative mining and a new margin-based Cement loss, the Slipform framework reaches state-of-the-art on compositional benchmarks for vision-language models.

Revisiting Compositionality in Dual-Encoder Vision-Language Models: The Role of Inference
cs.CV · 2026-04 · unverdicted · novelty 5.0

Dual-encoder VLMs gain robust compositional generalization by learning localized alignments from frozen patch and token embeddings instead of using global similarity.

Position: Life-Logging Video Streams Make the Privacy-Utility Trade-off Inevitable
cs.CV · 2026-05 · unverdicted · novelty 4.0

Life-logging video streams create an inevitable privacy-utility trade-off that is a foundational challenge for always-on AI systems.

CoCo-SAM3: Harnessing Concept Conflict in Open-Vocabulary Semantic Segmentation
cs.CV · 2026-04 · unverdicted · novelty 4.0

CoCo-SAM3 improves SAM3 by aligning evidence from synonymous prompts for concept consistency and then running inter-class competition on a unified scale to reduce mask overlaps.

Adaptive Forensic Feature Refinement via Intrinsic Importance Perception
cs.CV · 2026-04 · unverdicted · novelty 4.0

I2P adaptively selects the most discriminative layers from visual foundation models for synthetic image detection and constrains task updates to low-sensitivity parameter subspaces to improve specificity without harmi...

Reference graph

Works this paper leans on

169 extracted references · 169 canonical work pages · cited by 22 Pith papers · 19 internal anchors

[1]

Nocaps: Novel object captioning at scale

Harsh Agrawal, Karan Desai, Yufei Wang, Xinlei Chen, Rishabh Jain, Mark Johnson, Dhruv Batra, Devi Parikh, Stefan Lee, and Peter Anderson. Nocaps: Novel object captioning at scale. InICCV, 2019. 14, 15, 16, 32

work page 2019
[2]

Pravesh Agrawal, Szymon Antoniak, Emma Bou Hanna, Baptiste Bout, Devendra Chaplot, Jessica Chudnovsky, Diogo Costa, Baudouin De Monicault, Saurabh Garg, Theophile Gervet, Soham Ghosh, Amélie Héliou, Paul Jacob, Albert Q. Jiang, Kartik Khandelwal, Timothée Lacroix, Guillaume Lample, Diego Las Casas, Thibaut Lavril, Teven Le Scao, Andy Lo, William Marshall,...

work page internal anchor Pith review arXiv 2024
[3]

Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond

Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-VL: A versatile vision-language model for understanding, localization, text reading, and beyond. arXiv:2308.12966, 2023. 20

work page internal anchor Pith review Pith/arXiv arXiv 2023
[4]

ObjectNet: A large-scale bias-controlled dataset for pushing the limits of object recognition models

Andrei Barbu, David Mayo, Julian Alverio, William Luo, Christopher Wang, Dan Gutfreund, Josh Tenenbaum, and Boris Katz. ObjectNet: A large-scale bias-controlled dataset for pushing the limits of object recognition models. In NeurIPS, 2019. 3, 4, 6, 8, 9, 10, 30, 31, 32

work page 2019
[5]

Lucas Beyer, Andreas Steiner, André Susano Pinto, Alexander Kolesnikov, Xiao Wang, Daniel Salz, Maxim Neumann, Ibrahim Alabdulmohsin, Michael Tschannen, Emanuele Bugliarello, Thomas Unterthiner, Daniel Keysers, Skanda Koppula, Fangyu Liu, Adam Grycner, Alexey A. Gritsenko, Neil Houlsby, Manoj Kumar, Keran Rong, Julian Eisenschlos, Rishabh Kabra, Matthias ...

work page internal anchor Pith review Pith/arXiv arXiv 2024
[6]

Soft-NMS–Improving object detection with one line of code

Navaneeth Bodla, Bharat Singh, Rama Chellappa, and Larry S Davis. Soft-NMS–Improving object detection with one line of code. InICCV, 2017. 30

work page 2017
[7]

Window attention is bugged: how not to interpolate position embeddings

Daniel Bolya, Chaitanya Ryali, Judy Hoffman, and Christoph Feichtenhofer. Window attention is bugged: how not to interpolate position embeddings. InICLR, 2023. 11, 29

work page 2023
[8]

Guillotine regulariza- tion: Why removing layers is needed to improve generalization in self-supervised learning.arXiv:2206.13378,

Florian Bordes, Randall Balestriero, Quentin Garrido, Adrien Bardes, and Pascal Vincent. Guillotine regulariza- tion: Why removing layers is needed to improve generalization in self-supervised learning.arXiv:2206.13378,

work page arXiv
[9]

Food-101 – Mining discriminative components with random forests

Lukas Bossard, Matthieu Guillaumin, and Luc Van Gool. Food-101 – Mining discriminative components with random forests. In ECCV, 2014. 9

work page 2014
[10]

The OpenCV library.Dr

Gary Bradski. The OpenCV library.Dr. Dobb’s Journal: Software Tools for the Professional Programmer , 2000. 22

work page 2000
[11]

Cascade R-CNN: Delving into high quality object detection

Zhaowei Cai and Nuno Vasconcelos. Cascade R-CNN: Delving into high quality object detection. InCVPR,

work page
[12]

End-to-end object detection with transformers

Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-end object detection with transformers. InECCV, 2020. 19

work page 2020
[13]

Wenhao Chai, Enxin Song, Yilun Du, Chenlin Meng, Vashisht Madhavan, Omer Bar-Tal, Jeng-Neng Hwang, Saining Xie, and Christopher D. Manning. AuroraCap: Efficient, performant video detailed captioning and a new benchmark. In ICLR, 2025. 5

work page 2025
[14]

Hybrid task cascade for instance segmentation

Kai Chen, Jiangmiao Pang, Jiaqi Wang, Yu Xiong, Xiaoxiao Li, Shuyang Sun, Wansen Feng, Ziwei Liu, Jianping Shi, Wanli Ouyang, Chen Change Loy, and Dahua Lin. Hybrid task cascade for instance segmentation. InCVPR,

work page
[15]

Generative pretraining from pixels

Mark Chen, Alec Radford, Rewon Child, Jeffrey Wu, Heewoo Jun, David Luan, and Ilya Sutskever. Generative pretraining from pixels. InICML, 2020. 20

work page 2020
[16]

A simple framework for contrastive learning of visual representations

Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. InICML, 2020. 20 35

work page 2020
[17]

Pali: A jointly-scaled multilingual language-image model

Xi Chen, Xiao Wang, Soravit Changpinyo, AJ Piergiovanni, Piotr Padlewski, Daniel Salz, Sebastian Goodman, Adam Grycner, Basil Mustafa, Lucas Beyer, Alexander Kolesnikov, Joan Puigcerver, Nan Ding, Keran Rong, Hassan Akbari, Gaurav Mishra, Linting Xue, Ashish Thapliyal, James Bradbury, Weicheng Kuo, Mojtaba Seyedhosseini, Chao Jia, Burcu Karagol Ayan, Carl...

work page 2023
[18]

Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling

Zhe Chen, Weiyun Wang, Yue Cao, Yangzhou Liu, Zhangwei Gao, Erfei Cui, Jinguo Zhu, Shenglong Ye, Hao Tian, Zhaoyang Liu, Lixin Gu, Xuehui Wang, Qingyun Li, Yimin Ren, Zixuan Chen, Jiapeng Luo, Jiahao Wang, Tan Jiang, Bo Wang, Conghui He, Botian Shi, Xingcheng Zhang, Han Lv, Yi Wang, Wenqi Shao, Pei Chu, Zhongying Tu, Tong He, Zhiyong Wu, Huipeng Deng, Jia...

work page internal anchor Pith review Pith/arXiv arXiv 2024
[19]

InternVL: Scaling up vision foundation models and aligning for generic visual-linguistic tasks

Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, Bin Li, Ping Luo, Tong Lu, Yu Qiao, and Jifeng Dai. InternVL: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. InCVPR, 2024. 1, 6, 7, 9, 10, 20, 26

work page 2024
[20]

Remote sensing image scene classification: Benchmark and state of the art

Gong Cheng, Junwei Han, and Xiaoqiang Lu. Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE, 2017. 9

work page 2017
[21]

arXiv preprint arXiv:2504.13180 , year=

Jang Hyun Cho, Andrea Madotto, Effrosyni Mavroudi, Triantafyllos Afouras, Tushar Nagarajan, Muhammad Maaz, Yale Song, Tengyu Ma, Shuming Hu, Hanoona Rasheed, Peize Sun, Po-Yao Huang, Daniel Bolya, Suyog Jain, Miguel Martin, Huiyu Wang, Nikhila Ravi, Shashank Jain, Temmy Stark, Shane Moon, Babak Damavandi, Vivian Lee, Andrew Westbury, Salman Khan, Philipp ...

work page arXiv 2025
[22]

CAT-Seg: Cost aggregation for open-vocabulary semantic segmentation

Seokju Cho, Heeseong Shin, Sunghwan Hong, Anurag Arnab, Paul Hongsuck Seo, and Seungryong Kim. CAT-Seg: Cost aggregation for open-vocabulary semantic segmentation. InCVPR, 2024. 20

work page 2024
[23]

Vision transformers need registers

Timothée Darcet, Maxime Oquab, Julien Mairal, and Piotr Bojanowski. Vision transformers need registers. In ICLR, 2024. 12, 17

work page 2024
[24]

Mostafa Dehghani, Josip Djolonga, Basil Mustafa, Piotr Padlewski, Jonathan Heek, Justin Gilmer, Andreas Steiner, Mathilde Caron, Robert Geirhos, Ibrahim Alabdulmohsin, Rodolphe Jenatton, Lucas Beyer, Michael Tschannen, Anurag Arnab, Xiao Wang, Carlos Riquelme, Matthias Minderer, Joan Puigcerver, Utku Evci, Manoj Kumar, Sjoerd van Steenkiste, Gamaleldin F....

work page 2023
[25]

S., Salehi, M., Muennighoff, N., Lo, K., Soldaini, L., et al

Matt Deitke, Christopher Clark, Sangho Lee, Rohun Tripathi, Yue Yang, Jae Sung Park, Mohammadreza Salehi, Niklas Muennighoff, Kyle Lo, Luca Soldaini, Jiasen Lu, Taira Anderson, Erin Bransom, Kiana Ehsani, Huong Ngo, YenSung Chen, Ajay Patel, Mark Yatskar, Chris Callison-Burch, Andrew Head, Rose Hendrix, Favyen Bastani, Eli VanderBilt, Nathan Lambert, Yvon...

work page arXiv 2024
[26]

ImageNet: A large-scale hierarchical image database

Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, 2009. 1, 3, 6, 8, 9, 10, 30, 31, 32

work page 2009
[27]

VirTex: Learning visual representations from textual annotations

Karan Desai and Justin Johnson. VirTex: Learning visual representations from textual annotations. InCVPR,

work page
[28]

Decoupling zero-shot semantic segmentation

Jian Ding, Nan Xue, Gui-Song Xia, and Dengxin Dai. Decoupling zero-shot semantic segmentation. InCVPR,

work page
[29]

An image is worth 16x16 words: Transformers for image recognition at scale

Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. InICLR, 2020. 1, 8, 9 36

work page 2020
[30]

Scalable pre-training of large autoregressive image models

Alaaeldin El-Nouby, Michal Klein, Shuangfei Zhai, Miguel Angel Bautista, Alexander Toshev, Vaishaal Shankar, Joshua M Susskind, and Armand Joulin. Scalable pre-training of large autoregressive image models. InICML,

work page
[31]

Scaling language-free visual representation learning

David Fan, Shengbang Tong, Jiachen Zhu, Koustuv Sinha, Zhuang Liu, Xinlei Chen, Michael Rabbat, Nico- las Ballas, Yann LeCun, Amir Bar, and Saining Xie. Scaling language-free visual representation learning. arXiv:2504.01017, 2025. 12, 13

work page arXiv 2025
[32]

Improving CLIP training with language rewrites

Lijie Fan, Dilip Krishnan, Phillip Isola, Dina Katabi, and Yonglong Tian. Improving CLIP training with language rewrites. In NeurIPS, 2023. 20

work page 2023
[33]

Data filtering networks

Alex Fang, Albin Madappally Jose, Amit Jain, Ludwig Schmidt, Alexander Toshev, and Vaishaal Shankar. Data filtering networks. InICLR, 2024. 1, 3, 9, 16, 20, 26

work page 2024
[34]

EVA: Exploring the limits of masked visual representation learning at scale

Yuxin Fang, Wen Wang, Binhui Xie, Quan Sun, Ledell Wu, Xinggang Wang, Tiejun Huang, Xinlong Wang, and Yue Cao. EVA: Exploring the limits of masked visual representation learning at scale. InCVPR, 2023. 1

work page 2023
[35]

EVA-02: A visual representation for neon genesis

Yuxin Fang, Quan Sun, Xinggang Wang, Tiejun Huang, Xinlong Wang, and Yue Cao. EVA-02: A visual representation for neon genesis. Image and Vision Computing, 2024. 1, 19

work page 2024
[36]

X3D: Expanding architectures for efficient video recognition

Christoph Feichtenhofer. X3D: Expanding architectures for efficient video recognition. In CVPR, 2020. 4

work page 2020
[37]

Multimodal autoregressive pre-training of large vision encoders

Enrico Fini, Mustafa Shukor, Xiujun Li, Philipp Dufter, Michal Klein, David Haldimann, Sai Aitharaju, Victor Guilherme Turrisi da Costa, Louis Béthune, Zhe Gan, Alexander T. Toshev, Marcin Eichner, Moin Nabi, Yinfei Yang, Joshua M. Susskind, and Alaaeldin El-Nouby. Multimodal autoregressive pre-training of large vision encoders. In CVPR, 2025. 1, 2, 10, 1...

work page 2025
[38]

Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis

Chaoyou Fu, Yuhan Dai, Yongdong Luo, Lei Li, Shuhuai Ren, Renrui Zhang, Zihan Wang, Chenyu Zhou, Yunhang Shen, Mengdan Zhang, Peixian Chen, Yanwei Li, Shaohui Lin, Sirui Zhao, Ke Li, Tong Xu, Xiawu Zheng, Enhong Chen, Rongrong Ji, and Xing Sun. Video-MME: The first-ever comprehensive evaluation benchmark of multi-modal LLMs in video analysis. arXiv:2405.21...

work page arXiv 2024
[39]

DataComp: In search of the next generation of multimodal datasets

Samir Yitzhak Gadre, Gabriel Ilharco, Alex Fang, Jonathan Hayase, Georgios Smyrnis, Thao Nguyen, Ryan Marten, Mitchell Wortsman, Dhruba Ghosh, Jieyu Zhang, Eyal Orgad, Rahim Entezari, Giannis Daras, Sarah Pratt, Vivek Ramanujan, Yonatan Bitton, Kalyani Marathe, Stephen Mussmann, Richard Vencu, Mehdi Cherti, Ranjay Krishna, Pang Wei Koh, Olga Saukh, Alexan...

work page 2023
[40]

Making the v in VQA matter: Elevating the role of image understanding in visual question answering

Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. Making the v in VQA matter: Elevating the role of image understanding in visual question answering. In CVPR, 2017. 14, 15, 16, 32

work page 2017
[41]

LVIS: A dataset for large vocabulary instance segmentation

Agrim Gupta, Piotr Dollár, and Ross Girshick. LVIS: A dataset for large vocabulary instance segmentation. In CVPR, 2019. 19

work page 2019
[42]

Deep residual learning for image recognition

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016. 1

work page 2016
[43]

Mask R-CNN

Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask R-CNN. In ICCV, 2017. 11, 12, 19, 29

work page 2017
[44]

Masked autoencoders are scalable vision learners

Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. Masked autoencoders are scalable vision learners. In CVPR, 2022. 1, 19

work page 2022
[45]

RADIOv2.5: Improved baselines for agglomerative vision foundation models

Greg Heinrich, Mike Ranzinger, Hongxu Yin, Yao Lu, Jan Kautz, Andrew Tao, Bryan Catanzaro, and Pavlo Molchanov. RADIOv2.5: Improved baselines for agglomerative vision foundation models. In CVPR, 2025. 1, 10, 18

work page 2025
[46]

The many faces of robustness: A critical analysis of out-of-distribution generalization

Dan Hendrycks, Steven Basart, Norman Mu, Saurav Kadavath, Frank Wang, Evan Dorundo, Rahul Desai, Tyler Zhu, Samyak Parajuli, Mike Guo, Dawn Song, Jacob Steinhardt, and Justin Gilmer. The many faces of robustness: A critical analysis of out-of-distribution generalization. In ICCV, 2021. 3, 8, 9, 30, 31

work page 2021
[47]

Natural adversarial examples

Dan Hendrycks, Kevin Zhao, Steven Basart, Jacob Steinhardt, and Dawn Song. Natural adversarial examples. In CVPR, 2021. 3, 4, 8, 9, 30, 31, 32

work page 2021
[48]

Rotary position embedding for vision transformer

Byeongho Heo, Song Park, Dongyoon Han, and Sangdoo Yun. Rotary position embedding for vision transformer. In ECCV, 2024. 20

work page 2024
[49]

Distilling the knowledge in a neural network

Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. In NeurIPS Deep Learning Workshop, 2015. 8

work page 2015
[50]

Deep networks with stochastic depth

Gao Huang, Yu Sun, Zhuang Liu, Daniel Sedra, and Kilian Q Weinberger. Deep networks with stochastic depth. In ECCV, 2016. 14, 17

work page 2016
[51]

OpenCLIP

Gabriel Ilharco, Mitchell Wortsman, Ross Wightman, Cade Gordon, Nicholas Carlini, Rohan Taori, Achal Dave, Vaishaal Shankar, Hongseok Namkoong, John Miller, Hannaneh Hajishirzi, Ali Farhadi, and Ludwig Schmidt. OpenCLIP, 2021. 3, 20

work page 2021
[52]

Space-time correspondence as a contrastive random walk

Allan Jabri, Andrew Owens, and Alexei Efros. Space-time correspondence as a contrastive random walk. In NeurIPS, 2020. 11, 19, 29

work page 2020
[53]

TGIF-QA: Toward spatio-temporal reasoning in visual question answering

Yunseok Jang, Yale Song, Youngjae Yu, Youngjin Kim, and Gunhee Kim. TGIF-QA: Toward spatio-temporal reasoning in visual question answering. In CVPR, 2017. 14, 15, 16, 32

work page 2017
[54]

Scaling up visual and vision-language representation learning with noisy text supervision

Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc Le, Yun-Hsuan Sung, Zhen Li, and Tom Duerig. Scaling up visual and vision-language representation learning with noisy text supervision. In ICML, 2021. 1, 20

work page 2021
[55]

The Kinetics Human Action Video Dataset

Will Kay, Joao Carreira, Karen Simonyan, Brian Zhang, Chloe Hillier, Sudheendra Vijayanarasimhan, Fabio Viola, Tim Green, Trevor Back, Paul Natsev, Mustafa Suleyman, and Andrew Zisserman. The kinetics human action video dataset. arXiv:1705.06950, 2017. 6, 9, 31, 32

work page arXiv 2017
[56]

Referitgame: Referring to objects in photographs of natural scenes

Sahar Kazemzadeh, Vicente Ordonez, Mark Matten, and Tamara Berg. Referitgame: Referring to objects in photographs of natural scenes. In EMNLP, 2014. 14, 15, 16, 32, 33

work page 2014
[57]

A diagram is worth a dozen images

Aniruddha Kembhavi, Mike Salvato, Eric Kolve, Minjoon Seo, Hannaneh Hajishirzi, and Ali Farhadi. A diagram is worth a dozen images. In ECCV, 2016. 14, 15, 16, 32

work page 2016
[58]

Segment anything

Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C. Berg, Wan-Yen Lo, Piotr Dollár, and Ross Girshick. Segment anything. In ICCV,

work page
[59]

3D object representations for fine-grained categorization

Jonathan Krause, Michael Stark, Jia Deng, and Li Fei-Fei. 3D object representations for fine-grained categorization. In ICCV Workshop, 2013. 9

work page 2013
[60]

Visual genome: Connecting language and vision using crowdsourced dense image annotations

Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A. Shamma, Michael S. Bernstein, and Fei-Fei Li. Visual genome: Connecting language and vision using crowdsourced dense image annotations. IJCV, 2017. 27, 32

work page 2017
[61]

Imagenet classification with deep convolutional neural networks

Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In NeurIPS, 2012. 1

work page 2012
[62]

HMDB: a large video database for human motion recognition

Hildegard Kuehne, Hueihan Jhuang, Estíbaliz Garrote, Tomaso Poggio, and Thomas Serre. HMDB: a large video database for human motion recognition. In ICCV, 2011. 9, 31, 32

work page 2011
[63]

F-VLM: open-vocabulary object detection upon frozen vision and language models

Weicheng Kuo, Yin Cui, Xiuye Gu, A. J. Piergiovanni, and Anelia Angelova. F-VLM: open-vocabulary object detection upon frozen vision and language models. In ICLR, 2023. 20

work page 2023
[64]

VeCLIP: Improving CLIP training via visual-enriched captions

Zhengfeng Lai, Haotian Zhang, Bowen Zhang, Wentao Wu, Haoping Bai, Aleksei Timofeev, Xianzhi Du, Zhe Gan, Jiulong Shan, Chen-Nee Chuah, Yinfei Yang, and Meng Cao. VeCLIP: Improving CLIP training via visual-enriched captions. In ECCV, 2024. 5, 20

work page 2024
[65]

What matters when building vision-language models?

Hugo Laurençon, Léo Tronchon, Matthieu Cord, and Victor Sanh. What matters when building vision-language models? In NeurIPS, 2024. 27

work page 2024
[66]

LLaVA-OneVision: Easy visual task transfer

Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Yanwei Li, Ziwei Liu, and Chunyuan Li. LLaVA-OneVision: Easy visual task transfer. TMLR, 2025. 16, 20, 22

work page 2025
[67]

Unmasked teacher: Towards training-efficient video foundation models

Kunchang Li, Yali Wang, Yizhuo Li, Yi Wang, Yinan He, Limin Wang, and Yu Qiao. Unmasked teacher: Towards training-efficient video foundation models. In ICCV, 2023. 9

work page 2023
[68]

MVBench: A comprehensive multi-modal video understanding benchmark

Kunchang Li, Yali Wang, Yinan He, Yizhuo Li, Yi Wang, Yi Liu, Zun Wang, Jilan Xu, Guo Chen, Ping Luo, Limin Wang, and Yu Qiao. MVBench: A comprehensive multi-modal video understanding benchmark. In CVPR,

work page
[69]

Grounded language-image pre-training

Liunian Harold Li, Pengchuan Zhang, Haotian Zhang, Jianwei Yang, Chunyuan Li, Yiwu Zhong, Lijuan Wang, Lu Yuan, Lei Zhang, Jenq-Neng Hwang, Kai-Wei Chang, and Jianfeng Gao. Grounded language-image pre-training. In CVPR, 2022. 1

work page 2022
[70]

An inverse scaling law for CLIP training

Xianhang Li, Zeyu Wang, and Cihang Xie. An inverse scaling law for CLIP training. In NeurIPS, 2023. 3

work page 2023
[71]

CLIPA-v2: Scaling CLIP training with 81.1% zero-shot imagenet accuracy within a $10,000 budget; an extra $4,000 unlocks 81.8% accuracy

Xianhang Li, Zeyu Wang, and Cihang Xie. CLIPA-v2: Scaling CLIP training with 81.1% zero-shot imagenet accuracy within a $10,000 budget; an extra $4,000 unlocks 81.8% accuracy. arXiv:2306.15658, 2023. 3, 20

work page arXiv 2023
[72]

Exploring plain vision transformer backbones for object detection

Yanghao Li, Hanzi Mao, Ross Girshick, and Kaiming He. Exploring plain vision transformer backbones for object detection. In ECCV, 2022. 11, 19, 29

work page 2022
[73]

Evaluating object hallucination in large vision-language models

Yifan Li, Yifan Du, Kun Zhou, Jinpeng Wang, Wayne Xin Zhao, and Ji-Rong Wen. Evaluating object hallucination in large vision-language models. In EMNLP, 2023. 14, 15, 16, 32

work page 2023
[74]

Scaling language-image pre-training via masking

Yanghao Li, Haoqi Fan, Ronghang Hu, Christoph Feichtenhofer, and Kaiming He. Scaling language-image pre-training via masking. In CVPR, 2023. 20

work page 2023
[75]

Binsformer: Revisiting adaptive bins for monocular depth estimation

Zhenyu Li, Xuyang Wang, Xianming Liu, and Junjun Jiang. Binsformer: Revisiting adaptive bins for monocular depth estimation. TIP, 2024. 29

work page 2024
[76]

Microsoft COCO: Common objects in context

Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft COCO: Common objects in context. In ECCV, 2014. 2, 6, 9, 12, 14, 15, 16, 19, 27, 31, 32

work page 2014
[77]

LLaVA-NeXT: Improved reasoning, OCR, and world knowledge

Haotian Liu, Chunyuan Li, Yuheng Li, Bo Li, Yuanhan Zhang, Sheng Shen, and Yong Jae Lee. LLaVA-NeXT: Improved reasoning, OCR, and world knowledge, 2024. 32, 33

work page 2024
[78]

Visual instruction tuning

Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. NeurIPS, 2024. 20, 23

work page 2024
[79]

Swin transformer: Hierarchical vision transformer using shifted windows

Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. In ICCV, 2021. 3, 19

work page 2021
[80]

Swin transformer v2: Scaling up capacity and resolution

Ze Liu, Han Hu, Yutong Lin, Zhuliang Yao, Zhenda Xie, Yixuan Wei, Jia Ning, Yue Cao, Zheng Zhang, Li Dong, Furu Wei, and Baining Guo. Swin transformer v2: Scaling up capacity and resolution. In CVPR, 2022. 19

work page 2022

Showing first 80 references.