Perception Encoder: The best visual embeddings are not at the output of the network
Pith reviewed 2026-05-13 22:16 UTC · model grok-4.3
The pith
The best visual embeddings for images and videos come from intermediate layers of a contrastively trained network rather than its final output.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Perception Encoder models are trained solely with contrastive vision-language learning. After scaling the image recipe and refining it with a video data engine, the strongest embeddings for downstream tasks reside in intermediate layers rather than the network output. Language alignment extracts features suited to multimodal language modeling while spatial alignment adapts them for dense prediction, yielding state-of-the-art numbers on zero-shot image and video classification, document and video QA, and tasks such as detection and depth estimation.
What carries the argument
The Perception Encoder (PE) family of models, together with two alignment methods that draw out intermediate-layer embeddings: language alignment for multimodal language modeling and spatial alignment for dense prediction.
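To make "intermediate-layer embeddings" concrete, here is a minimal sketch of keeping every layer's output during a forward pass instead of only the last one. The toy residual layers, `make_layer`, `forward_with_hidden_states`, and the chosen layer index are all hypothetical stand-ins for illustration; none of this is the PE architecture or its released code.

```python
import math
import random


def make_layer(dim, seed):
    """Build a hypothetical toy residual layer: x + tanh(W x)."""
    rng = random.Random(seed)
    w = [[rng.gauss(0.0, 1.0 / math.sqrt(dim)) for _ in range(dim)]
         for _ in range(dim)]

    def layer(x):
        wx = [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in w]
        return [x_i + math.tanh(v) for x_i, v in zip(x, wx)]

    return layer


def forward_with_hidden_states(layers, x):
    """Run the stack and keep the output of every layer, not just the last."""
    hidden = []
    for layer in layers:
        x = layer(x)
        hidden.append(x)
    return hidden


def l2_normalize(v):
    """Scale a vector to unit length (guarding against a zero vector)."""
    norm = math.sqrt(sum(c * c for c in v)) or 1.0
    return [c / norm for c in v]


# Six toy layers and a made-up 8-dimensional input (assumptions for the demo).
layers = [make_layer(dim=8, seed=s) for s in range(6)]
image_features = [0.1 * i for i in range(8)]
hidden_states = forward_with_hidden_states(layers, image_features)

final_embedding = l2_normalize(hidden_states[-1])
# The index 3 is arbitrary here; which depth works best is an empirical question.
intermediate_embedding = l2_normalize(hidden_states[3])
```

With all hidden states available, a downstream head can be attached at any depth, which is the precondition for comparing intermediate and final features at all.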
If this is right
- A single contrastive pretraining recipe produces competitive zero-shot image and video classification and retrieval results.
- The same models reach leading numbers on document, image, and video question answering when paired with an 8B language model.
- Spatial alignment of intermediate features sets a new state-of-the-art on COCO detection at 66.0 box mAP and supports strong tracking and depth estimation.
- One pretraining run covers image, video, and spatial understanding without separate objectives for each.
- Releasing the models and a new annotated video dataset enables direct reuse and further scaling experiments.
Where Pith is reading between the lines
- Architectures could be redesigned to expose or strengthen intermediate representations rather than optimizing only the final layer.
- Contrastive objectives may naturally create a feature hierarchy in which general-purpose information appears before task-specific specialization.
- Scaling studies for vision models might need to measure performance at multiple depths instead of only the output.
- The same intermediate-layer advantage could appear in other modalities if similar contrastive training is applied.
Load-bearing premise
That the intermediate embeddings remain superior to final-layer ones after alignment without any additional task-specific tuning or data selection.
What would settle it
A controlled comparison in which final-output embeddings, after identical language and spatial alignment, match or exceed the reported performance of the intermediate-layer versions on the same zero-shot classification, QA, and detection benchmarks.
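The decision rule behind such a comparison can be sketched as below. The benchmark names and scores are placeholders invented for the example, not results from the paper, and treating "match or exceed on every benchmark" as the refuting outcome is one assumed reading of the criterion above.

```python
def compare_variants(intermediate, final, tolerance=0.0):
    """Per-benchmark verdict: does the final-layer variant match or exceed
    the intermediate-layer variant (within an optional tolerance)?"""
    if intermediate.keys() != final.keys():
        raise ValueError("both variants must be scored on the same benchmarks")
    return {
        name: final[name] >= intermediate[name] - tolerance
        for name in sorted(intermediate)
    }


def claim_challenged(verdicts):
    """The intermediate-layer claim is challenged only if the final layer
    keeps up on every benchmark (an assumed aggregation rule)."""
    return all(verdicts.values())


# Placeholder scores under identical alignment (hypothetical numbers).
intermediate_scores = {"benchmark_a": 0.70, "benchmark_b": 0.55, "benchmark_c": 0.62}
final_scores = {"benchmark_a": 0.71, "benchmark_b": 0.50, "benchmark_c": 0.62}

verdicts = compare_variants(intermediate_scores, final_scores)
print(verdicts, claim_challenged(verdicts))
```

Requiring identical benchmark sets up front is deliberate: a comparison that scores the two variants on different tasks cannot isolate the effect of layer depth.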
Original abstract
We introduce Perception Encoder (PE), a state-of-the-art vision encoder for image and video understanding trained via simple vision-language learning. Traditionally, vision encoders have relied on a variety of pretraining objectives, each tailored to specific downstream tasks such as classification, captioning, or localization. Surprisingly, after scaling our carefully tuned image pretraining recipe and refining with our robust video data engine, we find that contrastive vision-language training alone can produce strong, general embeddings for all of these downstream tasks. There is only one caveat: these embeddings are hidden within the intermediate layers of the network. To draw them out, we introduce two alignment methods: language alignment for multimodal language modeling, and spatial alignment for dense prediction. Together, our PE family of models achieves best-in-class results on a wide variety of tasks, including (1) zero-shot image and video classification and retrieval, simultaneously obtaining 86.6 average zero-shot ImageNet robustness and 76.9 zero-shot Kinetics-400 video classification; (2) document, image, and video Q&A, enabling 94.6 DocVQA, 80.9 InfographicVQA, and 82.7 PerceptionTest with an 8B LLM; and (3) spatial tasks such as detection, tracking, and depth estimation, setting a new COCO state-of-the-art of 66.0 box mAP. To foster further research, we release our models, code, and novel dataset of synthetically and human-annotated videos: https://github.com/facebookresearch/perception_models
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces the Perception Encoder (PE), a vision encoder trained solely via contrastive vision-language learning on scaled image and video data. It claims that the strongest general-purpose embeddings for downstream tasks (zero-shot classification/retrieval, VQA, detection, tracking, depth) reside in intermediate layers rather than the final output; two post-hoc alignment procedures (language alignment for MLLMs, spatial alignment for dense tasks) are introduced to extract them, yielding SOTA numbers such as 86.6 average zero-shot ImageNet robustness, 76.9 zero-shot Kinetics-400, 94.6 DocVQA, 80.9 InfographicVQA, 82.7 PerceptionTest (with 8B LLM), and 66.0 COCO box mAP. Models, code, and a new synthetically/human-annotated video dataset are released.
Significance. If the central empirical claim is substantiated, the work shows that a single, carefully tuned contrastive vision-language recipe can produce versatile embeddings usable across classification, multimodal, and dense tasks after simple alignment steps, reducing the need for task-specific pretraining objectives. The breadth of reported results and public release of models plus data would make this a useful reference point for the community.
major comments (3)
- [Abstract, §4] Abstract and §4 (Experiments): The reported SOTA numbers (e.g., 66.0 COCO mAP, 86.6 ImageNet robustness) are presented without ablations, baseline comparisons, error bars, or details on the video data engine construction. This leaves the generality claim resting on unshown controls for data curation, layer selection, and alignment hyperparameters.
- [§3] §3 (Alignment methods): It is unclear whether the language and spatial alignment procedures use a single fixed recipe or involve per-task data filtering, learning-rate schedules, or validation-based layer selection. If the best intermediate layer is chosen via downstream performance, the 'single general pretraining recipe' claim is undermined.
- [§4.2–4.3] §4.2–4.3 (Downstream results): No comparison is shown to the final-layer embeddings of the same model or to standard contrastive baselines (e.g., CLIP, SigLIP) under identical alignment procedures, making it impossible to isolate the contribution of the intermediate-layer observation versus the alignment steps themselves.
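As an illustration of the error bars requested in the first major comment, a minimal sketch of summarizing repeated runs follows. The per-seed scores are placeholders rather than numbers from the paper, and `summarize_runs` is a hypothetical helper; it reports the mean, the sample standard deviation, and the standard error of the mean.

```python
import math
import statistics


def summarize_runs(scores):
    """Return (mean, sample standard deviation, standard error) over runs."""
    if len(scores) < 2:
        raise ValueError("need at least two runs to estimate spread")
    mean = statistics.mean(scores)
    std = statistics.stdev(scores)  # sample standard deviation (n - 1)
    stderr = std / math.sqrt(len(scores))
    return mean, std, stderr


# Placeholder scores from five hypothetical training seeds.
seed_scores = [0.61, 0.63, 0.62, 0.60, 0.64]
mean, std, stderr = summarize_runs(seed_scores)
print(f"{mean:.3f} +/- {std:.3f} (n={len(seed_scores)})")
```

When large-scale pretraining makes multiple seeds impractical, reporting spread over evaluation resamples is a common fallback, though it measures a different source of variance.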
minor comments (2)
- [§3] Notation for the two alignment losses is introduced without explicit equations; adding them would clarify how they differ from the original contrastive objective.
- [§4] Figure captions and tables lack details on the exact layer indices used for each task; a single table summarizing best-layer indices across benchmarks would improve reproducibility.
Simulated Author's Rebuttal
We thank the referee for the constructive comments. We address each major point below and have revised the manuscript to add requested ablations, clarifications, and comparisons.
Point-by-point responses
- Referee: [Abstract, §4] Abstract and §4 (Experiments): The reported SOTA numbers (e.g., 66.0 COCO mAP, 86.6 ImageNet robustness) are presented without ablations, baseline comparisons, error bars, or details on the video data engine construction. This leaves the generality claim resting on unshown controls for data curation, layer selection, and alignment hyperparameters.
  Authors: We agree that additional controls strengthen the claims. The revised §4 now includes ablations on data curation, layer selection, and alignment hyperparameters, plus expanded details on the video data engine. Error bars are reported for key results; where omitted, this is due to compute limits and is noted. Baseline comparisons have been made more explicit.
  Revision: yes
- Referee: [§3] §3 (Alignment methods): It is unclear whether the language and spatial alignment procedures use a single fixed recipe or involve per-task data filtering, learning-rate schedules, or validation-based layer selection. If the best intermediate layer is chosen via downstream performance, the 'single general pretraining recipe' claim is undermined.
  Authors: The procedures use a single fixed recipe with shared hyperparameters. Layer selection follows a general validation protocol (not per-task optimization or custom filtering/schedules). We have revised §3 to state this explicitly, confirming the pretraining remains general.
  Revision: partial
- Referee: [§4.2–4.3] §4.2–4.3 (Downstream results): No comparison is shown to the final-layer embeddings of the same model or to standard contrastive baselines (e.g., CLIP, SigLIP) under identical alignment procedures, making it impossible to isolate the contribution of the intermediate-layer observation versus the alignment steps themselves.
  Authors: We have added comparisons to final-layer embeddings under identical alignment in revised §4.2–4.3. We also report CLIP and SigLIP baselines aligned identically, isolating the intermediate-layer contribution.
  Revision: yes
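The distinction at stake in the second exchange, one shared layer chosen by a general validation protocol versus a separate best layer per task, can be sketched as follows. This is one possible reading of "general validation protocol", not the authors' procedure; the task names and validation scores are placeholders.

```python
def select_shared_layer(val_scores):
    """Pick a single layer index by mean validation score across all tasks."""
    num_layers = len(next(iter(val_scores.values())))
    means = [
        sum(scores[i] for scores in val_scores.values()) / len(val_scores)
        for i in range(num_layers)
    ]
    return max(range(num_layers), key=means.__getitem__)


def select_per_task_layers(val_scores):
    """Pick the best layer separately for each task (the practice the
    referee warns would undermine a 'single general recipe' claim)."""
    return {
        task: max(range(len(scores)), key=scores.__getitem__)
        for task, scores in val_scores.items()
    }


# Placeholder validation scores: one list of per-layer scores per task.
val_scores = {
    "task_a": [0.2, 0.5, 0.6, 0.4],
    "task_b": [0.3, 0.7, 0.6, 0.5],
    "task_c": [0.1, 0.4, 0.5, 0.6],
}

shared_layer = select_shared_layer(val_scores)
per_task_layers = select_per_task_layers(val_scores)
print(shared_layer, per_task_layers)
```

In this toy data the per-task choice picks a different layer for every task while the shared choice commits to one, which is exactly the gap a reader would want the revised §3 to quantify.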
Circularity Check
No significant circularity in derivation chain
Full rationale
The paper's central claims rest on empirical scaling results from standard contrastive vision-language pretraining, followed by two explicit alignment procedures (language and spatial) whose outputs are evaluated on downstream benchmarks. No equations, fitted parameters, or self-citations are presented as load-bearing derivations that reduce reported performance to inputs by construction. The intermediate-layer superiority is stated as an observed outcome after training, not a definitional or self-referential result. The derivation chain is therefore self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Contrastive vision-language loss produces transferable embeddings when scaled.
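For readers unfamiliar with the objective named in this assumption, below is a minimal sketch of a symmetric softmax contrastive loss of the kind popularized by CLIP. It is a generic textbook form, not the paper's exact loss (which this page does not give); the temperature value and the two-dimensional toy embeddings are assumptions for the demo.

```python
import math


def log_softmax(row):
    """Numerically stable log-softmax of a list of logits."""
    m = max(row)
    lse = m + math.log(sum(math.exp(v - m) for v in row))
    return [v - lse for v in row]


def symmetric_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Average of image-to-text and text-to-image cross-entropy, where the
    i-th image and the i-th text form the positive pair."""
    n = len(image_emb)
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in text_emb]
        for img in image_emb
    ]
    image_to_text = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    columns = [[logits[i][j] for i in range(n)] for j in range(n)]
    text_to_image = -sum(log_softmax(columns[j])[j] for j in range(n)) / n
    return 0.5 * (image_to_text + text_to_image)


# Toy unit-norm embeddings: matched pairs versus deliberately swapped pairs.
images = [[1.0, 0.0], [0.0, 1.0]]
matched_texts = [[1.0, 0.0], [0.0, 1.0]]
swapped_texts = [[0.0, 1.0], [1.0, 0.0]]

loss_matched = symmetric_contrastive_loss(images, matched_texts)
loss_swapped = symmetric_contrastive_loss(images, swapped_texts)
```

The loss is near zero when each image is most similar to its own caption and large when the pairing is swapped, which is the signal that pulls matched image and text embeddings together during training.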
Forward citations
Cited by 22 Pith papers
- What Cohort INRs Encode and Where to Freeze Them
  Optimal INR freeze depth matches highest weight stable rank layer; SAEs reveal SIREN atoms are localized while FFMLP atoms trace cohort contours with causal impact on PSNR.
- VAnim: Rendering-Aware Sparse State Modeling for Structure-Preserving Vector Animation
  VAnim creates open-domain text-to-SVG animations via sparse state updates on a persistent DOM tree, identification-first planning, and rendering-aware RL with a new 134k-example benchmark.
- Large-scale Codec Avatars: The Unreasonable Effectiveness of Large-scale Avatar Pretraining
  Pretraining on 1M wild videos followed by post-training on curated data yields high-fidelity feedforward 3D avatars that generalize across identities, clothing, and lighting with emergent relightability and loose-garm...
- Qwen-Image-VAE-2.0 Technical Report
  Qwen-Image-VAE-2.0 achieves state-of-the-art high-compression image reconstruction and superior diffusability for diffusion models, with a new text-rich document benchmark.
- Cross-Attentive Multiview Fusion of Vision-Language Embeddings
  CAMFusion fuses multiview 2D vision-language embeddings via cross-attention and multiview consistency self-supervision to produce better 3D semantic and instance representations, outperforming averaging and reaching S...
- Why Your Tokenizer Fails in Information Fusion: A Timing-Aware Pre-Quantization Fusion for Video-Enhanced Audio Tokenization
  A timing-aware pre-quantization fusion approach integrates visual cues into audio tokenizers along the temporal axis, maintaining reconstruction quality while outperforming audio-only and prior multimodal baselines on...
- TIPSv2: Advancing Vision-Language Pretraining with Enhanced Patch-Text Alignment
  TIPSv2 improves dense patch-text alignment in vision-language pretraining through distillation and iBOT++ modifications, yielding models on par with or better than recent baselines on 9 tasks across 20 datasets.
- Grounded World Model for Semantically Generalizable Planning
  A vision-language-aligned world model turns visuomotor MPC into a language-following planner that reaches 87% success on 288 unseen semantic tasks where standard VLAs drop to 22%.
- UniversalVTG: A Universal and Lightweight Foundation Model for Video Temporal Grounding
  UniversalVTG is a lightweight foundation model for video temporal grounding that achieves state-of-the-art results across five benchmarks while being over 100 times smaller than recent MLLM-based methods.
- Q-Zoom: Query-Aware Adaptive Perception for Efficient Multimodal Large Language Models
  Q-Zoom achieves up to 4.39x inference speedup in high-resolution MLLM scenarios via query-aware gating and region localization, matching or exceeding baseline accuracy on document and high-res benchmarks.
- Telescope: Learnable Hyperbolic Foveation for Ultra-Long-Range Object Detection
  Telescope uses learnable hyperbolic foveation to deliver a 76% relative mAP gain (0.185 to 0.326) for objects beyond 250 meters while keeping overhead low.
- VideoStir: Understanding Long Videos via Spatio-Temporally Structured and Intent-Aware RAG
  VideoStir introduces a spatio-temporal graph-based structure and intent-aware retrieval for long-video RAG, achieving competitive performance with SOTA methods via a new IR-600K dataset.
- Learning Robust Visual Features in Computed Tomography Enables Efficient Transfer Learning for Clinical Tasks
  VoxelFM learns robust 3D CT visual features via DINO self-distillation that transfer effectively to seven clinical task categories using frozen backbones and lightweight heads, outperforming prior CT foundation models...
- V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning
  V-JEPA 2 pre-trained on massive unlabeled video achieves strong results on motion understanding and action anticipation, SOTA video QA at 8B scale, and enables zero-shot robotic planning on Franka arms using only 62 h...
- SmolVLA: A Vision-Language-Action Model for Affordable and Efficient Robotics
  SmolVLA is a small efficient VLA model that achieves performance comparable to 10x larger models while training on one GPU and deploying on consumer hardware via community data and chunked asynchronous action prediction.
- Enabling High Error Tolerance in Satellite Video Transmissions by Generative Semantic Communication
  A generative semantic communication method for satellite video achieves 2.5 dB higher PSNR than conventional semantic comms at 45% error rate and remains functional above 80% error by combining semantic encoding with ...
- Sapiens2
  Sapiens2 improves pretraining, data scale, and architecture over its predecessor to set new state-of-the-art results on human pose estimation, body-part segmentation, normal estimation, and new tasks like pointmap and...
- Concrete Jungle: Towards Concreteness Paved Contrastive Negative Mining for Compositional Understanding
  Using lexical concreteness to guide contrastive negative mining and a new margin-based Cement loss, the Slipform framework reaches state-of-the-art on compositional benchmarks for vision-language models.
- Revisiting Compositionality in Dual-Encoder Vision-Language Models: The Role of Inference
  Dual-encoder VLMs gain robust compositional generalization by learning localized alignments from frozen patch and token embeddings instead of using global similarity.
- Position: Life-Logging Video Streams Make the Privacy-Utility Trade-off Inevitable
  Life-logging video streams create an inevitable privacy-utility trade-off that is a foundational challenge for always-on AI systems.
- CoCo-SAM3: Harnessing Concept Conflict in Open-Vocabulary Semantic Segmentation
  CoCo-SAM3 improves SAM3 by aligning evidence from synonymous prompts for concept consistency and then running inter-class competition on a unified scale to reduce mask overlaps.
- Adaptive Forensic Feature Refinement via Intrinsic Importance Perception
  I2P adaptively selects the most discriminative layers from visual foundation models for synthetic image detection and constrains task updates to low-sensitivity parameter subspaces to improve specificity without harmi...
Reference graph
Works this paper leans on
-
[1]
Nocaps: Novel object captioning at scale
Harsh Agrawal, Karan Desai, Yufei Wang, Xinlei Chen, Rishabh Jain, Mark Johnson, Dhruv Batra, Devi Parikh, Stefan Lee, and Peter Anderson. Nocaps: Novel object captioning at scale. InICCV, 2019. 14, 15, 16, 32
work page 2019
-
[2]
Pravesh Agrawal, Szymon Antoniak, Emma Bou Hanna, Baptiste Bout, Devendra Chaplot, Jessica Chudnovsky, Diogo Costa, Baudouin De Monicault, Saurabh Garg, Theophile Gervet, Soham Ghosh, Amélie Héliou, Paul Jacob, Albert Q. Jiang, Kartik Khandelwal, Timothée Lacroix, Guillaume Lample, Diego Las Casas, Thibaut Lavril, Teven Le Scao, Andy Lo, William Marshall,...
work page internal anchor Pith review arXiv 2024
-
[3]
Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond
Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-VL: A versatile vision-language model for understanding, localization, text reading, and beyond. arXiv:2308.12966, 2023. 20
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[4]
ObjectNet: A large-scale bias-controlled dataset for pushing the limits of object recognition models
Andrei Barbu, David Mayo, Julian Alverio, William Luo, Christopher Wang, Dan Gutfreund, Josh Tenenbaum, and Boris Katz. ObjectNet: A large-scale bias-controlled dataset for pushing the limits of object recognition models. In NeurIPS, 2019. 3, 4, 6, 8, 9, 10, 30, 31, 32
work page 2019
-
[5]
Lucas Beyer, Andreas Steiner, André Susano Pinto, Alexander Kolesnikov, Xiao Wang, Daniel Salz, Maxim Neumann, Ibrahim Alabdulmohsin, Michael Tschannen, Emanuele Bugliarello, Thomas Unterthiner, Daniel Keysers, Skanda Koppula, Fangyu Liu, Adam Grycner, Alexey A. Gritsenko, Neil Houlsby, Manoj Kumar, Keran Rong, Julian Eisenschlos, Rishabh Kabra, Matthias ...
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[6]
Soft-NMS–Improving object detection with one line of code
Navaneeth Bodla, Bharat Singh, Rama Chellappa, and Larry S Davis. Soft-NMS–Improving object detection with one line of code. InICCV, 2017. 30
work page 2017
-
[7]
Window attention is bugged: how not to interpolate position embeddings
Daniel Bolya, Chaitanya Ryali, Judy Hoffman, and Christoph Feichtenhofer. Window attention is bugged: how not to interpolate position embeddings. InICLR, 2023. 11, 29
work page 2023
-
[8]
Florian Bordes, Randall Balestriero, Quentin Garrido, Adrien Bardes, and Pascal Vincent. Guillotine regulariza- tion: Why removing layers is needed to improve generalization in self-supervised learning.arXiv:2206.13378,
-
[9]
Food-101 – Mining discriminative components with random forests
Lukas Bossard, Matthieu Guillaumin, and Luc Van Gool. Food-101 – Mining discriminative components with random forests. In ECCV, 2014. 9
work page 2014
-
[10]
Gary Bradski. The OpenCV library.Dr. Dobb’s Journal: Software Tools for the Professional Programmer , 2000. 22
work page 2000
-
[11]
Cascade R-CNN: Delving into high quality object detection
Zhaowei Cai and Nuno Vasconcelos. Cascade R-CNN: Delving into high quality object detection. InCVPR,
-
[12]
End-to-end object detection with transformers
Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-end object detection with transformers. InECCV, 2020. 19
work page 2020
-
[13]
Wenhao Chai, Enxin Song, Yilun Du, Chenlin Meng, Vashisht Madhavan, Omer Bar-Tal, Jeng-Neng Hwang, Saining Xie, and Christopher D. Manning. AuroraCap: Efficient, performant video detailed captioning and a new benchmark. In ICLR, 2025. 5
work page 2025
-
[14]
Hybrid task cascade for instance segmentation
Kai Chen, Jiangmiao Pang, Jiaqi Wang, Yu Xiong, Xiaoxiao Li, Shuyang Sun, Wansen Feng, Ziwei Liu, Jianping Shi, Wanli Ouyang, Chen Change Loy, and Dahua Lin. Hybrid task cascade for instance segmentation. InCVPR,
-
[15]
Generative pretraining from pixels
Mark Chen, Alec Radford, Rewon Child, Jeffrey Wu, Heewoo Jun, David Luan, and Ilya Sutskever. Generative pretraining from pixels. InICML, 2020. 20
work page 2020
-
[16]
A simple framework for contrastive learning of visual representations
Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. InICML, 2020. 20 35
work page 2020
-
[17]
Pali: A jointly-scaled multilingual language-image model
Xi Chen, Xiao Wang, Soravit Changpinyo, AJ Piergiovanni, Piotr Padlewski, Daniel Salz, Sebastian Goodman, Adam Grycner, Basil Mustafa, Lucas Beyer, Alexander Kolesnikov, Joan Puigcerver, Nan Ding, Keran Rong, Hassan Akbari, Gaurav Mishra, Linting Xue, Ashish Thapliyal, James Bradbury, Weicheng Kuo, Mojtaba Seyedhosseini, Chao Jia, Burcu Karagol Ayan, Carl...
work page 2023
-
[18]
Zhe Chen, Weiyun Wang, Yue Cao, Yangzhou Liu, Zhangwei Gao, Erfei Cui, Jinguo Zhu, Shenglong Ye, Hao Tian, Zhaoyang Liu, Lixin Gu, Xuehui Wang, Qingyun Li, Yimin Ren, Zixuan Chen, Jiapeng Luo, Jiahao Wang, Tan Jiang, Bo Wang, Conghui He, Botian Shi, Xingcheng Zhang, Han Lv, Yi Wang, Wenqi Shao, Pei Chu, Zhongying Tu, Tong He, Zhiyong Wu, Huipeng Deng, Jia...
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[19]
InternVL: Scaling up vision foundation models and aligning for generic visual-linguistic tasks
Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, Bin Li, Ping Luo, Tong Lu, Yu Qiao, and Jifeng Dai. InternVL: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. InCVPR, 2024. 1, 6, 7, 9, 10, 20, 26
work page 2024
-
[20]
Remote sensing image scene classification: Benchmark and state of the art
Gong Cheng, Junwei Han, and Xiaoqiang Lu. Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE, 2017. 9
work page 2017
-
[21]
arXiv preprint arXiv:2504.13180 , year=
Jang Hyun Cho, Andrea Madotto, Effrosyni Mavroudi, Triantafyllos Afouras, Tushar Nagarajan, Muhammad Maaz, Yale Song, Tengyu Ma, Shuming Hu, Hanoona Rasheed, Peize Sun, Po-Yao Huang, Daniel Bolya, Suyog Jain, Miguel Martin, Huiyu Wang, Nikhila Ravi, Shashank Jain, Temmy Stark, Shane Moon, Babak Damavandi, Vivian Lee, Andrew Westbury, Salman Khan, Philipp ...
-
[22]
CAT-Seg: Cost aggregation for open-vocabulary semantic segmentation
Seokju Cho, Heeseong Shin, Sunghwan Hong, Anurag Arnab, Paul Hongsuck Seo, and Seungryong Kim. CAT-Seg: Cost aggregation for open-vocabulary semantic segmentation. InCVPR, 2024. 20
work page 2024
-
[23]
Vision transformers need registers
Timothée Darcet, Maxime Oquab, Julien Mairal, and Piotr Bojanowski. Vision transformers need registers. In ICLR, 2024. 12, 17
work page 2024
-
[24]
Mostafa Dehghani, Josip Djolonga, Basil Mustafa, Piotr Padlewski, Jonathan Heek, Justin Gilmer, Andreas Steiner, Mathilde Caron, Robert Geirhos, Ibrahim Alabdulmohsin, Rodolphe Jenatton, Lucas Beyer, Michael Tschannen, Anurag Arnab, Xiao Wang, Carlos Riquelme, Matthias Minderer, Joan Puigcerver, Utku Evci, Manoj Kumar, Sjoerd van Steenkiste, Gamaleldin F....
work page 2023
-
[25]
S., Salehi, M., Muennighoff, N., Lo, K., Soldaini, L., et al
Matt Deitke, Christopher Clark, Sangho Lee, Rohun Tripathi, Yue Yang, Jae Sung Park, Mohammadreza Salehi, Niklas Muennighoff, Kyle Lo, Luca Soldaini, Jiasen Lu, Taira Anderson, Erin Bransom, Kiana Ehsani, Huong Ngo, YenSung Chen, Ajay Patel, Mark Yatskar, Chris Callison-Burch, Andrew Head, Rose Hendrix, Favyen Bastani, Eli VanderBilt, Nathan Lambert, Yvon...
-
[26]
ImageNet: A large-scale hierarchical image database
Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, 2009. 1, 3, 6, 8, 9, 10, 30, 31, 32
work page 2009
-
[27]
VirTex: Learning visual representations from textual annotations
Karan Desai and Justin Johnson. VirTex: Learning visual representations from textual annotations. InCVPR,
-
[28]
Decoupling zero-shot semantic segmentation
Jian Ding, Nan Xue, Gui-Song Xia, and Dengxin Dai. Decoupling zero-shot semantic segmentation. InCVPR,
-
[29]
An image is worth 16x16 words: Transformers for image recognition at scale
Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. InICLR, 2020. 1, 8, 9 36
work page 2020
-
[30]
Scalable pre-training of large autoregressive image models
Alaaeldin El-Nouby, Michal Klein, Shuangfei Zhai, Miguel Angel Bautista, Alexander Toshev, Vaishaal Shankar, Joshua M Susskind, and Armand Joulin. Scalable pre-training of large autoregressive image models. InICML,
-
[31]
Scaling language-free visual representation learning
David Fan, Shengbang Tong, Jiachen Zhu, Koustuv Sinha, Zhuang Liu, Xinlei Chen, Michael Rabbat, Nico- las Ballas, Yann LeCun, Amir Bar, and Saining Xie. Scaling language-free visual representation learning. arXiv:2504.01017, 2025. 12, 13
-
[32]
Improving CLIP training with language rewrites
Lijie Fan, Dilip Krishnan, Phillip Isola, Dina Katabi, and Yonglong Tian. Improving CLIP training with language rewrites. In NeurIPS, 2023. 20
work page 2023
-
[33]
Alex Fang, Albin Madappally Jose, Amit Jain, Ludwig Schmidt, Alexander Toshev, and Vaishaal Shankar. Data filtering networks. InICLR, 2024. 1, 3, 9, 16, 20, 26
work page 2024
-
[34]
EVA: Exploring the limits of masked visual representation learning at scale
Yuxin Fang, Wen Wang, Binhui Xie, Quan Sun, Ledell Wu, Xinggang Wang, Tiejun Huang, Xinlong Wang, and Yue Cao. EVA: Exploring the limits of masked visual representation learning at scale. InCVPR, 2023. 1
work page 2023
-
[35]
EVA-02: A visual representation for neon genesis.Image and Vision Computing , 2024
Yuxin Fang, Quan Sun, Xinggang Wang, Tiejun Huang, Xinlong Wang, and Yue Cao. EVA-02: A visual representation for neon genesis.Image and Vision Computing , 2024. 1, 19
work page 2024
-
[36]
X3D: Expanding architectures for efficient video recognition
Christoph Feichtenhofer. X3D: Expanding architectures for efficient video recognition. InCVPR, 2020. 4
work page 2020
-
[37]
Toshev, Marcin Eichner, Moin Nabi, Yinfei Yang, Joshua M
Enrico Fini, Mustafa Shukor, Xiujun Li, Philipp Dufter, Michal Klein, David Haldimann, Sai Aitharaju, Victor Guilherme Turrisi da Costa, Louis Béthune, Zhe Gan, Alexander T. Toshev, Marcin Eichner, Moin Nabi, Yinfei Yang, Joshua M. Susskind, and Alaaeldin El-Nouby. Multimodal autoregressive pre-training of large vision encoders. In CVPR, 2025. 1, 2, 10, 1...
work page 2025
-
[38]
Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis
Chaoyou Fu, Yuhan Dai, Yongdong Luo, Lei Li, Shuhuai Ren, Renrui Zhang, Zihan Wang, Chenyu Zhou, Yunhang Shen, Mengdan Zhang, Peixian Chen, Yanwei Li, Shaohui Lin, Sirui Zhao, Ke Li, Tong Xu, Xiawu Zheng, Enhong Chen, Rongrong Ji, and Xing Sun. Video-MME: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis.arXiv:2405.21...
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[39]
DataComp: In search of the next generation of multimodal datasets
Samir Yitzhak Gadre, Gabriel Ilharco, Alex Fang, Jonathan Hayase, Georgios Smyrnis, Thao Nguyen, Ryan Marten, Mitchell Wortsman, Dhruba Ghosh, Jieyu Zhang, Eyal Orgad, Rahim Entezari, Giannis Daras, Sarah Pratt, Vivek Ramanujan, Yonatan Bitton, Kalyani Marathe, Stephen Mussmann, Richard Vencu, Mehdi Cherti, Ranjay Krishna, Pang Wei Koh, Olga Saukh, Alexan...
work page 2023
-
[40]
Making the v in VQA matter: Elevating the role of image understanding in visual question answering
Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. Making the v in VQA matter: Elevating the role of image understanding in visual question answering. InCVPR, 2017. 14, 15, 16, 32
work page 2017
-
[41]
LVIS: A dataset for large vocabulary instance segmentation
Agrim Gupta, Piotr Dollar, and Ross Girshick. LVIS: A dataset for large vocabulary instance segmentation. In CVPR, 2019. 19
work page 2019
-
[42]
Deep residual learning for image recognition
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016. 1
work page 2016
-
[43]
Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask R-CNN. InICCV, 2017. 11, 12, 19, 29
work page 2017
-
[44]
Masked autoencoders are scalable vision learners
Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. Masked autoencoders are scalable vision learners. InCVPR, 2022. 1, 19
work page 2022
-
[45]
RADIOv2.5: Improved baselines for agglomerative vision foundation models
Greg Heinrich, Mike Ranzinger, Hongxu, Yin, Yao Lu, Jan Kautz, Andrew Tao, Bryan Catanzaro, and Pavlo Molchanov. RADIOv2.5: Improved baselines for agglomerative vision foundation models. InCVPR, 2025. 1, 10, 18
work page 2025
-
[46]
The many faces of robustness: A critical analysis of out-of-distribution generalization
Dan Hendrycks, Steven Basart, Norman Mu, Saurav Kadavath, Frank Wang, Evan Dorundo, Rahul Desai, Tyler Zhu, Samyak Parajuli, Mike Guo, Dawn Song, Jacob Steinhardt, and Justin Gilmer. The many faces of robustness: A critical analysis of out-of-distribution generalization. InICCV, 2021. 3, 8, 9, 30, 31
work page 2021
-
[47]
Dan Hendrycks, Kevin Zhao, Steven Basart, Jacob Steinhardt, and Dawn Song. Natural adversarial examples. In CVPR, 2021. 3, 4, 8, 9, 30, 31, 32
work page 2021
-
[48]
Rotary position embedding for vision transformer
Byeongho Heo, Song Park, Dongyoon Han, and Sangdoo Yun. Rotary position embedding for vision transformer. In ECCV, 2024. 20 37
work page 2024
-
[49]
Distilling the knowledge in a neural network
Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. InNeurIPS Deep Learning Workshop, 2015. 8
work page 2015
-
[50]
Deep networks with stochastic depth
Gao Huang, Yu Sun, Zhuang Liu, Daniel Sedra, and Kilian Q Weinberger. Deep networks with stochastic depth. In ECCV, 2016. 14, 17
work page 2016
-
[51]
Gabriel Ilharco, Mitchell Wortsman, Ross Wightman, Cade Gordon, Nicholas Carlini, Rohan Taori, Achal Dave, Vaishaal Shankar, Hongseok Namkoong, John Miller, Hannaneh Hajishirzi, Ali Farhadi, and Ludwig Schmidt. OpenCLIP, 2021. 3, 20
work page 2021
-
[52]
Space-time correspondence as a contrastive random walk
Allan Jabri, Andrew Owens, and Alexei Efros. Space-time correspondence as a contrastive random walk. In NeurIPS, 2020. 11, 19, 29
work page 2020
-
[53]
TGIF-QA: Toward spatio-temporal reasoning in visual question answering
Yunseok Jang, Yale Song, Youngjae Yu, Youngjin Kim, and Gunhee Kim. TGIF-QA: Toward spatio-temporal reasoning in visual question answering. InCVPR, 2017. 14, 15, 16, 32
work page 2017
-
[54]
Scaling up visual and vision-language representation learning with noisy text supervision
Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc Le, Yun-Hsuan Sung, Zhen Li, and Tom Duerig. Scaling up visual and vision-language representation learning with noisy text supervision. In ICML, 2021. 1, 20
work page 2021
-
[55]
The Kinetics Human Action Video Dataset
Will Kay, Joao Carreira, Karen Simonyan, Brian Zhang, Chloe Hillier, Sudheendra Vijayanarasimhan, Fabio Viola, Tim Green, Trevor Back, Paul Natsev, Mustafa Suleyman, and Andrew Zisserman. The kinetics human action video dataset. arXiv:1705.06950, 2017. 6, 9, 31, 32
-
[56]
Referitgame: Referring to objects in photographs of natural scenes
Sahar Kazemzadeh, Vicente Ordonez, Mark Matten, and Tamara Berg. Referitgame: Referring to objects in photographs of natural scenes. In EMNLP, 2014. 14, 15, 16, 32, 33
-
[57]
A diagram is worth a dozen images
Aniruddha Kembhavi, Mike Salvato, Eric Kolve, Minjoon Seo, Hannaneh Hajishirzi, and Ali Farhadi. A diagram is worth a dozen images. In ECCV, 2016. 14, 15, 16, 32
-
[58]
Segment anything
Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C. Berg, Wan-Yen Lo, Piotr Dollár, and Ross Girshick. Segment anything. In ICCV,
-
[59]
3d object representations for fine-grained categorization
Jonathan Krause, Michael Stark, Jia Deng, and Li Fei-Fei. 3d object representations for fine-grained categorization. In ICCV Workshop, 2013. 9
-
[60]
Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A. Shamma, Michael S. Bernstein, and Fei-Fei Li. Visual genome: Connecting language and vision using crowdsourced dense image annotations. IJCV, 2017. 27, 32
-
[61]
Imagenet classification with deep convolutional neural networks
Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In NeurIPS, 2012. 1
-
[62]
HMDB: a large video database for human motion recognition
Hildegard Kuehne, Hueihan Jhuang, Estíbaliz Garrote, Tomaso Poggio, and Thomas Serre. HMDB: a large video database for human motion recognition. In ICCV, 2011. 9, 31, 32
-
[63]
Weicheng Kuo, Yin Cui, Xiuye Gu, A. J. Piergiovanni, and Anelia Angelova. F-VLM: open-vocabulary object detection upon frozen vision and language models. In ICLR, 2023. 20
-
[64]
VeCLIP: Improving CLIP training via visual-enriched captions
Zhengfeng Lai, Haotian Zhang, Bowen Zhang, Wentao Wu, Haoping Bai, Aleksei Timofeev, Xianzhi Du, Zhe Gan, Jiulong Shan, Chen-Nee Chuah, Yinfei Yang, and Meng Cao. VeCLIP: Improving CLIP training via visual-enriched captions. In ECCV, 2024. 5, 20
-
[65]
What matters when building vision-language models?
Hugo Laurençon, Léo Tronchon, Matthieu Cord, and Victor Sanh. What matters when building vision-language models? In NeurIPS, 2024. 27
-
[66]
LLaVA-OneVision: Easy visual task transfer
Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Yanwei Li, Ziwei Liu, and Chunyuan Li. LLaVA-OneVision: Easy visual task transfer. TMLR, 2025. 16, 20, 22
-
[67]
Unmasked teacher: Towards training-efficient video foundation models
Kunchang Li, Yali Wang, Yizhuo Li, Yi Wang, Yinan He, Limin Wang, and Yu Qiao. Unmasked teacher: Towards training-efficient video foundation models. In ICCV, 2023. 9
-
[68]
MVBench: A comprehensive multi-modal video understanding benchmark
Kunchang Li, Yali Wang, Yinan He, Yizhuo Li, Yi Wang, Yi Liu, Zun Wang, Jilan Xu, Guo Chen, Ping Luo, Limin Wang, and Yu Qiao. MVBench: A comprehensive multi-modal video understanding benchmark. In CVPR,
-
[69]
Grounded language-image pre-training
Liunian Harold Li, Pengchuan Zhang, Haotian Zhang, Jianwei Yang, Chunyuan Li, Yiwu Zhong, Lijuan Wang, Lu Yuan, Lei Zhang, Jenq-Neng Hwang, Kai-Wei Chang, and Jianfeng Gao. Grounded language-image pre-training. In CVPR, 2022. 1
-
[70]
An inverse scaling law for CLIP training
Xianhang Li, Zeyu Wang, and Cihang Xie. An inverse scaling law for CLIP training. In NeurIPS, 2023. 3
-
[71]
Xianhang Li, Zeyu Wang, and Cihang Xie. CLIPA-v2: Scaling CLIP training with 81.1% zero-shot imagenet accuracy within a $10,000 budget; an extra $4,000 unlocks 81.8% accuracy. arXiv:2306.15658, 2023. 3, 20
-
[72]
Exploring plain vision transformer backbones for object detection
Yanghao Li, Hanzi Mao, Ross Girshick, and Kaiming He. Exploring plain vision transformer backbones for object detection. In ECCV, 2022. 11, 19, 29
-
[73]
Evaluating object hallucination in large vision-language models
Yifan Li, Yifan Du, Kun Zhou, Jinpeng Wang, Wayne Xin Zhao, and Ji-Rong Wen. Evaluating object hallucination in large vision-language models. In EMNLP, 2023. 14, 15, 16, 32
-
[74]
Scaling language-image pre-training via masking
Yanghao Li, Haoqi Fan, Ronghang Hu, Christoph Feichtenhofer, and Kaiming He. Scaling language-image pre-training via masking. In CVPR, 2023. 20
-
[75]
Binsformer: Revisiting adaptive bins for monocular depth estimation
Zhenyu Li, Xuyang Wang, Xianming Liu, and Junjun Jiang. Binsformer: Revisiting adaptive bins for monocular depth estimation. TIP, 2024. 29
-
[76]
Microsoft COCO: Common objects in context
Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft COCO: Common objects in context. In ECCV, 2014. 2, 6, 9, 12, 14, 15, 16, 19, 27, 31, 32
-
[77]
LLaVA-NeXT: Improved reasoning, ocr, and world knowledge
Haotian Liu, Chunyuan Li, Yuheng Li, Bo Li, Yuanhan Zhang, Sheng Shen, and Yong Jae Lee. LLaVA-NeXT: Improved reasoning, ocr, and world knowledge, 2024. 32, 33
-
[78]
Visual instruction tuning
Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. NeurIPS, 2024. 20, 23
-
[79]
Swin transformer: Hierarchical vision transformer using shifted windows
Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. In ICCV, 2021. 3, 19
-
[80]
Swin transformer v2: Scaling up capacity and resolution
Ze Liu, Han Hu, Yutong Lin, Zhuliang Yao, Zhenda Xie, Yixuan Wei, Jia Ning, Yue Cao, Zheng Zhang, Li Dong, Furu Wei, and Baining Guo. Swin transformer v2: Scaling up capacity and resolution. In CVPR, 2022. 19