hub Mixed citations

Perception Encoder: The best visual embeddings are not at the output of the network

Daniel Bolya, Po-Yao Huang, Peize Sun, Jang Hyun Cho, Andrea Madotto, Chen Wei · 2025 · cs.CV · arXiv 2504.13181

Mixed citation behavior. Most common role is background (67%).

40 Pith papers citing it

Background 67% of classified citations

open full Pith review browse 40 citing papers arXiv PDF

abstract

We introduce Perception Encoder (PE), a state-of-the-art vision encoder for image and video understanding trained via simple vision-language learning. Traditionally, vision encoders have relied on a variety of pretraining objectives, each tailored to specific downstream tasks such as classification, captioning, or localization. Surprisingly, after scaling our carefully tuned image pretraining recipe and refining with our robust video data engine, we find that contrastive vision-language training alone can produce strong, general embeddings for all of these downstream tasks. There is only one caveat: these embeddings are hidden within the intermediate layers of the network. To draw them out, we introduce two alignment methods: language alignment for multimodal language modeling, and spatial alignment for dense prediction. Together, our PE family of models achieves best-in-class results on a wide variety of tasks, including (1) zero-shot image and video classification and retrieval, simultaneously obtaining 86.6 average zero-shot ImageNet robustness and 76.9 zero-shot Kinetics-400 video classification; (2) document, image, and video Q&A, enabling 94.6 DocVQA, 80.9 InfographicVQA, and 82.7 PerceptionTest with an 8B LLM; and (3) spatial tasks such as detection, tracking, and depth estimation, setting a new COCO state-of-the-art of 66.0 box mAP. To foster further research, we release our models, code, and novel dataset of synthetically and human-annotated videos: https://github.com/facebookresearch/perception_models

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 6 method 3

citation-polarity summary

background 6 use method 2 unclear 1

representative citing papers

Is Dimensionality a Barrier for Retrieval Models?

cs.LG · 2026-05-22 · unverdicted · novelty 8.0

Dimension d = O(m^{-2} log n) nearly achieves the optimal margin m^rd(+∞, A) for retrieval embeddings, with matching lower bounds showing d = O(k log(n/k)) suffices and is necessary for m = Θ(k^{-1/2}) on k-sparse query matrices.

Spatially Prompted Visual Trajectory Prediction for Egocentric Manipulation

cs.CV · 2026-05-19 · unverdicted · novelty 7.0

The paper introduces SP-VTP as a new setting for egocentric manipulation, releases the EgoSPT dataset with first-frame spatial annotations, and proposes the SPOT model that outperforms non-prompted baselines on cross-scene trajectory prediction.

What Cohort INRs Encode and Where to Freeze Them

cs.LG · 2026-05-08 · unverdicted · novelty 7.0

Optimal INR freeze depth matches highest weight stable rank layer; SAEs reveal SIREN atoms are localized while FFMLP atoms trace cohort contours with causal impact on PSNR.

VAnim: Rendering-Aware Sparse State Modeling for Structure-Preserving Vector Animation

cs.CV · 2026-05-02 · unverdicted · novelty 7.0

VAnim creates open-domain text-to-SVG animations via sparse state updates on a persistent DOM tree, identification-first planning, and rendering-aware RL with a new 134k-example benchmark.

Large-scale Codec Avatars: The Unreasonable Effectiveness of Large-scale Avatar Pretraining

cs.CV · 2026-04-02 · unverdicted · novelty 7.0

Pretraining on 1M wild videos followed by post-training on curated data yields high-fidelity feedforward 3D avatars that generalize across identities, clothing, and lighting with emergent relightability and loose-garment support.

Adapting MLLMs for Nuanced Video Retrieval

cs.CV · 2025-12-15 · unverdicted · novelty 7.0

Text-only contrastive fine-tuning of an MLLM with hard negatives produces embeddings that handle temporal, negation, and multimodal nuances in video retrieval and achieves SOTA performance.

SAM 3: Segment Anything with Concepts

cs.CV · 2025-11-20 · unverdicted · novelty 7.0

SAM 3 introduces promptable concept segmentation that doubles accuracy of prior systems on images and videos while improving standard SAM segmentation performance.

Concepts in Motion: Temporal Concept Bottleneck Model for Interpretable Video Classification

cs.CV · 2025-09-25 · unverdicted · novelty 7.0

MoTIF adds temporal self-attention and automatic VLM-based concept discovery to concept bottleneck models for interpretable video classification, showing gains over prior global CBMs on benchmarks.

Uncovering the Latent Potential of Deep Intermediate Representations

cs.LG · 2026-05-21 · unverdicted · novelty 6.0

Introduces LOES, a constructive spectral method to select task-discriminative subspaces from intermediate layer embeddings, and GeoReg for enforcing simplicial class geometry during fine-tuning, with reported gains increasing with model depth across modalities.

RADAR: Relative Angular Divergence Across Representations

cs.LG · 2026-05-21 · unverdicted · novelty 6.0

RADAR is a geometrically grounded metric that predicts cross-domain transferability by comparing layer-wise representation trajectory distributions in foundation models.

DySink: Dynamic Frame Sinks for Autoregressive Long Video Generation

cs.CV · 2026-05-20 · unverdicted · novelty 6.0

DySink uses adaptive retrieval of relevant historical frames plus a sink anomaly gate to improve dynamic degree and temporal quality in minute-long autoregressive video generation.

Improved Baselines with Representation Autoencoders

cs.CV · 2026-05-18 · conditional · novelty 6.0

RAE v2 reaches gFID 1.06 on ImageNet-256 in 80 epochs by combining multi-layer encoder sums, complementary REPA targets, and free guidance via output reparameterization.

SparseSAM: Structured Sparsification of Activations in Segment Anything Models

cs.CV · 2026-05-17 · unverdicted · novelty 6.0

SparseSAM achieves 2x faster inference and 2.8x memory reduction in SAM with only 0.004 mIoU loss at 0.4 density via Stripe-Sort Attention and Residual-Consistency MLP.

SegRAG: Training-Free Retrieval-Augmented Semantic Segmentation

cs.CV · 2026-05-17 · unverdicted · novelty 6.0 · 2 refs

SegRAG is a training-free retrieval-augmented framework that extracts class-specific point prompts from a filtered DINOv3 feature bank to boost SAM3 semantic segmentation performance on standard and agricultural benchmarks.

OProver: A Unified Framework for Agentic Formal Theorem Proving

cs.CL · 2026-05-17 · unverdicted · novelty 6.0

OProver-32B achieves top Pass@32 scores on MiniF2F, ProverBench, and PutnamBench by combining continued pretraining with iterative agentic proving, retrieval, SFT on repairs, and RL on unresolved cases using a 6.86M-proof dataset.

Qwen-Image-VAE-2.0 Technical Report

cs.CV · 2026-05-13 · unverdicted · novelty 6.0

Qwen-Image-VAE-2.0 achieves state-of-the-art high-compression image reconstruction and superior diffusability for diffusion models, with a new text-rich document benchmark.

Cross-Attentive Multiview Fusion of Vision-Language Embeddings

cs.CV · 2026-04-14 · unverdicted · novelty 6.0

CAMFusion fuses multiview 2D vision-language embeddings via cross-attention and multiview consistency self-supervision to produce better 3D semantic and instance representations, outperforming averaging and reaching SOTA on benchmarks including zero-shot out-of-domain cases.

Why Your Tokenizer Fails in Information Fusion: A Timing-Aware Pre-Quantization Fusion for Video-Enhanced Audio Tokenization

eess.AS · 2026-04-13 · unverdicted · novelty 6.0

A timing-aware pre-quantization fusion approach integrates visual cues into audio tokenizers along the temporal axis, maintaining reconstruction quality while outperforming audio-only and prior multimodal baselines on downstream tasks.

TIPSv2: Advancing Vision-Language Pretraining with Enhanced Patch-Text Alignment

cs.CV · 2026-04-13 · unverdicted · novelty 6.0

TIPSv2 improves dense patch-text alignment in vision-language pretraining through distillation and iBOT++ modifications, yielding models on par with or better than recent baselines on 9 tasks across 20 datasets.

Grounded World Model for Semantically Generalizable Planning

cs.RO · 2026-04-13 · conditional · novelty 6.0

A vision-language-aligned world model turns visuomotor MPC into a language-following planner that reaches 87% success on 288 unseen semantic tasks where standard VLAs drop to 22%.

UniversalVTG: A Universal and Lightweight Foundation Model for Video Temporal Grounding

cs.CV · 2026-04-09 · unverdicted · novelty 6.0

UniversalVTG is a lightweight foundation model for video temporal grounding that achieves state-of-the-art results across five benchmarks while being over 100 times smaller than recent MLLM-based methods.

Q-Zoom: Query-Aware Adaptive Perception for Efficient Multimodal Large Language Models

cs.CV · 2026-04-08 · unverdicted · novelty 6.0

Q-Zoom achieves up to 4.39x inference speedup in high-resolution MLLM scenarios via query-aware gating and region localization, matching or exceeding baseline accuracy on document and high-res benchmarks.

Telescope: Learnable Hyperbolic Foveation for Ultra-Long-Range Object Detection

cs.CV · 2026-04-07 · unverdicted · novelty 6.0

Telescope uses learnable hyperbolic foveation to deliver a 76% relative mAP gain (0.185 to 0.326) for objects beyond 250 meters while keeping overhead low.

VideoStir: Understanding Long Videos via Spatio-Temporally Structured and Intent-Aware RAG

cs.CV · 2026-04-07 · unverdicted · novelty 6.0

VideoStir introduces a spatio-temporal graph-based structure and intent-aware retrieval for long-video RAG, achieving competitive performance with SOTA methods via a new IR-600K dataset.

citing papers explorer

Showing 40 of 40 citing papers.

Is Dimensionality a Barrier for Retrieval Models? cs.LG · 2026-05-22 · unverdicted · none · ref 173 · internal anchor
Dimension d = O(m^{-2} log n) nearly achieves the optimal margin m^rd(+∞, A) for retrieval embeddings, with matching lower bounds showing d = O(k log(n/k)) suffices and is necessary for m = Θ(k^{-1/2}) on k-sparse query matrices.
Spatially Prompted Visual Trajectory Prediction for Egocentric Manipulation cs.CV · 2026-05-19 · unverdicted · none · ref 4 · internal anchor
The paper introduces SP-VTP as a new setting for egocentric manipulation, releases the EgoSPT dataset with first-frame spatial annotations, and proposes the SPOT model that outperforms non-prompted baselines on cross-scene trajectory prediction.
What Cohort INRs Encode and Where to Freeze Them cs.LG · 2026-05-08 · unverdicted · none · ref 4 · internal anchor
Optimal INR freeze depth matches highest weight stable rank layer; SAEs reveal SIREN atoms are localized while FFMLP atoms trace cohort contours with causal impact on PSNR.
VAnim: Rendering-Aware Sparse State Modeling for Structure-Preserving Vector Animation cs.CV · 2026-05-02 · unverdicted · none · ref 239 · internal anchor
VAnim creates open-domain text-to-SVG animations via sparse state updates on a persistent DOM tree, identification-first planning, and rendering-aware RL with a new 134k-example benchmark.
Large-scale Codec Avatars: The Unreasonable Effectiveness of Large-scale Avatar Pretraining cs.CV · 2026-04-02 · unverdicted · none · ref 6 · internal anchor
Pretraining on 1M wild videos followed by post-training on curated data yields high-fidelity feedforward 3D avatars that generalize across identities, clothing, and lighting with emergent relightability and loose-garment support.
Adapting MLLMs for Nuanced Video Retrieval cs.CV · 2025-12-15 · unverdicted · none · ref 9 · internal anchor
Text-only contrastive fine-tuning of an MLLM with hard negatives produces embeddings that handle temporal, negation, and multimodal nuances in video retrieval and achieves SOTA performance.
SAM 3: Segment Anything with Concepts cs.CV · 2025-11-20 · unverdicted · none · ref 12 · internal anchor
SAM 3 introduces promptable concept segmentation that doubles accuracy of prior systems on images and videos while improving standard SAM segmentation performance.
Concepts in Motion: Temporal Concept Bottleneck Model for Interpretable Video Classification cs.CV · 2025-09-25 · unverdicted · none · ref 3 · internal anchor
MoTIF adds temporal self-attention and automatic VLM-based concept discovery to concept bottleneck models for interpretable video classification, showing gains over prior global CBMs on benchmarks.
Uncovering the Latent Potential of Deep Intermediate Representations cs.LG · 2026-05-21 · unverdicted · none · ref 39 · internal anchor
Introduces LOES, a constructive spectral method to select task-discriminative subspaces from intermediate layer embeddings, and GeoReg for enforcing simplicial class geometry during fine-tuning, with reported gains increasing with model depth across modalities.
RADAR: Relative Angular Divergence Across Representations cs.LG · 2026-05-21 · unverdicted · none · ref 3 · internal anchor
RADAR is a geometrically grounded metric that predicts cross-domain transferability by comparing layer-wise representation trajectory distributions in foundation models.
DySink: Dynamic Frame Sinks for Autoregressive Long Video Generation cs.CV · 2026-05-20 · unverdicted · none · ref 1 · internal anchor
DySink uses adaptive retrieval of relevant historical frames plus a sink anomaly gate to improve dynamic degree and temporal quality in minute-long autoregressive video generation.
Improved Baselines with Representation Autoencoders cs.CV · 2026-05-18 · conditional · none · ref 6 · internal anchor
RAE v2 reaches gFID 1.06 on ImageNet-256 in 80 epochs by combining multi-layer encoder sums, complementary REPA targets, and free guidance via output reparameterization.
SparseSAM: Structured Sparsification of Activations in Segment Anything Models cs.CV · 2026-05-17 · unverdicted · none · ref 2 · internal anchor
SparseSAM achieves 2x faster inference and 2.8x memory reduction in SAM with only 0.004 mIoU loss at 0.4 density via Stripe-Sort Attention and Residual-Consistency MLP.
SegRAG: Training-Free Retrieval-Augmented Semantic Segmentation cs.CV · 2026-05-17 · unverdicted · none · ref 28 · 2 links · internal anchor
SegRAG is a training-free retrieval-augmented framework that extracts class-specific point prompts from a filtered DINOv3 feature bank to boost SAM3 semantic segmentation performance on standard and agricultural benchmarks.
OProver: A Unified Framework for Agentic Formal Theorem Proving cs.CL · 2026-05-17 · unverdicted · none · ref 98 · internal anchor
OProver-32B achieves top Pass@32 scores on MiniF2F, ProverBench, and PutnamBench by combining continued pretraining with iterative agentic proving, retrieval, SFT on repairs, and RL on unresolved cases using a 6.86M-proof dataset.
Qwen-Image-VAE-2.0 Technical Report cs.CV · 2026-05-13 · unverdicted · none · ref 2 · internal anchor
Qwen-Image-VAE-2.0 achieves state-of-the-art high-compression image reconstruction and superior diffusability for diffusion models, with a new text-rich document benchmark.
Cross-Attentive Multiview Fusion of Vision-Language Embeddings cs.CV · 2026-04-14 · unverdicted · none · ref 1 · internal anchor
CAMFusion fuses multiview 2D vision-language embeddings via cross-attention and multiview consistency self-supervision to produce better 3D semantic and instance representations, outperforming averaging and reaching SOTA on benchmarks including zero-shot out-of-domain cases.
Why Your Tokenizer Fails in Information Fusion: A Timing-Aware Pre-Quantization Fusion for Video-Enhanced Audio Tokenization eess.AS · 2026-04-13 · unverdicted · none · ref 53 · internal anchor
A timing-aware pre-quantization fusion approach integrates visual cues into audio tokenizers along the temporal axis, maintaining reconstruction quality while outperforming audio-only and prior multimodal baselines on downstream tasks.
TIPSv2: Advancing Vision-Language Pretraining with Enhanced Patch-Text Alignment cs.CV · 2026-04-13 · unverdicted · none · ref 4 · internal anchor
TIPSv2 improves dense patch-text alignment in vision-language pretraining through distillation and iBOT++ modifications, yielding models on par with or better than recent baselines on 9 tasks across 20 datasets.
Grounded World Model for Semantically Generalizable Planning cs.RO · 2026-04-13 · conditional · none · ref 5 · internal anchor
A vision-language-aligned world model turns visuomotor MPC into a language-following planner that reaches 87% success on 288 unseen semantic tasks where standard VLAs drop to 22%.
UniversalVTG: A Universal and Lightweight Foundation Model for Video Temporal Grounding cs.CV · 2026-04-09 · unverdicted · none · ref 5 · internal anchor
UniversalVTG is a lightweight foundation model for video temporal grounding that achieves state-of-the-art results across five benchmarks while being over 100 times smaller than recent MLLM-based methods.
Q-Zoom: Query-Aware Adaptive Perception for Efficient Multimodal Large Language Models cs.CV · 2026-04-08 · unverdicted · none · ref 40 · internal anchor
Q-Zoom achieves up to 4.39x inference speedup in high-resolution MLLM scenarios via query-aware gating and region localization, matching or exceeding baseline accuracy on document and high-res benchmarks.
Telescope: Learnable Hyperbolic Foveation for Ultra-Long-Range Object Detection cs.CV · 2026-04-07 · unverdicted · none · ref 2 · internal anchor
Telescope uses learnable hyperbolic foveation to deliver a 76% relative mAP gain (0.185 to 0.326) for objects beyond 250 meters while keeping overhead low.
VideoStir: Understanding Long Videos via Spatio-Temporally Structured and Intent-Aware RAG cs.CV · 2026-04-07 · unverdicted · none · ref 24 · internal anchor
VideoStir introduces a spatio-temporal graph-based structure and intent-aware retrieval for long-video RAG, achieving competitive performance with SOTA methods via a new IR-600K dataset.
Learning Robust Visual Features in Computed Tomography Enables Efficient Transfer Learning for Clinical Tasks cs.CV · 2026-04-05 · conditional · none · ref 17 · internal anchor
VoxelFM learns robust 3D CT visual features via DINO self-distillation that transfer effectively to seven clinical task categories using frozen backbones and lightweight heads, outperforming prior CT foundation models even on report generation.
From Local Matches to Global Masks: Template-Guided Instance Detection and Segmentation in Open-World Scenes cs.CV · 2026-03-03 · unverdicted · none · ref 4 · internal anchor
L2G-Det detects and segments novel object instances in open scenes by using local template patch matches to generate points that prompt an augmented SAM for global masks.
Simplicity Prevails: The Emergence of Generalizable AIGI Detection in Visual Foundation Models cs.CV · 2026-02-02 · conditional · none · ref 1 · internal anchor
Frozen features from vision foundation models enable a linear probe to outperform specialized AIGI detectors by over 30% on in-the-wild data due to emergent forgery knowledge from pre-training.
Chorus: Multi-Teacher Pretraining for Holistic 3D Gaussian Scene Encoding cs.CV · 2025-12-19 · unverdicted · none · ref 5 · internal anchor
Chorus pretrains a shared 3D Gaussian scene encoder via multi-teacher distillation to capture holistic features from high-level semantics to fine-grained structure, with strong transfer on segmentation and point-cloud tasks using far fewer scenes.
VFM-VAE: Vision Foundation Models Can Be Good Tokenizers for Latent Diffusion Models cs.CV · 2025-10-21 · unverdicted · none · ref 3 · internal anchor
VFM-VAE uses a frozen VFM directly as LDM tokenizer via a custom decoder, reaching gFID 2.22 in 80 epochs and 1.62 after 640 epochs.
Deepfake Detection that Generalizes Across Benchmarks cs.CV · 2025-08-08 · accept · none · ref 8 · internal anchor
GenD achieves state-of-the-art average cross-dataset AUROC in deepfake detection by parameter-efficient adaptation of a foundational vision encoder with hyperspherical manifold enforcement via L2 normalization and metric learning.
V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning cs.AI · 2025-06-11 · unverdicted · none · ref 7 · internal anchor
V-JEPA 2 pre-trained on massive unlabeled video achieves strong results on motion understanding and action anticipation, SOTA video QA at 8B scale, and enables zero-shot robotic planning on Franka arms using only 62 hours of unlabeled robot video.
SmolVLA: A Vision-Language-Action Model for Affordable and Efficient Robotics cs.LG · 2025-06-02 · unverdicted · none · ref 6 · internal anchor
SmolVLA is a small efficient VLA model that achieves performance comparable to 10x larger models while training on one GPU and deploying on consumer hardware via community data and chunked asynchronous action prediction.
Enabling High Error Tolerance in Satellite Video Transmissions by Generative Semantic Communication eess.SP · 2026-04-28 · unverdicted · none · ref 23 · internal anchor
A generative semantic communication method for satellite video achieves 2.5 dB higher PSNR than conventional semantic comms at 45% error rate and remains functional above 80% error by combining semantic encoding with generative reconstruction.
Sapiens2 cs.CV · 2026-04-23 · unverdicted · none · ref 8 · internal anchor
Sapiens2 improves pretraining, data scale, and architecture over its predecessor to set new state-of-the-art results on human pose estimation, body-part segmentation, normal estimation, and new tasks like pointmap and albedo estimation.
Concrete Jungle: Towards Concreteness Paved Contrastive Negative Mining for Compositional Understanding cs.LG · 2026-04-14 · unverdicted · none · ref 46 · internal anchor
Using lexical concreteness to guide contrastive negative mining and a new margin-based Cement loss, the Slipform framework reaches state-of-the-art on compositional benchmarks for vision-language models.
Revisiting Compositionality in Dual-Encoder Vision-Language Models: The Role of Inference cs.CV · 2026-04-13 · unverdicted · none · ref 6 · internal anchor
Dual-encoder VLMs gain robust compositional generalization by learning localized alignments from frozen patch and token embeddings instead of using global similarity.
SegEarth-OV3: Exploring SAM 3 for Open-Vocabulary Semantic Segmentation in Remote Sensing Images cs.CV · 2025-12-09 · unverdicted · none · ref 4 · internal anchor
SAM 3 can be applied training-free to remote sensing open-vocabulary segmentation and change detection by fusing its semantic and instance heads and filtering with presence scores.
Position: Life-Logging Video Streams Make the Privacy-Utility Trade-off Inevitable cs.CV · 2026-05-11 · unverdicted · none · ref 7 · internal anchor
Life-logging video streams create an inevitable privacy-utility trade-off that is a foundational challenge for always-on AI systems.
CoCo-SAM3: Harnessing Concept Conflict in Open-Vocabulary Semantic Segmentation cs.CV · 2026-04-21 · unverdicted · none · ref 3 · internal anchor
CoCo-SAM3 improves SAM3 by aligning evidence from synonymous prompts for concept consistency and then running inter-class competition on a unified scale to reduce mask overlaps.
Adaptive Forensic Feature Refinement via Intrinsic Importance Perception cs.CV · 2026-04-18 · unverdicted · none · ref 6 · internal anchor
I2P adaptively selects the most discriminative layers from visual foundation models for synthetic image detection and constrains task updates to low-sensitivity parameter subspaces to improve specificity without harming generalization.

Perception Encoder: The best visual embeddings are not at the output of the network

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer