DINOv2: Learning Robust Visual Features without Supervision
317 papers indexed on Pith cite this work.
abstract
The recent breakthroughs in natural language processing for model pretraining on large quantities of data have opened the way for similar foundation models in computer vision. These models could greatly simplify the use of images in any system by producing all-purpose visual features, i.e., features that work across image distributions and tasks without finetuning. This work shows that existing pretraining methods, especially self-supervised methods, can produce such features if trained on enough curated data from diverse sources. We revisit existing approaches and combine different techniques to scale our pretraining in terms of data and model size. Most of the technical contributions aim at accelerating and stabilizing the training at scale. In terms of data, we propose an automatic pipeline to build a dedicated, diverse, and curated image dataset instead of uncurated data, as typically done in the self-supervised literature. In terms of models, we train a ViT model (Dosovitskiy et al., 2020) with 1B parameters and distill it into a series of smaller models that surpass the best available all-purpose features, OpenCLIP (Ilharco et al., 2021), on most of the benchmarks at image and pixel levels.
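The distillation step mentioned at the end of the abstract can be pictured as a feature-matching objective: a frozen teacher (the 1B-parameter ViT) produces target features, and a smaller student is trained to reproduce them. The sketch below is a minimal illustration of that idea using a cosine-distance loss on L2-normalized features; it is not DINOv2's actual recipe, which distills through the paper's full self-supervised objective. All names and shapes here are assumptions for illustration.

```python
import numpy as np

def feature_distillation_loss(teacher_feats: np.ndarray,
                              student_feats: np.ndarray) -> float:
    """Mean cosine distance between L2-normalized teacher and student features.

    Both inputs are (batch, dim) arrays of per-image features. Returns a scalar
    in [0, 2]; 0 means the student matches the teacher exactly up to scale.
    """
    t = teacher_feats / np.linalg.norm(teacher_feats, axis=1, keepdims=True)
    s = student_feats / np.linalg.norm(student_feats, axis=1, keepdims=True)
    return float(np.mean(1.0 - np.sum(t * s, axis=1)))

# Toy check: a student that nearly copies the teacher gets a near-zero loss.
rng = np.random.default_rng(0)
teacher = rng.standard_normal((4, 384))   # 384 = ViT-S/14 feature width
student = teacher + 0.01 * rng.standard_normal((4, 384))
```

In practice the teacher's weights stay frozen and only the student receives gradients; the loss above would be minimized over the student's parameters across a large batch stream.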
representative citing papers
Image-to-3D models successfully generate harmful geometries in most cases, with fewer than 0.3% caught by commercial filters; existing safeguards are weak, but a stacked defense cuts harmful outputs to under 1% at the cost of an 11% false-positive rate.
neuralCAD-Edit benchmark shows even the best foundation model (GPT 5.2) scores 53% lower than human CAD experts in acceptance trials for multimodal-instructed 3D model edits.
The work creates the first dataset and baseline for generating emission textures on 3D objects to reproduce glowing materials from input images.
MMMU-Pro is a stricter multimodal benchmark that removes text-only solvable questions, augments options, and requires reading text from images, yielding substantially lower model scores of 16.8-26.9%.
RotVLA models latent actions as continuous SO(n) rotations with triplet-frame supervision and flow-matching to reach 98.2% success on LIBERO and 89.6%/88.5% on RoboTwin2.0 using a 1.7B-parameter model.
Engram in AR image generation saves backbone FLOPs but trails pure AR baselines in FID and behaves as a gated side-pathway rather than a content-addressed retriever.
SMA uses a submodular mutual information objective for data subset selection to deliver competitive zero-shot classification and retrieval performance on CLIP benchmarks with only tens of thousands of samples, orders of magnitude fewer than standard approaches.
Embedding Temporal Logic enables runtime monitoring of temporally extended perceptual behaviors by defining predicates via distances between observed and reference embeddings in learned spaces, with conformal calibration for reliable evaluation.
CausalCine enables real-time causal autoregressive multi-shot video generation via multi-shot training, content-aware memory routing for coherence, and distillation to few-step inference.
Uni-AdGen uses a unified autoregressive framework with foreground perception, instruction tuning, and coarse-to-fine preference modules to generate personalized image-text ads from noisy user behaviors, outperforming baselines on a new PAd1M dataset.
DreamAvoid uses a Dream Trigger, Action Proposer, and Dream Evaluator trained on success/failure/boundary data to let VLA policies avoid critical-phase failures via test-time future dreaming.
PointGS achieves semantic-consistent unsupervised 3D point cloud segmentation by using 3D Gaussian Splatting to bridge discrete points and continuous 2D images for distilling SAM semantics.
STRIDE boosts diversity in one-step diffusion models by injecting PCA-aligned pink noise into transformer features while preserving text alignment and quality.
A liveness-based Bellman operator enables conservative offline policy evaluation for manipulation tasks by encoding task progression and reducing truncation bias from finite horizons.
DRoRAE adaptively fuses multi-layer features from vision encoders via energy-constrained routing to enrich visual tokens, cutting rFID from 0.57 to 0.29 and generation FID from 1.74 to 1.65 on ImageNet-256 while revealing a log-linear scaling law with fusion capacity.
VEGA improves spatial reasoning in VLA models for robotics by aligning visual encoder features with 3D-supervised DINOv2 representations via a temporary projector and cosine similarity loss.
Raw CSD cosine similarity produces negative discrimination gaps for many artists and does not support absolute style-fidelity interpretation, but CSLS readout on frozen backbones reduces failures and improves AUC.
PaceVGGT reduces VGGT inference latency by up to 5.1x on ScanNet-50 via pre-AA token pruning with a distilled Token Scorer, per-frame keep budgets, adaptive merge/prune, and feature-guided restoration, while preserving reconstruction quality on ScanNet-50 and 7-Scenes.
MPD²-Router is a dual-head deferral router that uses mask-aware Gumbel-sigmoid gating, asymmetric cost-sensitive training, and rank-majorization regularization to lower clinical cost and raise MCC versus AI-only baselines while balancing expert utilization across three glaucoma cohorts.
Self-supervised monocular depth estimation improves in low-texture regions by using distance transforms on jointly estimated pre-semantic contours to create more informative loss signals.
The optimal INR freeze depth matches the layer with the highest weight stable rank; SAEs reveal that SIREN atoms are localized while FFMLP atoms trace cohort contours, with causal impact on PSNR.
Temporal information in Video-LLMs is encoded well by video-centric encoders but disrupted by standard projectors; time-preserved MLPs plus AoT supervision yield 98.1% accuracy on arrow-of-time and gains on other temporal tasks.
SplatWeaver dynamically allocates Gaussian primitives via cardinality experts and pixel-level routing guided by high-frequency cues for improved generalizable novel view synthesis.
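A recurring pattern in the citing papers above (e.g. the VEGA entry, and the lidar-distillation work listed further below) is aligning another encoder's features to frozen DINOv2 targets through a learned projector and a cosine-similarity loss. The sketch below shows the shape of that objective; function names, dimensions, and the linear projector are illustrative assumptions, not any cited paper's exact implementation.

```python
import numpy as np

def projector_alignment_loss(encoder_feats: np.ndarray,
                             target_feats: np.ndarray,
                             W: np.ndarray) -> float:
    """Project encoder features with W, then penalize 1 - cosine similarity
    against frozen target features (e.g. DINOv2 outputs), batch-averaged."""
    proj = encoder_feats @ W
    p = proj / np.linalg.norm(proj, axis=1, keepdims=True)
    t = target_feats / np.linalg.norm(target_feats, axis=1, keepdims=True)
    return float(np.mean(1.0 - np.sum(p * t, axis=1)))

# Toy check: if the targets are linearly reachable from the encoder features,
# the true projection drives the loss to zero while a random one does not.
rng = np.random.default_rng(1)
feats = rng.standard_normal((8, 256))      # hypothetical encoder features
W_true = rng.standard_normal((256, 384))   # hypothetical projector
W_rand = rng.standard_normal((256, 384))
targets = feats @ W_true
```

During alignment only the projector (and optionally the encoder) is trained; the DINOv2 target branch stays frozen, and some works discard the projector after training, keeping only the better-aligned encoder.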
citing papers explorer
-
neuralCAD-Edit: An Expert Benchmark for Multimodal-Instructed 3D CAD Model Editing
neuralCAD-Edit benchmark shows even the best foundation model (GPT 5.2) scores 53% lower than human CAD experts in acceptance trials for multimodal-instructed 3D model edits.
-
Towards Realistic 3D Emission Materials: Dataset, Baseline, and Evaluation for Emission Texture Generation
The work creates the first dataset and baseline for generating emission textures on 3D objects to reproduce glowing materials from input images.
-
Does Engram Do Memory Retrieval in Autoregressive Image Generation?
Engram in AR image generation saves backbone FLOPs but trails pure AR baselines in FID and behaves as a gated side-pathway rather than a content-addressed retriever.
-
CausalCine: Real-Time Autoregressive Generation for Multi-Shot Video Narratives
CausalCine enables real-time causal autoregressive multi-shot video generation via multi-shot training, content-aware memory routing for coherence, and distillation to few-step inference.
-
Design Your Ad: Personalized Advertising Image and Text Generation with Unified Autoregressive Models
Uni-AdGen uses a unified autoregressive framework with foreground perception, instruction tuning, and coarse-to-fine preference modules to generate personalized image-text ads from noisy user behaviors, outperforming baselines on a new PAd1M dataset.
-
PointGS: Semantic-Consistent Unsupervised 3D Point Cloud Segmentation with 3D Gaussian Splatting
PointGS achieves semantic-consistent unsupervised 3D point cloud segmentation by using 3D Gaussian Splatting to bridge discrete points and continuous 2D images for distilling SAM semantics.
-
STRIDE: Training-Free Diversity Guidance via PCA-Directed Feature Perturbation in Single-Step Diffusion Models
STRIDE boosts diversity in one-step diffusion models by injecting PCA-aligned pink noise into transformer features while preserving text alignment and quality.
-
Beyond the Last Layer: Multi-Layer Representation Fusion for Visual Tokenization
DRoRAE adaptively fuses multi-layer features from vision encoders via energy-constrained routing to enrich visual tokens, cutting rFID from 0.57 to 0.29 and generation FID from 1.74 to 1.65 on ImageNet-256 while revealing a log-linear scaling law with fusion capacity.
-
When Style Similarity Scores Fail: Diagnosing Raw CSD Cosine in Artist-Style Evaluation
Raw CSD cosine similarity produces negative discrimination gaps for many artists and does not support absolute style-fidelity interpretation, but CSLS readout on frozen backbones reduces failures and improves AUC.
-
PaceVGGT: Pre-Alternating-Attention Token Pruning for Visual Geometry Transformers
PaceVGGT reduces VGGT inference latency by up to 5.1x on ScanNet-50 via pre-AA token pruning with a distilled Token Scorer, per-frame keep budgets, adaptive merge/prune, and feature-guided restoration, while preserving reconstruction quality on ScanNet-50 and 7-Scenes.
-
Tracing the Arrow of Time: Diagnosing Temporal Information Flow in Video-LLMs
Temporal information in Video-LLMs is encoded well by video-centric encoders but disrupted by standard projectors; time-preserved MLPs plus AoT supervision yield 98.1% accuracy on arrow-of-time and gains on other temporal tasks.
-
SplatWeaver: Learning to Allocate Gaussian Primitives for Generalizable Novel View Synthesis
SplatWeaver dynamically allocates Gaussian primitives via cardinality experts and pixel-level routing guided by high-frequency cues for improved generalizable novel view synthesis.
-
LENS: Low-Frequency Eigen Noise Shaping for Efficient Diffusion Sampling
LENS shapes low-frequency eigen noise with a lightweight network to enable efficient, high-quality sampling in distilled diffusion models.
-
From Pixels to Primitives: Scene Change Detection in 3D Gaussian Splatting
GS-DIFF detects changes in 3D Gaussian Splatting scenes by direct primitive attribute comparison with anisotropic drift models and observability terms, outperforming render-then-compare baselines by ~17% mIoU.
-
Learning Visual Feature-Based World Models via Residual Latent Action
RLA-WM predicts residual latent actions via flow matching to create visual feature world models that outperform prior feature-based and diffusion approaches while enabling offline video-based robot RL.
-
A Unified Benchmark for Multi-Frame Image Restoration under Severe Refractive Warping
Presents the first large-scale benchmark for multi-frame geometric distortion removal in videos under severe refractive warping, using real and synthetic data across four distortion levels and evaluating classical and learning-based methods including a proposed diffusion-based V-cache.
-
Large-Scale High-Quality 3D Gaussian Head Reconstruction from Multi-View Captures
HeadsUp maps multi-view captures to UV-parameterized 3D Gaussians on a template via an encoder-decoder, achieving state-of-the-art quality and generalization after training on more than 10,000 subjects.
-
ReLeaf: Benchmarking Leaf Segmentation across Domains and Species
A YOLO26 model trained on four leaf segmentation datasets reaches 83.9% mean mAP50-95 on their test sets but only 40.2% on a new 23-species benchmark, revealing substantial cross-domain generalization gaps.
-
Automated In-the-Wild Data Collection for Continual AI Generated Image Detection
An automated fact-check-based pipeline for in-the-wild AI image data, when mixed with generator data in continual learning, lets detectors adapt to new generators while avoiding forgetting and delivers 8-9% accuracy gains on two existing models.
-
VAnim: Rendering-Aware Sparse State Modeling for Structure-Preserving Vector Animation
VAnim creates open-domain text-to-SVG animations via sparse state updates on a persistent DOM tree, identification-first planning, and rendering-aware RL with a new 134k-example benchmark.
-
Towards Visual Query Localization in the 3D World
The authors release the 3DVQL benchmark for 3D multimodal visual query localization and show that a lift-and-attention fusion module outperforms prior fusion baselines on it.
-
Rethinking Model Selection in VLM Through the Lens of Gromov-Wasserstein Distance
Gromov-Wasserstein distance between modalities provides a stronger, inference-only predictor of final VLM performance than conventional encoder metrics, backed by theory linking it to cross-modal learnability and verified across 60+ training runs.
-
Rethink MAE with Linear Time-Invariant Dynamics
Token order in frozen visual representations is exploitable via SSM-based LTI probes, revealing pre-training-dependent heterogeneity that fixed pooling misses.
-
AirZoo: A Unified Large-Scale Dataset for Grounding Aerial Geometric 3D Vision
AirZoo is a new large-scale synthetic dataset for aerial 3D vision that improves state-of-the-art models on image retrieval, cross-view matching, and 3D reconstruction when used for fine-tuning.
-
Sparsity as a Key: Unlocking New Insights from Latent Structures for Out-of-Distribution Detection
Sparse autoencoders on ViT class tokens reveal stable Class Activation Profiles for in-distribution data, enabling OOD detection via divergence from core energy profiles.
-
LearnPruner: Rethinking Attention-based Token Pruning in Vision Language Models
LearnPruner prunes vision tokens to 5.5% of the original count while retaining about 95% of VLM performance and delivering 3.2 times faster inference by fixing attention sink in encoders and using unbiased middle-layer attention in LLMs.
-
VitaminP: cross-modal learning enables whole-cell segmentation from routine histology
VitaminP uses paired H&E-mIF data to train a model that transfers molecular boundary information, enabling accurate whole-cell segmentation directly from routine H&E histology across 34 cancer types.
-
MuSS: A Large-Scale Dataset and Cinematic Narrative Benchmark for Multi-Shot Subject-to-Video Generation
MuSS is a new movie-sourced dataset and benchmark that enables AI models to generate multi-shot videos with improved narrative coherence and subject identity preservation.
-
WildSplatter: Feed-forward 3D Gaussian Splatting with Appearance Control from Unconstrained Images
WildSplatter jointly learns 3D Gaussians and appearance embeddings from unconstrained photo collections to enable fast feed-forward reconstruction and flexible lighting control in 3D Gaussian Splatting.
-
Render-in-the-Loop: Vector Graphics Generation via Visual Self-Feedback
Render-in-the-Loop reformulates SVG generation as a step-wise visual-context-aware process using self-feedback from rendered intermediate states, VSF training, and RaV inference to outperform baselines on MMSVGBench for Text-to-SVG and Image-to-SVG.
-
Evaluating Remote Sensing Image Captions Beyond Metric Biases
Unfine-tuned MLLMs outperform fine-tuned models on remote sensing image captioning when captions are scored by their ability to reconstruct the source image, and a training-free self-correction method achieves SOTA performance.
-
TransSplat: Unbalanced Semantic Transport for Language-Driven 3DGS Editing
TransSplat uses unbalanced semantic transport to match edited 2D evidence with 3D Gaussians and recover a shared 3D edit field, yielding better local accuracy and structural consistency than prior view-consistency methods.
-
Generative Texture Filtering
A two-stage fine-tuning strategy on pre-trained generative models enables effective texture filtering that outperforms prior methods on challenging cases.
-
Feasibility of Indoor Frame-Wise Lidar Semantic Segmentation via Distillation from Visual Foundation Model
Distillation from visual foundation models to lidar enables frame-wise indoor semantic segmentation without manual annotations, achieving up to 56% mIoU on pseudo labels and 36% on real labels.
-
MultiWorld: Scalable Multi-Agent Multi-View Video World Models
MultiWorld is a scalable framework for multi-agent multi-view video world models that improves controllability and consistency over single-agent baselines in game and robot tasks.
-
CFSR: Geometry-Conditioned Shadow Removal via Physical Disentanglement
CFSR reframes shadow removal as a physics-constrained process using geometric and semantic priors from depth, DINO, CLIP, and frequency decoupling to achieve claimed state-of-the-art results.
-
MU-GeNeRF: Multi-view Uncertainty-guided Generalizable Neural Radiance Fields for Distractor-aware Scene
MU-GeNeRF combines source-view and target-view uncertainties via a heteroscedastic loss to enable distractor-aware generalizable NeRF reconstruction that matches scene-specific methods.
-
DifFoundMAD: Foundation Models meet Differential Morphing Attack Detection
DifFoundMAD improves differential morphing attack detection by replacing traditional embeddings with those from vision foundation models and applying class-balanced lightweight fine-tuning, cutting high-security error rates from 6.16% to 2.17%.
-
Coevolving Representations in Joint Image-Feature Diffusion
CoReDi coevolves semantic representations with the diffusion model via a jointly learned linear projection stabilized by stop-gradient, normalization, and regularization, yielding faster convergence and higher sample quality than fixed-representation baselines.
-
Flow of Truth: Proactive Temporal Forensics for Image-to-Video Generation
Flow of Truth introduces a learnable forensic template and template-guided flow module that follows pixel motion to enable temporal tracing in image-to-video generation.
-
Beyond Prompts: Unconditional 3D Inversion for Out-of-Distribution Shapes
Text-to-3D models lose prompt sensitivity for out-of-distribution shapes due to sink traps but retain geometric diversity via unconditional priors, enabling a decoupled inversion method for robust editing.
-
OmniGCD: Abstracting Generalized Category Discovery for Modality Agnosticism
OmniGCD trains a Transformer once on synthetic data to enable zero-shot generalized category discovery across 16 datasets in four modalities without any dataset-specific fine-tuning.
-
OneHOI: Unifying Human-Object Interaction Generation and Editing
OneHOI unifies HOI generation and editing in one conditional diffusion transformer using role-aware tokens, structured attention, and joint training on mixed datasets to reach SOTA on both tasks.
-
ASTRA: Enhancing Multi-Subject Generation with Retrieval-Augmented Pose Guidance and Disentangled Position Embedding
ASTRA disentangles subject identity from pose structure in diffusion transformers via retrieval-augmented pose guidance, asymmetric EURoPE embeddings, and a DSM adapter to improve multi-subject generation.
-
VideoFlexTok: Flexible-Length Coarse-to-Fine Video Tokenization
VideoFlexTok introduces coarse-to-fine variable-length video tokenization that enables 5x smaller models to match grid-based generation quality and supports 10-second videos with 8x fewer tokens.
-
IAD-Unify: A Region-Grounded Unified Model for Industrial Anomaly Segmentation, Understanding, and Generation
IAD-Unify unifies industrial anomaly segmentation, region-grounded language understanding, and mask-guided generation in one framework using DINOv2 token injection into Qwen3.5, supported by the new Anomaly-56K dataset of 59,916 images.
-
VidTAG: Temporally Aligned Video to GPS Geolocalization with Denoising Sequence Prediction at a Global Scale
VidTAG achieves fine-grained global video-to-GPS geolocalization via temporal frame alignment and denoising sequence refinement, reporting 20% gains at 1 km over GeoCLIP and 25% on CityGuessr68k.
-
Scene Change Detection with Vision-Language Representation Learning
LangSCD fuses VLM-generated text descriptions with visual features and adds geometric-semantic matching to improve scene change detection, while releasing the NYC-CD dataset of 8122 New York City image pairs with multiclass annotations.
-
Any 3D Scene is Worth 1K Tokens: 3D-Grounded Representation for Scene Generation at Scale
A 3D-grounded autoencoder and diffusion transformer allow direct generation of 3D scenes in an implicit latent space using a fixed 1K-token representation for arbitrary views and resolutions.
-
Script-a-Video: Deep Structured Audio-visual Captions via Factorized Streams and Relational Grounding
MTSS replaces monolithic video captions with factorized streams and relational grounding, yielding reported gains in understanding benchmarks and generation consistency.