{"total":27,"items":[{"citing_arxiv_id":"2605.20808","ref_index":8,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Spatial Gram Alignment for Ultra-High-Resolution Image Synthesis","primary_cat":"cs.CV","submitted_at":"2026-05-20T06:59:12+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Spatial Gram Alignment aligns internal self-similarities of LDM features with foundation priors to reconcile global structure and fine details in ultra-high-resolution text-to-image synthesis.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.20624","ref_index":65,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Accelerating Video Inverse Problem Solvers with Autoregressive Diffusion Models","primary_cat":"cs.CV","submitted_at":"2026-05-20T02:16:30+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"AVIS applies autoregressive diffusion models to video inverse problems by streaming restoration with measurement-consistent initialization, reducing latency from 114s to 4s and raising throughput to 1.18 FPS (or 5.91 FPS in the Flash variant).","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.19750","ref_index":2,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"CPC-VAR:Continual Personalized and Compositional Generation in Visual Autoregressive Models","primary_cat":"cs.CV","submitted_at":"2026-05-19T12:18:15+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"CPC-VAR adds Gradient-based Concept Neuron Selection for continual single-concept learning and a context-aware multi-branch composition strategy to reduce forgetting and entanglement in VAR-based personalized image generation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.16147","ref_index":5,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Registers Matter for Pixel-Space Diffusion Transformers","primary_cat":"cs.CV","submitted_at":"2026-05-15T16:27:10+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Register tokens enhance pixel-space DiT training and output quality via cleaner high-noise feature maps, and a dual-stream design adds further gains with little overhead.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.15923","ref_index":21,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Invaria: Learning Scale and Density Invariance in Point Clouds via Next-Resolution Prediction","primary_cat":"cs.CV","submitted_at":"2026-05-15T13:06:41+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Invaria trains point cloud encoders with next-resolution prediction to learn scale and density invariant features, yielding higher mIoU on ScanNet under lower resolution and scaled objects while using a smaller model.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.15618","ref_index":6,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Latent Video Prediction Learns Better World Models","primary_cat":"cs.CV","submitted_at":"2026-05-15T04:59:30+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Latent prediction video models exhibit a distinct robustness profile across corruption, occlusion, fine-grained discrimination, and temporal sensitivity compared to other self-supervised video models when used as world models.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.14486","ref_index":4,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Reduce the Artifacts Bias for More Generalizable AI-Generated Image Detection","primary_cat":"cs.CV","submitted_at":"2026-05-14T07:26:36+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SEF introduces GAN upsampling for diverse artifacts and expert fusion to reduce domain interference, yielding stronger generalization on 13 benchmarks for AI-generated image detection.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.13013","ref_index":38,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"JEDI: Joint Embedding Diffusion World Model for Online Model-Based Reinforcement Learning","primary_cat":"cs.LG","submitted_at":"2026-05-13T05:07:32+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"JEDI is the first online end-to-end latent diffusion world model that trains latents from denoising loss rather than reconstruction, achieving competitive Atari100k results with 43% less VRAM and over 3x faster sampling than pixel diffusion baselines.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.12678","ref_index":11,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"No One Knows the State of the Art in Geospatial Foundation Models","primary_cat":"cs.CV","submitted_at":"2026-05-12T19:29:51+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":6.0,"formal_verification":"none","one_line_summary":"An audit of 152 papers reveals that geospatial foundation models lack standardized evaluations, training controls, and weight releases, so no one knows the state of the art.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.12496","ref_index":5,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"CausalCine: Real-Time Autoregressive Generation for Multi-Shot Video Narratives","primary_cat":"cs.CV","submitted_at":"2026-05-12T17:59:51+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"CausalCine enables real-time causal autoregressive multi-shot video generation via multi-shot training, content-aware memory routing for coherence, and distillation to few-step inference.","context_count":1,"top_context_role":"method","top_context_polarity":"use_method","context_text":"shot interactions, viewpoint changes, and long temporal gaps. Following VBench [19], we evaluate visual quality, prompt following, temporal consistency, long-range consistency, and shot structure. Specifically, we report LAION aesthetic score [36], shot-level ViCLIP text-video similarity [46, 45], within-shot subject/background consistency using DINO [ 5] and CLIP [ 35], inter-shot character consistency using DINOv2 [32] on matched pairs, and shot-cut accuracy (SCA) [31] by matching TransNetV2 [39]-detected cuts to target boundaries. To ensure a fair comparison, all baselines generate videos under identical settings as ours, using the same set of prompts, resolution, and length. 6 0-6s6-12s12-18s18-24s"},{"citing_arxiv_id":"2605.12305","ref_index":2,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Images in Sentences: Scaling Interleaved Instructions for Unified Visual Generation","primary_cat":"cs.CV","submitted_at":"2026-05-12T15:54:49+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"INSET embeds images as native tokens in interleaved instructions, outperforming prior methods on multi-image consistency and text alignment as complexity grows.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"Existing benchmarks, such as DreamBench++ [26] and OmniContext [41], often lack the complexity required for robust evaluation due to their limited reference images and simple spatial relationships. To address this gap, we introduceInterleaveBench, a rigorous benchmark designed for complex multi-image scenarios. Dataset Curation.We source high-quality reference entities from DreamBench++ [26]. For each test case, we sample N∈ [2, 5]distinct images and employ a VLM to filter for semantic compatibility. We then generate intricate interleaved instructions that mandate logical spatial reasoning and adaptive attribute modification, rather than simple composition. To ensure quality, all samples undergo rigorous human verification to filter out unnatural or conflicting prompts. Evaluation Protocol."},{"citing_arxiv_id":"2605.10661","ref_index":2,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"bViT: Investigating Single-Block Recurrence in Vision Transformers for Image Recognition","primary_cat":"cs.CV","submitted_at":"2026-05-11T14:43:36+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"A 12-step single-block recurrent ViT-B reaches accuracy comparable to a standard ViT-B on ImageNet-1K while using an order of magnitude fewer parameters.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"the representation space is sufficiently wide. 1 Introduction Transformers have become the central architecture in modern deep learning, achieving strong per- formance across language, vision and multimodal tasks [ 28, 37, 41]. In computer vision, Vision Transformers (ViTs) have proven particularly effective for image classification, transfer learning and visual representation learning [2, 7, 13, 36]. Standard ViTs process an image through a sequence of transformer blocks that have the same architectural form but independently learned parameters. This design is highly effective, but it also raises a basic question: how much of the performance of deep ViTs requires layer specific transformations and how much can be recovered by repeated refinement"},{"citing_arxiv_id":"2605.07915","ref_index":6,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"What Matters for Diffusion-Friendly Latent Manifold? Prior-Aligned Autoencoders for Latent Diffusion","primary_cat":"cs.CV","submitted_at":"2026-05-08T15:52:51+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Prior-Aligned AutoEncoders shape latent manifolds with spatial coherence, local continuity, and global semantics to improve latent diffusion, achieving SOTA gFID 1.03 on ImageNet 256x256 with up to 13x faster convergence.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"InInternational Conference on Learning Representations. [4] Tianci Bi, Xiaoyi Zhang, Yan Lu, and Nanning Zheng. Vision foundation models can be good tokenizers for latent diffusion models.arXiv preprint arXiv:2510.18457, 2025. [5] Ramón Calvo-González and François Fleuret. Laminating representation autoencoders for efficient diffusion, 2026. URLhttps://arxiv.org/abs/2602.04873. [6] Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers. InProceedings of the IEEE/CVF international conference on computer vision, pages 9650-9660, 2021. [7] Hun Chang, Byunghee Cha, and Jong Chul Ye. Dino-sae: Dino spherical autoencoder for"},{"citing_arxiv_id":"2605.07338","ref_index":10,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"ShellfishNet: A Domain-Specific Benchmark for Visual Recognition of Marine Molluscs","primary_cat":"cs.CV","submitted_at":"2026-05-08T06:42:55+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"ShellfishNet is a new benchmark of 8,691 images across 32 mollusc taxa for evaluating vision models on real-world underwater ecological monitoring tasks including robustness to degradation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.07257","ref_index":5,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Adaptive Subspace Projection for Generative Personalization","primary_cat":"cs.CV","submitted_at":"2026-05-08T05:24:52+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"A training-free adaptive subspace projection method mitigates semantic collapsing in generative personalization by isolating and adjusting drift in a low-dimensional subspace using the stable pre-trained embedding as anchor.","context_count":1,"top_context_role":"method","top_context_polarity":"use_method","context_text":"For prompt fidelity, we report CLIP-T in two settings: CLIP-Tf , computed with the full prompt ⌊p, c⌋, and CLIP-Tp, computed with the context prompt p only. For subject fidelity, we report CLIP-I, which measures similarity between generated images and reference subject images using CLIP visual features [ 28, 12]. We also report DINO image-image similarity in a self-supervised feature space [5]. Although these metrics are commonly used for personalized generation, they have known limitations. We discuss these limitations in Section A.1 and provide human evaluation to better assess perceptual quality, subject preservation, and prompt consistency. 5.2 Evaluation Results We conducted extensive experiments to validate the effectiveness of our proposed method, AdaptSP,"},{"citing_arxiv_id":"2605.07055","ref_index":10,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Pan-FM: A Pan-Organ Foundation Model with Saliency-Guided Masking for Missing Robustness","primary_cat":"cs.CV","submitted_at":"2026-05-08T00:04:25+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Pan-FM learns balanced representations across seven organs by adaptively masking dominant organs during pre-training, yielding stronger disease prediction and missing-organ robustness than single-organ or naive multimodal baselines on UK Biobank.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Self-Supervised Learning.Self-supervised learning (SSL) serves as the foundation for FM develop- ment. Contrastive learning methods use positive and negative pairs [13, 15, 24, 37, 56]. Negative-free SSL methods rely on architectural asymmetry [ 3, 4, 14, 23] or covariance regularization [ 7, 60]. Recent methods use teacher-student momentum updates [10, 35] with self-distillation, and masked image modeling reconstructs masked patches [5, 25, 57, 58, 64]. While recent extensions explore adaptive image-patch masking strategies [27, 29, 45], Pan-FM extends them to the organ level via organ saliency-guided multinomial sampling, which mitigates dominant-organ shortcut learning under missing-organ scenarios."},{"citing_arxiv_id":"2605.06509","ref_index":35,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"FreeSpec: Training-Free Long Video Generation via Singular-Spectrum Reconstruction","primary_cat":"cs.CV","submitted_at":"2026-05-07T16:21:34+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"FreeSpec uses SVD-based spectral reconstruction to fuse global low-rank and local high-rank features, reducing content drift and preserving temporal dynamics in long video generation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.05206","ref_index":3,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Taming Outlier Tokens in Diffusion Transformers","primary_cat":"cs.CV","submitted_at":"2026-05-06T17:59:42+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Outlier tokens in DiTs are addressed with Dual-Stage Registers, which reduce artifacts and improve image generation on ImageNet and text-to-image tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.04943","ref_index":15,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"DART: A Vision-Language Foundation Model for Comprehensive Rope Condition Monitoring","primary_cat":"cs.CV","submitted_at":"2026-05-06T14:12:02+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"DART is a cross-modal foundation model that delivers rope damage classification, severity regression, and few-shot recognition from a single frozen representation trained on 4270 images across 14 damage classes.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.11389","ref_index":43,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"ConvFormer3D-TAP: Phase/Uncertainty-Aware Front-End Fusion for Cine CMR View Classification Pipelines","primary_cat":"cs.CV","submitted_at":"2026-04-13T12:29:16+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"ConvFormer3D-TAP classifies six cine CMR views at 96% accuracy using 3D conv tokenization, multiscale attention, and uncertainty-aware multi-clip fusion on 150k sequences.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.02509","ref_index":34,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Rapidly deploying on-device eye tracking by distilling visual foundation models","primary_cat":"cs.CV","submitted_at":"2026-04-02T21:07:17+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"DistillGaze reduces median gaze error by 58.62% on a 2000+ participant dataset by distilling foundation models into a 256K-parameter on-device model using synthetic labeled data and unlabeled real data.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"parameter count for ConvNeXt-S and ViT-B backbones. Linear probing (gray) performs poorly despite high parameter counts. Synthetic fine-tuning and our optimization substantially reduce error. Distilled on- device students (left) approach optimized VFM accuracy with 100-200×fewer parameters. Teacher loss function.Conventional self-supervised methods [34, 35] employ centering and prototype- based soft clustering to prevent representation col- lapse. However, we find that a simple MSE regression loss substantially outperforms the DINO objective in our setting (Table 3). We attribute this to two key factors. First, soft clustering encourages discrete, categorical representations that may limit the fine-"},{"citing_arxiv_id":"2603.02210","ref_index":11,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"HiFi-Inpaint: Towards High-Fidelity Reference-Based Inpainting for Generating Detail-Preserving Human-Product Images","primary_cat":"cs.CV","submitted_at":"2026-03-02T18:59:36+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"HiFi-Inpaint delivers state-of-the-art detail-preserving human-product images by adding Shared Enhancement Attention and Detail-Aware Loss to reference-based inpainting on a new 40K dataset.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2511.15572","ref_index":28,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"From Per-Image Low-Rank to Encoding Mismatch: Rethinking Feature Distillation in Vision Transformers","primary_cat":"cs.CV","submitted_at":"2025-11-19T16:03:21+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2510.20512","ref_index":9,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Adversarial Concept Distillation for One-Step Diffusion Personalization","primary_cat":"cs.CV","submitted_at":"2025-10-23T12:56:33+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"OPAD enables reliable high-quality personalization of one-step diffusion models via multi-step teacher distillation combined with adversarial alignment losses.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2507.01925","ref_index":6,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"A Survey on Vision-Language-Action Models: An Action Tokenization Perspective","primary_cat":"cs.RO","submitted_at":"2025-07-02T17:34:52+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"The survey frames VLA models as pipelines that generate progressively grounded action tokens and classifies those tokens into eight types to guide future development.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"internet-scale data, which acquire broad and transferable capabilities by capturing the diverse knowledge and patterns embedded in their training corpora. As a prominent example, Large Language Models (LLMs), such as GPT-4 [3] and DeepSeek-R1 [4], excel at natural language understanding, reasoning, and generation, forming the backbone of many text-based applications. In parallel, Vision Foundation Models (VFMs), such as CLIP [5], DINO [6, 7], and SAM [8, 9], have shown strong generalization across a wide range of vision tasks. Building upon these, Vision-Language Models (VLMs), exemplified by GPT-4o [10], Gemini 2.5 Pro [11], and Qwen2.5-VL [12], integrate visual and textual modalities to enable multimodal processing and generation. Collectively, these models encode vast world knowledge, exhibit strong performance on complex tasks, and"},{"citing_arxiv_id":"2506.09110","ref_index":77,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"CodeBrain: Bridging Decoupled Tokenizer and Multi-Scale Architecture for EEG Foundation Model","primary_cat":"cs.LG","submitted_at":"2025-06-10T17:20:39+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"CodeBrain introduces a decoupled TFDual-Tokenizer and multi-scale EEGSSM architecture for an EEG foundation model pretrained on a large corpus, claiming strong generalization across eight downstream tasks and ten datasets.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2505.19519","ref_index":56,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Preserve and Personalize: Personalized Text-to-Image Diffusion Models without Distributional Drift","primary_cat":"cs.CV","submitted_at":"2025-05-26T05:03:59+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Proposes Lipschitz regularization during fine-tuning to prevent distributional drift in personalized diffusion models, improving subject fidelity and prompt adherence.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}