{"total":20,"items":[{"citing_arxiv_id":"2605.17368","ref_index":22,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"RadGenome-Anatomy: A Large-Scale Anatomy-Labeled Chest Radiograph Dataset via Physically Grounded Volumetric Projection","primary_cat":"cs.CV","submitted_at":"2026-05-17T10:22:56+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"RadGenome-Anatomy is a large-scale chest radiograph dataset with anatomy labels obtained by projecting 3D CT masks into 2D radiographic space for 210 structures in 25,692 studies.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.16922","ref_index":51,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Motion Cues from Image-based Point Tracking for LiDAR Scene Flow Estimation","primary_cat":"cs.CV","submitted_at":"2026-05-16T10:28:54+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"TrackCue uses dense image-space trajectories from point tracking and ego-motion compensation to improve static-dynamic classification and supervision for LiDAR scene flow estimation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.13073","ref_index":11,"ref_count":2,"confidence":0.55,"is_internal_anchor":false,"paper_title":"HarmoGS: Robust 3D Gaussian Splatting in the Wild via Conflict-Aware Gradient Harmonization","primary_cat":"cs.CV","submitted_at":"2026-05-13T06:47:09+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"HarmoGS adds semantic consistency-guided masking and dual-view orthogonal gradient harmonization to 3D Gaussian Splatting to reduce artifacts from distractors and cross-view illumination 
inconsistencies.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.12678","ref_index":39,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"No One Knows the State of the Art in Geospatial Foundation Models","primary_cat":"cs.CV","submitted_at":"2026-05-12T19:29:51+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":6.0,"formal_verification":"none","one_line_summary":"An audit of 152 papers reveals that geospatial foundation models lack standardized evaluations, training controls, and weight releases, so no one knows the state of the art.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.12501","ref_index":40,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Covering Human Action Space for Computer Use: Data Synthesis and Benchmark","primary_cat":"cs.CV","submitted_at":"2026-05-12T17:59:58+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Presents CUActSpot benchmark and renderer-LLM data synthesis that lets a 4B model outperform larger open-source models on complex computer interactions.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"placed onto blank canvases, with their type, color, size, canvas background color, width, and height all sampled at random. Different shapes may require different forms of positional annotation; for example, triangles are annotated by the coordinates of their vertices. All such geometric information is recorded in the annotations. 4.4 Natural Image Modal For natural images, we use data from SAM [ 40]. For each image, we first randomly sample five regions. Because these regions do not come with sufficiently detailed captions, we use GPT-4o [41] to generate fine-grained descriptions for each selected region. 
SAM itself provides the bounding box and segmentation mask for every region. Based on these masks, we apply the Suzuki-Abe contour extraction algorithm [42], followed by contour sampling, to obtain polygonal boundary curves."},{"citing_arxiv_id":"2605.11107","ref_index":19,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Birds of a Feather Flock Together: Background-Invariant Representations via Linear Structure in VLMs","primary_cat":"cs.CV","submitted_at":"2026-05-11T18:13:05+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Exploiting linear structure in VLM embeddings, a synthetic-data pre-training method yields background-invariant representations that exceed 90% worst-group accuracy on Waterbirds even under 100% spurious correlation with no minority examples in training.","context_count":1,"top_context_role":"baseline","top_context_polarity":"baseline","context_text":"Tete Xiao, Spencer Whitehead, Alexander C Berg, Wan-Yen Lo, et al. Segment anything. In Proceedings of the IEEE/CVF international conference on computer vision, pages 4015-4026, 2023. [18] Ananya Kumar, Aditi Raghunathan, Robbie Jones, Tengyu Ma, and Percy Liang. Fine-tuning can distort pretrained features and underperform out-of-distribution. arXiv preprint arXiv:2202.10054, 2022. [19] Phuong Quynh Le, Jörg Schlötterer, and Christin Seifert. Out of spuriousity: Improving robustness to spurious correlations without group annotations. arXiv preprint arXiv:2407.14974, 2024. [20] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. 
Microsoft coco: Common objects in context."},{"citing_arxiv_id":"2605.09904","ref_index":17,"ref_count":2,"confidence":0.55,"is_internal_anchor":false,"paper_title":"TOC-Bench: A Temporal Object Consistency Benchmark for Video Large Language Models","primary_cat":"cs.CV","submitted_at":"2026-05-11T02:47:59+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"TOC-Bench is a new diagnostic benchmark that reveals major weaknesses in temporal object consistency for Video-LLMs, including event counting, ordering, identity reasoning, and hallucination avoidance.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Each queried subject in TOC-Bench is grounded in a per-frame object track, and each question is derived from object-specific event timelines rather than from video-text description pairs. 2.3 Object-Centric Video Grounding and Tracking Object-centric video grounding has been advanced by segmentation, tracking, and referring video object segmentation methods such as Segment Anything [17], SAM2 [33], Track Anything [50], SAM2Long [10], and VideoRefer Suite [54]. These methods provide tools for localizing, segmenting, and following objects in videos. 
TOC-Bench uses them as infrastructure rather than as evaluation targets: the benchmark asks whether Video-LLMs can use object-track evidence to answer temporally grounded questions about the same object over time."},{"citing_arxiv_id":"2605.07550","ref_index":32,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Mind the Gap: Geometrically Accurate Generative Reconstruction from Disjoint Views","primary_cat":"cs.CV","submitted_at":"2026-05-08T10:24:23+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":8.0,"formal_verification":"none","one_line_summary":"GLADOS reconstructs 3D geometry from disjoint views by generating intermediate perspectives, performing robust coarse alignment that tolerates generative inconsistencies, and iteratively expanding context for consistency.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"strong local correspondences without explicit camera calibration. To tackle global scene alignment, sophisticated solvers and alignment frameworks such as Fast3R [10], Easi3r [11], Align3r [9], and VGGT [31] have been proposed to efficiently register and merge multiple partial 3D reconstructions. In parallel, advances in spatial understanding, such as 3D-SAM [32], have provided powerful tools for segmenting and structuring these extracted environments. Despite their robustness to noise and wide baselines, all of these state-of-the-art alignment strategies operate under the fundamental assumption of overlapping geometry. 
They fail to bridge the spatial gap when input sets share zero corresponding features."},{"citing_arxiv_id":"2605.06010","ref_index":8,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Adding Thermal Awareness to Visual Systems in Real-Time via Distilled Diffusion Models","primary_cat":"cs.CV","submitted_at":"2026-05-07T11:03:31+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"FusionProxy is a distilled diffusion-based fusion module that adds thermal awareness to RGB vision systems in real time as an independent plug-and-play component.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.02638","ref_index":16,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"ViewSAM: Learning View-aware Cross-modal Semantics for Weakly Supervised Cross-view Referring Multi-Object Tracking","primary_cat":"cs.CV","submitted_at":"2026-05-04T14:23:38+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ViewSAM achieves state-of-the-art weakly supervised performance on cross-view referring multi-object tracking by refining SAM tracklets via affinity-guided re-prompting and modeling view-induced variations as learnable conditions on SAM2.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.00061","ref_index":44,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"UniBCI: Towards a Unified Pretrained Model for Invasive Brain-Computer Interfaces","primary_cat":"cs.NE","submitted_at":"2026-04-30T06:30:24+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"UniBCI is a unified pretrained model for invasive neural spike data that uses CST tokenization, IAA attention, and 
self-supervised masked reconstruction to achieve SOTA downstream performance with better generalization and efficiency.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.05828","ref_index":51,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Precise Aggressive Aerial Maneuvers with Sensorimotor Policies","primary_cat":"cs.RO","submitted_at":"2026-04-07T13:00:59+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Reinforcement learning sensorimotor policies enable quadrotors to traverse narrow gaps at extreme tilts with 5 cm clearance using only vision and proprioception, including reactive traversal of moving gaps.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2509.20328","ref_index":11,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Video models are zero-shot learners and reasoners","primary_cat":"cs.LG","submitted_at":"2025-09-24T17:17:27+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Generative video models exhibit emergent zero-shot capabilities across perception, manipulation, and basic reasoning tasks.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"through few-shot in-context learning [7, 9] and zero-shot learning [10]. Zero-shot learning here means that prompting a model with a task instruction replaces the need for fine-tuning or adding task-specific inference heads. Machine vision today in many ways resembles the state of NLP a few years ago: There are excellent task-specific models like \"Segment Anything\" [11, 12] for segmentation or YOLO variants for object detection [13, 14]. 
While attempts to unify some vision tasks exist [15-25], no existing model can solve any problem just by prompting. However, the exact same primitives that enabled zero-shot learning in NLP also apply to today's generative video models-large-scale training with a generative objective (text/video continuation) on web-scale data [26]."},{"citing_arxiv_id":"2507.05193","ref_index":36,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"RAM-W600: A Multi-Task Wrist Dataset and Benchmark for Rheumatoid Arthritis","primary_cat":"eess.IV","submitted_at":"2025-07-07T16:53:22+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":8.0,"formal_verification":"none","one_line_summary":"Introduces RAM-W600, the first public multi-task dataset of wrist conventional radiographs with instance segmentation annotations and Sharp/van der Heijde bone erosion scores for rheumatoid arthritis research.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2507.02546","ref_index":29,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"MoGe-2: Accurate Monocular Geometry with Metric Scale and Sharp Details","primary_cat":"cs.CV","submitted_at":"2025-07-03T11:40:01+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"MoGe-2 recovers metric-scale 3D point maps with fine details from single images via data refinement and extension of affine-invariant predictions.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9492-9502, 2024. [28] Alex Kendall, Yarin Gal, and Roberto Cipolla. Multi-task learning using uncertainty to weigh losses for scene geometry and semantics. 
In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 7482-7491, 2018. [29] Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C Berg, Wan-Yen Lo, et al. Segment anything. In Proceedings of the IEEE/CVF international conference on computer vision, pages 4015-4026, 2023. [30] Tobias Koch, Lukas Liebel, Friedrich Fraundorfer, and Marco Körner. Evaluation of cnn-based single-"},{"citing_arxiv_id":"2506.09082","ref_index":44,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"AVA-Bench: Atomic Visual Ability Benchmark for Vision Foundation Models","primary_cat":"cs.CV","submitted_at":"2025-06-10T05:43:34+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"AVA-Bench evaluates vision foundation models by disentangling 14 atomic visual abilities with aligned training-test distributions to reveal precise ability fingerprints.","context_count":1,"top_context_role":"baseline","top_context_polarity":"baseline","context_text":"1 Introduction Vision Foundation Models (VFMs), pre-trained on large and diverse datasets, have become central to AI by providing transferable features for a wide range of downstream tasks [41, 2, 14, 5]. The variety of pre-training objectives and supervision signals has led to a proliferation of specialized VFMs- such as DINOv2 [71], CLIP [73], and SAM [44]-each excelling in distinct visual capabilities while often exhibiting interesting emergent properties [25, 6, 70]. Consequently, establishing a systematic and effective evaluation protocol for VFMs has become increasingly crucial. Existing evaluation protocols can generally be categorized into two groups. 
The first focuses on task-specific capabilities, typically attaching tailored heads to VFMs, followed by fine-tuning and"},{"citing_arxiv_id":"2506.01247","ref_index":2,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Beyond Interpretability: When, Why, and How Sparse Autoencoders Enable Label-Free Visual Steering","primary_cat":"cs.CV","submitted_at":"2025-06-02T01:51:20+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2505.20275","ref_index":31,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"ImgEdit: A Unified Image Editing Dataset and Benchmark","primary_cat":"cs.CV","submitted_at":"2025-05-26T17:53:33+00:00","verdict":"CONDITIONAL","verdict_confidence":"UNKNOWN","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ImgEdit supplies 1.2 million curated edit pairs and a three-part benchmark that let a VLM-based model outperform prior open-source editors on adherence, quality, and detail preservation.","context_count":1,"top_context_role":"method","top_context_polarity":"use_method","context_text":"Moh: Multi-head attention as mixture-of-head attention. arXiv preprint arXiv:2410.11842, 2024. [30] Bahjat Kawar, Shiran Zada, Oran Lang, Omer Tov, Huiwen Chang, Tali Dekel, Inbar Mosseri, and Michal Irani. Imagic: Text-based real image editing with diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6007-6017, 2023. 11 [31] Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C Berg, Wan-Yen Lo, et al. Segment anything. In Proceedings of the IEEE/CVF international conference on computer vision, pages 4015-4026, 2023. 
[32] Yuval Kirstain, Adam Polyak, Uriel Singer, Shahbuland Matiana, Joe Penna, and Omer Levy."},{"citing_arxiv_id":"2505.15616","ref_index":39,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"LENS: Multi-level Evaluation of Multimodal Reasoning with Large Language Models","primary_cat":"cs.CV","submitted_at":"2025-05-21T15:06:59+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"LENS is a new multi-level benchmark dataset for evaluating MLLMs on perception-to-reasoning tasks using the same images across all levels with recent social media content.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2505.09568","ref_index":13,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"BLIP3-o: A Family of Fully Open Unified Multimodal Models-Architecture, Training and Dataset","primary_cat":"cs.CV","submitted_at":"2025-05-14T17:11:07+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"BLIP3-o uses a diffusion transformer to generate CLIP image features and a sequential pretraining strategy to build open models that perform strongly on both image understanding and generation benchmarks.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"noise at each step, the real images are generated by the VAE decoder. VAE + MSE Because our focus is on autoregressive + diffusion framework, we exclude VAE + MSE approaches, as they do not incorporate any diffusion module. Implementation Details To compare various design choices, we use Llama-3.2-1B-Instruct as autoregressive model. Our training data consists of CC12M [3], SA-1B [13], and JourneyDB [30], amounting to approximately 25 million samples. 
For CC12M and SA-1B, we utilize the detailed captions generated by LLaVA, while for JourneyDB we use the original captions. The detailed description of image generation architecture using flow matching loss is provided in Section 5.1. 8 Results We report the FID score [10] on MJHQ-30k [15] for visual aesthetic quality, along with"}],"limit":50,"offset":0}