{"total":160,"items":[{"citing_arxiv_id":"2605.23747","ref_index":17,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Revitalizing Dense Material Segmentation: Stabilized Vision Transformers and the Generalization Paradox","primary_cat":"cs.CV","submitted_at":"2026-05-22T15:20:27+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"Stabilized SegFormer-B5 reaches 0.4572 mIoU SOTA on original Apple DMS split; 80/10/10 split reaches 0.5276 mIoU but degrades real-world OOD performance per qualitative review.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.23655","ref_index":2,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"CVSearch: Empowering Multimodal LLMs with Cognitive Visual Search for High-Resolution Image Perception","primary_cat":"cs.CV","submitted_at":"2026-05-22T14:07:44+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"CVSearch proposes an Assess-then-Search workflow combining expert-assisted search with Semantic Guided Adaptive Patching and Dynamic Bottom-Up Search to improve efficiency and accuracy on high-resolution image tasks for MLLMs.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.23610","ref_index":2,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"EM-Vid: Training-Free Entity-Centric Memory for Efficient and Consistent Multi-Shot Video Generation","primary_cat":"cs.CV","submitted_at":"2026-05-22T13:20:29+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"EM-Vid introduces an entity-centric latent patch memory bank with sparse token conditioning and budgeted updates for training-free consistent multi-shot video 
generation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.23523","ref_index":2,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"ComPose: When to Trust Hands for Object Pose Tracking","primary_cat":"cs.CV","submitted_at":"2026-05-22T11:39:49+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"ComPose tracks object poses in hand-occluded RGB videos by adaptively fusing cues from object and hand foundation models, selecting informative joints, and enforcing temporal consistency without external smoothing.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.22962","ref_index":13,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"GazeBehavior Annotation Toolkit (GBAT): AI-powered toolkit for automatic annotation of egocentric eye-tracking and video data of child-caregiver interaction","primary_cat":"cs.CV","submitted_at":"2026-05-21T18:47:56+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"GBAT is an AI toolkit that automates synchronization, gaze annotation, and action categorization for egocentric eye-tracking videos of early childhood interactions.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.22272","ref_index":62,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Imagine2Real: Towards Zero-shot Humanoid-Object Interaction via Video Generative Priors","primary_cat":"cs.RO","submitted_at":"2026-05-21T10:15:39+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Imagine2Real enables zero-shot humanoid-object interaction by unifying motions as 4D point trajectories, tracking 
only base/hands/object keypoints inside a BFM latent space, and training with progressive simple rewards for mocap deployment.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.22183","ref_index":34,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Action with Visual Primitives","primary_cat":"cs.RO","submitted_at":"2026-05-21T08:52:47+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"AVP architecture has VLM emit visual-primitive tokens to condition flow-matching action expert, yielding 27.61% higher success rate than pi_0.5 on real-robot pick-and-place tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.22144","ref_index":11,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"One Sentence, One Drama: Personalized Short-Form Drama Generation via Multi-Agent Systems","primary_cat":"cs.CV","submitted_at":"2026-05-21T08:15:46+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"A hierarchical multi-agent framework converts a single sentence into a short drama using debate-based scripting, 3D-grounded first frames for spatial consistency, and multi-stage reviewer loops.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.22068","ref_index":2,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"COCOTree: A Dataset and Benchmark for Open Tree-Structured Visual Decomposition","primary_cat":"cs.CV","submitted_at":"2026-05-21T07:03:19+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"COCOTree is a 21K-image benchmark with 1.8M nodes and an OTQ metric for the new task of open 
tree-structured visual decomposition.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.21186","ref_index":16,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"SAM-Sode: Towards Faithful Explanations for Tiny Bacteria Detection","primary_cat":"cs.CV","submitted_at":"2026-05-20T13:51:22+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SAM-Sode refines explanation maps for tiny bacteria detection by converting them into prompts for the SAM3 model and applying physical and geometric dual constraints to suppress background noise.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.21059","ref_index":7,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Multimodal LLMs under Pairwise Modalities","primary_cat":"cs.CV","submitted_at":"2026-05-20T11:44:01+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A two-stage framework enables multimodal LLMs to learn shared latent representations from pairwise modality data and achieve cross-modal generation when incorporating new modalities.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.20808","ref_index":13,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Spatial Gram Alignment for Ultra-High-Resolution Image Synthesis","primary_cat":"cs.CV","submitted_at":"2026-05-20T06:59:12+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Spatial Gram Alignment aligns internal self-similarities of LDM features with foundation priors to reconcile global structure and fine details in ultra-high-resolution text-to-image 
synthesis.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.20676","ref_index":3,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"VISTAQA: Benchmarking Joint Visual Question Answering and Pixel-Level Evidence","primary_cat":"cs.CV","submitted_at":"2026-05-20T03:44:06+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"VISTAQA is a new benchmark for joint visual question answering correctness and pixel-level grounding, evaluated with the GROVE metric that uses per-sample geometric mean to require both dimensions to succeed.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.20448","ref_index":5,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Do Vision-Language Models Understand 3D Scenes or Just Catalogue Objects?","primary_cat":"cs.CV","submitted_at":"2026-05-19T20:01:19+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.20158","ref_index":14,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Rethinking Visual Attribution for Chest X-ray Reasoning in Large Vision Language Models","primary_cat":"cs.CV","submitted_at":"2026-05-19T17:46:40+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Existing visual attribution methods often fail to identify the visual evidence used by LVLMs in chest X-ray reasoning, while MedFocus using unbalanced optimal transport and targeted interventions substantially outperforms them across multiple models and 
settings.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.19958","ref_index":48,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"TravExplorer: Cross-Floor Embodied Exploration via Traversability-Aware 3-D Planning","primary_cat":"cs.RO","submitted_at":"2026-05-19T15:11:29+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"TravExplorer couples zero-shot semantic guidance with traversability-aware 3-D planning to enable cross-floor object navigation in unseen indoor environments.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.19528","ref_index":6,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Towards Camera-Robust 3D Localization: Equation-Anchored Tool-Use for MLLMs","primary_cat":"cs.CV","submitted_at":"2026-05-19T08:30:21+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Proposes an equation-anchored tool-use method for MLLMs that writes the pinhole back-projection equation in Chain-of-Thought and substitutes retrieved camera intrinsics and depths to achieve robustness in 3D object detection and visual grounding under rescaled intrinsics.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.18481","ref_index":5,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"OCCAM: Open-set Causal Concept explAnation and Ontology induction for black-box vision Models","primary_cat":"cs.AI","submitted_at":"2026-05-18T14:33:25+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"OCCAM discovers open-set visual concepts, estimates causal contributions via object-level interventions on 
black-box vision models, and induces a global concept ontology from aggregated dataset evidence.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.18150","ref_index":1,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Whispers in the Noise: Surrogate-Guided Concept Awakening via a Multi-Agent Framework","primary_cat":"cs.AI","submitted_at":"2026-05-18T09:55:36+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ConceptAgent is a black-box multi-agent system that awakens erased concepts in diffusion models by initializing denoising trajectories from surrogate-guided noisy states.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.17823","ref_index":69,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Why We Look Where We Look: Emergent Human-like Fixations of a Foveated Visual Language Model Maximizing Scene Understanding","primary_cat":"cs.CV","submitted_at":"2026-05-18T03:55:28+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A foveated VLM trained for scene comprehension produces human-like fixations, outperforming models trained for search, classification, or with altered peripheral vision.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.17743","ref_index":67,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"MoASE++: Mixture of Activation Sparsity Experts with Domain-Adaptive On-policy Distillation for Continual Test Time Adaptation","primary_cat":"cs.CV","submitted_at":"2026-05-18T01:52:12+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"MoASE++ combines activation 
sparsity experts with domain-adaptive on-policy distillation to achieve state-of-the-art continual test-time adaptation on image classification and segmentation benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.17630","ref_index":3,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"SegRAG: Training-Free Retrieval-Augmented Semantic Segmentation","primary_cat":"cs.CV","submitted_at":"2026-05-17T19:51:32+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SegRAG is a training-free retrieval-augmented framework that extracts class-specific point prompts from a filtered DINOv3 feature bank to boost SAM3 semantic segmentation performance on standard and agricultural benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.17531","ref_index":6,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Don't Guess, Just Ask: Resolving Ambiguity in Referring Segmentation via Multi-turn Clarification","primary_cat":"cs.CV","submitted_at":"2026-05-17T16:30:44+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.17179","ref_index":70,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"iMiGUE-3K: A Large-Scale Benchmark for Micro-Gesture Analysis with Self-Supervised Learning","primary_cat":"cs.CV","submitted_at":"2026-05-16T22:30:29+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":8.0,"formal_verification":"none","one_line_summary":"iMiGUE-3K is the largest in-the-wild micro-gesture video dataset with 3.4K clips and 37M frames from real interviews, supporting self-supervised foundation models 
and benchmarks that show micro-gestures improve emotion understanding.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.16951","ref_index":6,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Edit-GRPO: A Locality-Preserving Policy Optimization Framework for Image Editing","primary_cat":"cs.CV","submitted_at":"2026-05-16T12:05:39+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Edit-GRPO decouples editing and preservation objectives via region-specific signals in a policy optimization framework to improve locality in image editing tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.16743","ref_index":70,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"LACE: Latent Visual Representation for Cross-Embodiment Learning","primary_cat":"cs.RO","submitted_at":"2026-05-16T01:50:18+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"LACE aligns human-robot visual features via semantic distribution matching on corresponding body parts plus Gram loss, yielding 65% better zero-shot policy transfer than baseline DINO.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.16672","ref_index":6,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Multi-Object Tracking Consistently Improves Wildlife Inference","primary_cat":"cs.CV","submitted_at":"2026-05-15T22:13:25+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"Applying multi-object tracking to fuse softmax probabilities across frames in camera trap data yields weighted F1-score gains of 5.1%, 3.1%, and 2.0% over standalone classifiers 
on three datasets.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.16079","ref_index":2,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"VideoSeeker: Incentivizing Instance-level Video Understanding via Native Agentic Tool Invocation","primary_cat":"cs.CV","submitted_at":"2026-05-15T15:43:28+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"VideoSeeker integrates agentic reasoning and visual prompts into LVLMs via automated data synthesis, cold-start supervision, and RL training, yielding +13.7% gains on instance-level video tasks over baselines including GPT-4o.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.15942","ref_index":2,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Decomposed Vision-Language Alignment for Fine-Grained Open-Vocabulary Segmentation","primary_cat":"cs.CV","submitted_at":"2026-05-15T13:27:01+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Decomposed Vision-Language Alignment framework factorizes prompts into concept and attribute tokens with Feature-Gated Cross-Attention for better compositional generalization in fine-grained open-vocabulary segmentation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.15764","ref_index":10,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"GRASP: Learning to Ground Social Reasoning in Multi-Person Non-Verbal Interactions","primary_cat":"cs.CV","submitted_at":"2026-05-15T09:24:41+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"GRASP is a large-scale dataset and benchmark for social reasoning grounded in 
gaze and gesture events in multi-person videos, with Social Grounding Reward (SGR) proposed to improve model performance on GRASP-Bench.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.15497","ref_index":135,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"AnyAct: Towards Human Reenactment of Character Motion From Video","primary_cat":"cs.CV","submitted_at":"2026-05-15T00:23:36+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"AnyAct generates editable human reenactments from character videos via conditional motion generation from transferable sparse local 2D articulated cues, with designs for human-only supervision and global-local decoupling.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.15397","ref_index":105,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"ELDOR: A Dataset and Benchmark for Illegal Gold Mining in the Amazon Rainforest","primary_cat":"cs.CV","submitted_at":"2026-05-14T20:30:25+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Introduces the ELDOR UAV dataset and four benchmark tasks for semantic segmentation and classification of mining disturbances and ecological recovery in rainforest imagery.","context_count":1,"top_context_role":"baseline","top_context_polarity":"baseline","context_text":"Sample-F1, which indicates that pixel-level supervision is more helpful for balanced recognition across classes and for recovering complete label sets in ambiguous mixed patches. 
5.4 Results on VLM-based Recognition We evaluate VLM-based recognition on ELDOR across the train, validation, and test splits using RemoteCLIP [51], GeoRSCLIP [52], DOFA-CLIP [103], RS-LLaVA [54], GeoChat [53], VHM [55], RemoteSAM [104], and SAM3 [105]. See detailed protocols and results in Appendix B.4 and C.5. Table 5: Selected VLM-based recognition results across splits. Overall, generative VLMs provide the strongest image-level performance, while SAM3 also remains competitive. Here, SAM3† denotes the version using detailed class-description prompts. See Appendix B.4 for the protocol definitions."},{"citing_arxiv_id":"2605.15186","ref_index":2,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"VGGT-Edit: Feed-forward Native 3D Scene Editing with Residual Field Prediction","primary_cat":"cs.CV","submitted_at":"2026-05-14T17:59:04+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"VGGT-Edit proposes a native 3D text-conditioned editing framework using depth-synchronized injection and residual field prediction, plus the DeltaScene dataset, outperforming 2D-lifting methods.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.14696","ref_index":6,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"EponaV2: Driving World Model with Comprehensive Future Reasoning","primary_cat":"cs.CV","submitted_at":"2026-05-14T11:12:23+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"EponaV2 advances perception-free driving world models by forecasting comprehensive future 3D geometry and semantic representations, achieving SOTA planning performance on NAVSIM benchmarks.","context_count":1,"top_context_role":"method","top_context_polarity":"use_method","context_text":"planning should also utilize both the 3D geometry and 
semantic information of the environment. To this end, we introduce a simple yet effective perception-free mechanism: the prediction of future depth and semantic maps. By enforcing future depth prediction, EponaV2 comprehensively captures the 3D geometry and motion dynamics of surrounding objects. Furthermore, forecasting future feature maps derived from large-scale segmentation models [6] imparts a profound semantic understanding of the scene, also ensuring that EponaV2 remains focused on elements critical to driving decisions. Without relying on manual perception labels, the comprehensive forecasting approach builds a strong real-world understanding and future reasoning ability for EponaV2. Consequently, the inferred future representations yield rich and actionable information for trajectory planning, substantially"},{"citing_arxiv_id":"2605.14579","ref_index":8,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Med-DisSeg: Dispersion-Driven Representation Learning for Fine-Grained Medical Image Segmentation","primary_cat":"cs.CV","submitted_at":"2026-05-14T08:49:36+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"Med-DisSeg uses a dispersive loss on batch representations plus adaptive multi-scale decoding to achieve state-of-the-art fine-grained segmentation on five medical imaging datasets.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.14552","ref_index":32,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"LiWi: Layering in the 
Wild","primary_cat":"cs.CV","submitted_at":"2026-05-14T08:30:34+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.14534","ref_index":37,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"PROVE: A Perceptual RemOVal cohErence Benchmark for Visual Media","primary_cat":"cs.CV","submitted_at":"2026-05-14T08:16:51+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"PROVE proposes RC metrics for perceptual removal coherence and releases PROVE-Bench to better align automatic scores with human judgments on object removal tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.14274","ref_index":5,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"CreFlow: Corrective Reflow for Sparse-Reward Embodied Video Diffusion RL","primary_cat":"cs.CV","submitted_at":"2026-05-14T02:18:58+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":7.0,"formal_verification":"none","one_line_summary":"CreFlow combines LTL compositional rewards with credit-aware NFT and corrective reflow losses in online RL to improve embodied video diffusion models, raising downstream task success by 23.8 percentage points on eight bimanual manipulation tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.13838","ref_index":205,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"R-DMesh: Video-Guided 3D Animation via Rectified Dynamic Mesh 
Flow","primary_cat":"cs.CV","submitted_at":"2026-05-13T17:58:13+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"R-DMesh generates high-fidelity 4D meshes aligned to video by disentangling base mesh, motion, and a learned rectification jump offset inside a VAE, then using Triflow Attention and rectified-flow diffusion.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.13798","ref_index":18,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"VoxCor: Training-Free Volumetric Features for Multimodal Voxel Correspondence","primary_cat":"cs.CV","submitted_at":"2026-05-13T17:20:26+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"VoxCor creates reusable volumetric features from frozen 2D ViT models by combining triplanar inference with a closed-form weighted partial least squares projection, enabling direct voxel correspondence across modalities without training or registration.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.13632","ref_index":5,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Guide, Think, Act: Interactive Embodied Reasoning in Vision-Language-Action Models","primary_cat":"cs.RO","submitted_at":"2026-05-13T14:58:29+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":6.0,"formal_verification":"none","one_line_summary":"GTA-VLA conditions VLA models on user spatial priors to produce a unified spatial-visual chain-of-thought, reaching 81.2% success on SimplerEnv WidowX and improving performance under out-of-distribution shifts.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"While such designs improve efficiency, they may also reduce the 
transparency and fine-grained controllability of the reasoning process compared with explicit spatially grounded intermediate representations. Interactive Perception and Visual Prompting. Recent work on interactive perception has substantially improved the spatial grounding ability of foundation models. In computer vision, the SAM family [19,26,5] demonstrates that simple geometric prompts can support strong zero-shot segmentation, while subsequent systems such as T-Rex2 [15] and Rex-Omni [14] extend this paradigm to open-vocabulary and interactive object detection with both text and visual prompts. In parallel, multimodal large language models (MLLMs) [7,1,28] have also become increasingly capable of fine-grained visual grounding."},{"citing_arxiv_id":"2605.13428","ref_index":2,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"SID: Sliding into Distribution for Robust Few-Demonstration Manipulation","primary_cat":"cs.RO","submitted_at":"2026-05-13T12:22:40+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SID achieves approximately 90% success on six real-world manipulation tasks with only two demonstrations under out-of-distribution initializations, with less than 10% performance drop under distractors and disturbances.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"learned from a small set of canonicalized demonstrations and (ii) an egocentric execution policy. We first canonicalize demonstrations, yielding a consistent motion manifold that suppresses scene- and camera-specific variation. In practice, this canonicalization can be supported by modern 6D pose estimation and tracking [46] and object-level segmentation [2]. 
From the resulting manifold, we learn a continuous motion field over monotonic approach-phase segments, implemented as a gradient-descent-style dynamical system that attracts states toward the demonstrated support. At test time, integrating the field produces large corrective motions when far from demonstrations and naturally vanishes near convergence,"},{"citing_arxiv_id":"2605.13122","ref_index":4,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Early Semantic Grounding in Image Editing Models for Zero-Shot Referring Image Segmentation","primary_cat":"cs.CV","submitted_at":"2026-05-13T07:48:05+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Pretrained instruction-based image editing models exhibit early foreground-background separability that enables a training-free framework for zero-shot referring image segmentation using a single denoising step.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.13105","ref_index":42,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"What to Ignore, What to React: Visually Robust RL Fine-Tuning of VLA Models","primary_cat":"cs.RO","submitted_at":"2026-05-13T07:15:37+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"PAIR-VLA adds invariance and sensitivity objectives over paired visual variants during PPO fine-tuning of VLA models, yielding 9-16% average gains on ManiSkill3 under distractors, textures, poses, viewpoints, and lighting shifts.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.13047","ref_index":6,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Revealing the Gap in Human and VLM Scene Perception through Counterfactual Semantic 
Saliency","primary_cat":"cs.CV","submitted_at":"2026-05-13T06:11:47+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":6.0,"formal_verification":"none","one_line_summary":"VLMs exhibit size, center, and saliency biases in scene understanding, relying less on people than humans do, with size bias as a key driver of divergence.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.16393","ref_index":7,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Vision Transformer-Conditioned UNet for Domain-Adaptive Semantic Segmentation","primary_cat":"cs.CV","submitted_at":"2026-05-12T11:56:46+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"ViTC-UNet adapts frozen ViT representations to biomedical semantic segmentation by conditioning a UNet via learnable tokens and two-way attention decoding.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.11951","ref_index":6,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"From Reaction to Anticipation: Proactive Failure Recovery through Agentic Task Graph for Robotic Manipulation","primary_cat":"cs.RO","submitted_at":"2026-05-12T11:00:45+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"AgentChord models manipulation tasks as directed graphs enriched with anticipatory recovery branches, using specialized agents to enable immediate, low-latency failure responses and improve success on long-horizon bimanual tasks.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"lation tasks [18, 28, 30]. 
Recent studies show that integrating large pre-trained models into robotic systems substantially improves perception, decision-making, and control, with growing interest in agentic frameworks [26, 56]. In such systems, perception agents powered by Vision-Language Models (VLMs), including CLIP [40], Grounding DINO [35], and SAM3 [6], enable open-vocabulary recognition and segmentation for flexible object manipulation in dynamic environments. Reasoning and planning agents based on Large Language Models (LLMs) or code-generation frameworks such as ProgPrompt [43] and Code-as-Policy [32] further allow robots to generate task plans and actions from high-level language instructions."},{"citing_arxiv_id":"2605.11818","ref_index":37,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"RevealLayer: Disentangling Hidden and Visible Layers via Occlusion-Aware Image Decomposition","primary_cat":"cs.CV","submitted_at":"2026-05-12T09:09:01+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"RevealLayer decomposes natural images into multiple RGBA layers using diffusion models with region-aware attention, occlusion-guided adaptation, and a composite loss, outperforming prior methods on a new benchmark dataset.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.11756","ref_index":4,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Focusable Monocular Depth Estimation","primary_cat":"cs.CV","submitted_at":"2026-05-12T08:30:53+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"FocusDepth is a prompt-conditioned framework that fuses SAM3 features into Depth Anything models via Multi-Scale Spatial-Aligned Fusion to improve target-region depth accuracy on the new 
FDE-Bench.","context_count":1,"top_context_role":"method","top_context_polarity":"use_method","context_text":"fidelity while preserving coherent global scene geometry. To study FDE, we propose FocusDepth, a prompt-conditioned monocular relative depth estimation framework designed to balance local target sensitivity with global geometric coherence. Our key intuition is to combine the complementary priors of promptable segmentation and monocular depth foundation models: Segment Anything Model 3 (SAM3) [4] provides prompt-grounded spatial selectivity for identifying user-specified target regions, while the Depth Anything family (DAs), such as DA2 [34] and DA3 [16], provides a strong pretrained prior over dense scene geometry. However, directly fusing these models is nontrivial, as SAM3 is optimized for 2D prompt-driven localization whereas DAs are optimized for globally coherent 3D geometry."},{"citing_arxiv_id":"2605.11616","ref_index":4,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Grounding by Remembering: Cross-Scene and In-Scene Memory for 3D Functional Affordances","primary_cat":"cs.CV","submitted_at":"2026-05-12T06:44:47+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"AFFORDMEM improves AP50 by 3.23-3.7 points on SceneFun3D splits by using a reusable cross-scene affordance memory bank and in-scene spatial memory to guide VLMs toward actionable 3D regions.","context_count":1,"top_context_role":"method","top_context_polarity":"use_method","context_text":"We additionally report mean IoU (mIoU) and Average Recall at the same thresholds (AR25, AR50). Implementation details. All language-based stages of AFFORDMEM use the same Qwen3-VL-32B [2] model for query parsing, VLM grounding, and instance selection, ensuring that performance differences arise from the proposed memory mechanisms rather than model ensembling. 
Segmentation uses SAM3 [4] on a single NVIDIA A100 80 GB GPU. Depth consistency filter: k=3, τmin=0.05 m. Multi-view voting: ρ0=0.70, θvis=3, DBSCAN ε=0.03 m, merge IoU threshold θIoU=0.30, recall threshold θrec=0.60. Frame selection: K=20, w1=w2=0.5. All hyperparameters are fixed across all scenes and splits without scene-specific tuning. For cross-split evaluation, the memory bank for"}],"limit":50,"offset":0}