{"work":{"id":"69dffacb-bfe8-442d-be86-48624c60426f","openalex_id":null,"doi":null,"arxiv_id":"2502.13923","raw_key":null,"title":"Qwen2.5-VL Technical Report","authors":null,"authors_text":"Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al","year":2025,"venue":"cs.CV","abstract":"We introduce Qwen2.5-VL, the latest flagship model of Qwen vision-language series, which demonstrates significant advancements in both foundational capabilities and innovative functionalities. Qwen2.5-VL achieves a major leap forward in understanding and interacting with the world through enhanced visual recognition, precise object localization, robust document parsing, and long-video comprehension. A standout feature of Qwen2.5-VL is its ability to localize objects using bounding boxes or points accurately. It provides robust structured data extraction from invoices, forms, and tables, as well as detailed analysis of charts, diagrams, and layouts. To handle complex inputs, Qwen2.5-VL introduces dynamic resolution processing and absolute time encoding, enabling it to process images of varying sizes and videos of extended durations (up to hours) with second-level event localization. This allows the model to natively perceive spatial scales and temporal dynamics without relying on traditional normalization techniques. By training a native dynamic-resolution Vision Transformer (ViT) from scratch and incorporating Window Attention, we reduce computational overhead while maintaining native resolution. As a result, Qwen2.5-VL excels not only in static image and document understanding but also as an interactive visual agent capable of reasoning, tool usage, and task execution in real-world scenarios such as operating computers and mobile devices. Qwen2.5-VL is available in three sizes, addressing diverse use cases from edge AI to high-performance computing. The flagship Qwen2.5-VL-72B model matches state-of-the-art models like GPT-4o and Claude 3.5 Sonnet, particularly excelling in document and diagram understanding. Additionally, Qwen2.5-VL maintains robust linguistic performance, preserving the core language competencies of the Qwen2.5 LLM.","external_url":"https://arxiv.org/abs/2502.13923","cited_by_count":null,"metadata_source":"pith","metadata_fetched_at":"2026-06-30T14:14:45.492616+00:00","pith_arxiv_id":"2502.13923","created_at":"2026-05-09T02:27:20.922186+00:00","updated_at":"2026-06-30T14:14:45.492616+00:00","title_quality_ok":false,"display_title":"Qwen2.5-VL Technical Report","render_title":"Qwen2.5-VL Technical Report"},"hub":{"state":{"work_id":"69dffacb-bfe8-442d-be86-48624c60426f","tier":"mega_hub","tier_reason":"1,000+ Pith inbound or 100,000+ external citations","pith_inbound_count":1011,"external_cited_by_count":null,"distinct_field_count":20,"first_pith_cited_at":"2024-06-12T09:36:52+00:00","last_pith_cited_at":"2026-06-29T17:20:26+00:00","author_build_status":"needed","summary_status":"needed","contexts_status":"needed","graph_status":"needed","ask_index_status":"needed","reader_status":"needed","recognition_status":"needed","updated_at":"2026-06-30T16:00:10.231373+00:00","tier_text":"mega_hub"},"tier":"mega_hub","role_counts":[{"context_role":"background","n":152},{"context_role":"baseline","n":57},{"context_role":"method","n":57},{"context_role":"dataset","n":5},{"context_role":"other","n":3}],"polarity_counts":[{"context_polarity":"background","n":146},{"context_polarity":"use_method","n":59},{"context_polarity":"baseline","n":56},{"context_polarity":"unclear","n":6},{"context_polarity":"use_dataset","n":5},{"context_polarity":"support","n":2}],"runs":{"ask_index":{"job_type":"ask_index","status":"succeeded","result":{"title":"Qwen2.5-VL Technical Report","claims":[{"claim_text":"We introduce Qwen2.5-VL, the latest flagship model of Qwen vision-language series, which demonstrates significant advancements in both foundational capabilities and innovative functionalities. Qwen2.5-VL achieves a major leap forward in understanding and interacting with the world through enhanced visual recognition, precise object localization, robust document parsing, and long-video comprehension. A standout feature of Qwen2.5-VL is its ability to localize objects using bounding boxes or points accurately. It provides robust structured data extraction from invoices, forms, and tables, as wel","claim_type":"abstract","evidence_strength":"source_metadata"}],"why_cited":"Pith tracks Qwen2.5-VL Technical Report because it crossed a citation-hub threshold.","role_counts":[]},"error":null,"updated_at":"2026-05-13T18:13:28.003986+00:00"},"author_expand":{"job_type":"author_expand","status":"succeeded","result":{"authors_linked":[{"id":"2bc8cfa9-c2ce-48ef-87d4-e0de52ae9a92","orcid":null,"display_name":"Shuai Bai"},{"id":"dd80061e-c973-48f3-af57-2764aa5433c7","orcid":null,"display_name":"Keqin Chen"},{"id":"540ccffd-e57b-4849-b7e9-9cd5c0c17482","orcid":null,"display_name":"Xuejing Liu"},{"id":"3d40df98-a97a-46a5-903b-b388be7785ab","orcid":null,"display_name":"Jialin Wang"},{"id":"1f4a16b8-5401-4f0d-9de5-c7d725338b8e","orcid":null,"display_name":"Wenbin Ge"},{"id":"5788a139-4b79-4854-8880-11a56d0c51b4","orcid":null,"display_name":"Sibo Song"}]},"error":null,"updated_at":"2026-05-13T18:13:28.001713+00:00"},"context_extract":{"job_type":"context_extract","status":"succeeded","result":{"enqueued_papers":25},"error":null,"updated_at":"2026-05-13T18:13:27.728863+00:00"},"graph_features":{"job_type":"graph_features","status":"succeeded","result":{"co_cited":[{"title":"Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution","work_id":"8abcfe4f-e0fb-44b7-9123-448fac95f90a","shared_citers":112},{"title":"Qwen3-VL Technical Report","work_id":"1fe243aa-e3c0-4da6-b391-4cbcfc88d5c0","shared_citers":100},{"title":"DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models","work_id":"c5006563-f3ec-438a-9e35-b7b484f34828","shared_citers":82},{"title":"InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models","work_id":"fe8637aa-12bc-4434-8d36-9f57b5eebcbe","shared_citers":80},{"title":"GPT-4o System Card","work_id":"f37bf1c7-4964-4e56-9762-d20da8d9009f","shared_citers":79},{"title":"GPT-4 Technical Report","work_id":"b928e041-6991-4c08-8c81-0359e4097c7b","shared_citers":78},{"title":"DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning","work_id":"e6b75ad5-2877-4168-97c8-710407094d20","shared_citers":72},{"title":"LLaVA-OneVision: Easy Visual Task Transfer","work_id":"f5f2452b-f2a9-49ac-b38d-c76e18cdfe49","shared_citers":70},{"title":"Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities","work_id":"008df105-2fdd-45d8-857a-8e35868aecb6","shared_citers":64},{"title":"InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency","work_id":"b8f5e260-fff5-444e-bcf5-2c42cfefd83d","shared_citers":61},{"title":"Qwen3 Technical Report","work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e","shared_citers":61},{"title":"Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling","work_id":"ee70bdc8-4656-4849-ada7-ce42a2278d70","shared_citers":57},{"title":"Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond","work_id":"cbc2bb21-b6bb-46c0-80bf-107e195ffe10","shared_citers":42},{"title":"Gemini: A Family of Highly Capable Multimodal Models","work_id":"83f7c85b-3f11-450f-ac0c-64d9745220b2","shared_citers":38},{"title":"Proximal Policy Optimization Algorithms","work_id":"240c67fe-d14d-4520-91c1-38a4e272ca19","shared_citers":32},{"title":"The Llama 3 Herd of Models","work_id":"1549a635-88af-4ac1-acfe-51ae7bb53345","shared_citers":32},{"title":"Emerging Properties in Unified Multimodal Pretraining","work_id":"e0cfd82c-f5d4-44fd-b531-ec73ab0a805b","shared_citers":30},{"title":"OpenAI o1 System Card","work_id":"68d3c334-0fc9-49e3-b7b0-a69afae933e2","shared_citers":29},{"title":"Gemma 3 Technical Report","work_id":"f93e08bf-9e96-409b-8ac6-b8385fd17fd7","shared_citers":28},{"title":"LLaVA-Video: Video Instruction Tuning With Synthetic Data","work_id":"e598f516-d992-449a-ab6d-6c788b3a1d7b","shared_citers":28},{"title":"DAPO: An Open-Source LLM Reinforcement Learning System at Scale","work_id":"64019d00-0b11-4bbd-b173-b46c8fad0157","shared_citers":27},{"title":"Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context","work_id":"80e3e977-f1bb-4c83-8d0c-1ab0a0c5c3f1","shared_citers":27},{"title":"MiniCPM-V: A GPT-4V Level MLLM on Your Phone","work_id":"0f06e436-0c76-4e3c-be5e-6168f6bc4336","shared_citers":26},{"title":"SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features","work_id":"50eec732-2d41-432f-9dcf-ac7fff235ea5","shared_citers":26}],"time_series":[{"n":25,"year":2025},{"n":387,"year":2026}]},"error":null,"updated_at":"2026-05-13T17:25:55.668312+00:00"},"identity_refresh":{"job_type":"identity_refresh","status":"succeeded","result":{"fixed":1,"items":[{"title":"Qwen3 Technical Report","work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e","resolver":"local_arxiv","confidence":0.98,"old_work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e"}],"errors":[],"attempted":1},"error":null,"updated_at":"2026-05-13T18:13:27.104671+00:00"},"reader_index":{"job_type":"reader_index","status":"succeeded","result":{"note":"annotated reader requires full-text/OA fetch; shell is wired for mega hubs","status":"reader queued"},"error":null,"updated_at":"2026-06-30T11:30:01.262604+00:00"},"recognition_alignment":{"job_type":"recognition_alignment","status":"succeeded","result":{"modules":["IndisputableMonolith.Gravity.PropagationSpeed","IndisputableMonolith.Foundation.PreTemporalForcingOrder","IndisputableMonolith.Physics.LightConeCausalityFromRS","IndisputableMonolith.Cosmology.EtaBPrefactorDerivation","IndisputableMonolith.Physics.MaxwellEquationsFromRS","IndisputableMonolith.Gravity.BlackHoleEntropyFromLedger","IndisputableMonolith.Thermodynamics.FermiDirac","IndisputableMonolith.Gravity.BlackHoleHorizonStates"],"query_chars":1895},"error":null,"updated_at":"2026-06-30T11:30:01.245724+00:00"},"role_polarity":{"job_type":"role_polarity","status":"succeeded","result":{"title":"Qwen2.5-VL Technical Report","claims":[{"claim_text":"We introduce Qwen2.5-VL, the latest flagship model of Qwen vision-language series, which demonstrates significant advancements in both foundational capabilities and innovative functionalities. Qwen2.5-VL achieves a major leap forward in understanding and interacting with the world through enhanced visual recognition, precise object localization, robust document parsing, and long-video comprehension. A standout feature of Qwen2.5-VL is its ability to localize objects using bounding boxes or points accurately. It provides robust structured data extraction from invoices, forms, and tables, as wel","claim_type":"abstract","evidence_strength":"source_metadata"}],"why_cited":"Pith tracks Qwen2.5-VL Technical Report because it crossed a citation-hub threshold.","role_counts":[]},"error":null,"updated_at":"2026-05-13T18:13:27.732366+00:00"},"summary_claims":{"job_type":"summary_claims","status":"succeeded","result":{"title":"Qwen2.5-VL Technical Report","claims":[{"claim_text":"We introduce Qwen2.5-VL, the latest flagship model of Qwen vision-language series, which demonstrates significant advancements in both foundational capabilities and innovative functionalities. Qwen2.5-VL achieves a major leap forward in understanding and interacting with the world through enhanced visual recognition, precise object localization, robust document parsing, and long-video comprehension. A standout feature of Qwen2.5-VL is its ability to localize objects using bounding boxes or points accurately. It provides robust structured data extraction from invoices, forms, and tables, as wel","claim_type":"abstract","evidence_strength":"source_metadata"}],"why_cited":"Pith tracks Qwen2.5-VL Technical Report because it crossed a citation-hub threshold.","role_counts":[]},"error":null,"updated_at":"2026-05-13T17:25:52.717336+00:00"}},"summary":{"title":"Qwen2.5-VL Technical Report","claims":[{"claim_text":"We introduce Qwen2.5-VL, the latest flagship model of Qwen vision-language series, which demonstrates significant advancements in both foundational capabilities and innovative functionalities. Qwen2.5-VL achieves a major leap forward in understanding and interacting with the world through enhanced visual recognition, precise object localization, robust document parsing, and long-video comprehension. A standout feature of Qwen2.5-VL is its ability to localize objects using bounding boxes or points accurately. It provides robust structured data extraction from invoices, forms, and tables, as wel","claim_type":"abstract","evidence_strength":"source_metadata"}],"why_cited":"Pith tracks Qwen2.5-VL Technical Report because it crossed a citation-hub threshold.","role_counts":[]},"graph":{"co_cited":[{"title":"Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution","work_id":"8abcfe4f-e0fb-44b7-9123-448fac95f90a","shared_citers":112},{"title":"Qwen3-VL Technical Report","work_id":"1fe243aa-e3c0-4da6-b391-4cbcfc88d5c0","shared_citers":100},{"title":"DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models","work_id":"c5006563-f3ec-438a-9e35-b7b484f34828","shared_citers":82},{"title":"InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models","work_id":"fe8637aa-12bc-4434-8d36-9f57b5eebcbe","shared_citers":80},{"title":"GPT-4o System Card","work_id":"f37bf1c7-4964-4e56-9762-d20da8d9009f","shared_citers":79},{"title":"GPT-4 Technical Report","work_id":"b928e041-6991-4c08-8c81-0359e4097c7b","shared_citers":78},{"title":"DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning","work_id":"e6b75ad5-2877-4168-97c8-710407094d20","shared_citers":72},{"title":"LLaVA-OneVision: Easy Visual Task Transfer","work_id":"f5f2452b-f2a9-49ac-b38d-c76e18cdfe49","shared_citers":70},{"title":"Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities","work_id":"008df105-2fdd-45d8-857a-8e35868aecb6","shared_citers":64},{"title":"InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency","work_id":"b8f5e260-fff5-444e-bcf5-2c42cfefd83d","shared_citers":61},{"title":"Qwen3 Technical Report","work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e","shared_citers":61},{"title":"Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling","work_id":"ee70bdc8-4656-4849-ada7-ce42a2278d70","shared_citers":57},{"title":"Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond","work_id":"cbc2bb21-b6bb-46c0-80bf-107e195ffe10","shared_citers":42},{"title":"Gemini: A Family of Highly Capable Multimodal Models","work_id":"83f7c85b-3f11-450f-ac0c-64d9745220b2","shared_citers":38},{"title":"Proximal Policy Optimization Algorithms","work_id":"240c67fe-d14d-4520-91c1-38a4e272ca19","shared_citers":32},{"title":"The Llama 3 Herd of Models","work_id":"1549a635-88af-4ac1-acfe-51ae7bb53345","shared_citers":32},{"title":"Emerging Properties in Unified Multimodal Pretraining","work_id":"e0cfd82c-f5d4-44fd-b531-ec73ab0a805b","shared_citers":30},{"title":"OpenAI o1 System Card","work_id":"68d3c334-0fc9-49e3-b7b0-a69afae933e2","shared_citers":29},{"title":"Gemma 3 Technical Report","work_id":"f93e08bf-9e96-409b-8ac6-b8385fd17fd7","shared_citers":28},{"title":"LLaVA-Video: Video Instruction Tuning With Synthetic Data","work_id":"e598f516-d992-449a-ab6d-6c788b3a1d7b","shared_citers":28},{"title":"DAPO: An Open-Source LLM Reinforcement Learning System at Scale","work_id":"64019d00-0b11-4bbd-b173-b46c8fad0157","shared_citers":27},{"title":"Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context","work_id":"80e3e977-f1bb-4c83-8d0c-1ab0a0c5c3f1","shared_citers":27},{"title":"MiniCPM-V: A GPT-4V Level MLLM on Your Phone","work_id":"0f06e436-0c76-4e3c-be5e-6168f6bc4336","shared_citers":26},{"title":"SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features","work_id":"50eec732-2d41-432f-9dcf-ac7fff235ea5","shared_citers":26}],"time_series":[{"n":25,"year":2025},{"n":387,"year":2026}]},"authors":[{"id":"3d40df98-a97a-46a5-903b-b388be7785ab","orcid":null,"display_name":"Jialin Wang","source":"manual","import_confidence":0.72},{"id":"dd80061e-c973-48f3-af57-2764aa5433c7","orcid":null,"display_name":"Keqin Chen","source":"manual","import_confidence":0.72},{"id":"2bc8cfa9-c2ce-48ef-87d4-e0de52ae9a92","orcid":null,"display_name":"Shuai Bai","source":"manual","import_confidence":0.72},{"id":"5788a139-4b79-4854-8880-11a56d0c51b4","orcid":null,"display_name":"Sibo Song","source":"manual","import_confidence":0.72},{"id":"1f4a16b8-5401-4f0d-9de5-c7d725338b8e","orcid":null,"display_name":"Wenbin Ge","source":"manual","import_confidence":0.72},{"id":"540ccffd-e57b-4849-b7e9-9cd5c0c17482","orcid":null,"display_name":"Xuejing Liu","source":"manual","import_confidence":0.72}]},"citers":{"total":1011,"items":[{"citing_arxiv_id":"2606.30577","ref_index":8,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"APRIL-MedSeg: A Modular Medical Image Segmentation Toolbox Embracing Modern Paradigms","primary_cat":"cs.CV","submitted_at":"2026-06-29T17:20:26+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"APRIL-MedSeg is a new open-source modular toolbox that uses YAML configuration and component registries to unify multiple advanced paradigms for medical image segmentation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.30378","ref_index":4,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"OmniCoT: A Benchmark for Global and Multi-Step Panoramic Reasoning","primary_cat":"cs.CV","submitted_at":"2026-06-29T14:38:20+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"OmniCoT is a new panoramic reasoning benchmark with 6.7K eval, 1K real, and 14.3K training examples plus a two-stage SFT+GRPO training method to enforce global 360-degree consistency.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.30288","ref_index":3,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"VisReflect: Latent Visual Reflection for Fine-Grained Perception in Long Visual Context","primary_cat":"cs.CV","submitted_at":"2026-06-29T13:30:17+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"VisReflect generates continuous latent visual reflections to emphasize relevant visual features and guide attention in LVLMs, yielding 4.1% gains on image benchmarks and 1.8% on video benchmarks with 44% less inference time than zooming methods.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.30220","ref_index":3,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"From Accuracy to Visual Dependence: Auditing and Filtering Modality Collapse in Traffic VideoQA","primary_cat":"cs.CV","submitted_at":"2026-06-29T12:33:15+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Audit of four VideoQA benchmarks reveals text-only shortcuts in VLMs; new diagnostics Blind Gap, Visual Gain, and Shortcut Score quantify and filter visual dependence.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.30217","ref_index":1,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Before Thinking, Learn to Decide: Proactive Routing for Efficient Visual Reasoning","primary_cat":"cs.CL","submitted_at":"2026-06-29T12:30:24+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"PRP introduces proactive routing via Draft Rating Learning and Joint Rating Learning to route queries early between draft and target models for efficient multimodal reasoning.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.30084","ref_index":127,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"One Forward Beats Two: InnerZoom for Accurate and Efficient GUI Grounding","primary_cat":"cs.CV","submitted_at":"2026-06-29T10:20:39+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"InnerZoom bridges cross-layer evidence in one forward pass to achieve SOTA GUI grounding accuracy on six benchmarks while cutting latency up to 31.8% versus two-pass baselines.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.30054","ref_index":3,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Illuminating Unified Multimodal Model for Free-form Interleaved Text-Image Generation","primary_cat":"cs.CV","submitted_at":"2026-06-29T09:45:15+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"ILLUME-X is a unified multimodal model that generates free-form interleaved text-image sequences via an expanded data pipeline, progressive self-adaptive training, and ILScore evaluation, claiming outperformance over prior unified models on style transfer, image decomposition, and storytelling.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.29915","ref_index":4,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"H-GRPO: Permutation-Invariant Reinforcement Learning for Grounded Visual Reasoning","primary_cat":"cs.CV","submitted_at":"2026-06-29T07:51:25+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"De-compositional Evidence Grounding decomposes visual reasoning into atomic sub-questions each tied to a specific image region to improve VLM performance and interpretability.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.29905","ref_index":2,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"StrucTab: A Structured Optimization Framework for Table Parsing","primary_cat":"cs.CV","submitted_at":"2026-06-29T07:41:39+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"StrucTab achieves SOTA table parsing performance by unifying structural subtasks through sequential reasoning and using decomposed RL rewards in Uni-TabRL, plus a new TableVerse-5K benchmark.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.29879","ref_index":5,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"LWDrive: Layer-Wise World-Model-Guided Vision-Language Model Planning for Autonomous Driving","primary_cat":"cs.CV","submitted_at":"2026-06-29T07:17:13+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"LWDrive refines coarse VLM trajectories via future-frame supervision and a multi-layer Foresight Cascade Planner, reporting scores of 92.0 on NAVSIM and 89.6 on NAVSIM-v2.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.29808","ref_index":2,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Making Multimodal LLMs Reliable Chart Data Extractors: A Benchmark and Training Framework","primary_cat":"cs.HC","submitted_at":"2026-06-29T05:40:35+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Introduces a benchmark for MLLM-based chart data extraction from unlabeled images and a human-centered training framework that reaches SOTA numerical accuracy with a 7B model.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.29805","ref_index":2,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Clearer Sight, Fewer Lies: Oriented Pickup Preference Optimization for Multimodal Hallucination Mitigation","primary_cat":"cs.CV","submitted_at":"2026-06-29T05:33:22+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"OPPO is an evidence-aware preference optimization that contrasts faithful responses under varying visual evidence strengths to reduce hallucinations in MLLMs.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.29705","ref_index":2,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"GUICrafter: Weakly-Supervised GUI Agent Leveraging Massive Unannotated Screenshots","primary_cat":"cs.AI","submitted_at":"2026-06-29T02:16:21+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"GUICrafter uses curriculum learning on unannotated GUI screenshots for visual grounding followed by RL calibration on limited labels to match or exceed prior GUI agents with far less annotation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.29579","ref_index":2,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"ScAle: Attention Head Scaling as a Minimal Adapter for Spatial Reasoning in Vision Language Models","primary_cat":"cs.CV","submitted_at":"2026-06-28T19:48:39+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ScAle learns scalar coefficients to modulate last-token attention and MLP activations in frozen VLMs, achieving up to 134.1% relative accuracy gains on spatial benchmarks with only 1K parameters.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.29462","ref_index":6,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"MIRROR: Aligning Semantic Relations from Language to Image via Gromov--Wasserstein","primary_cat":"cs.CV","submitted_at":"2026-06-28T15:39:05+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"MIRROR derives a closed-form Semi-Inverse Gromov-Wasserstein loss to align language-derived relational priors with visual representations inside decoder-only Transformers.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.29376","ref_index":2,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"SAD-GS: Learning Reliable 3D Semantic Gaussian Fields via Dynamic Geo-Semantic Anchoring","primary_cat":"cs.CV","submitted_at":"2026-06-28T12:52:46+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"SAD-GS proposes dynamic geo-semantic anchoring via SAD and GSFL to learn reliable 3D semantic Gaussian fields, reporting best performance on LERF-OVS, 3D-OVS, and Mip-NeRF360 for open-vocabulary localization and segmentation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.29357","ref_index":36,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Dynamic Parsing and Updating Natural Language Specification using VLMs for Robust Vision-Language Tracking","primary_cat":"cs.CV","submitted_at":"2026-06-28T12:12:18+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A language dependency parsing mechanism combined with Qwen-VL enables adaptive updates to textual descriptions for improved vision-language tracking performance on benchmarks like TNL2K and LaSOT.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.29350","ref_index":4,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Fast Enough to Act: Spatio-Temporal Visual Token Merging for Low-Latency Robotic VLMs and VLAs","primary_cat":"cs.CV","submitted_at":"2026-06-28T11:42:55+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"ST-Merge is a plug-and-play spatio-temporal token merging method that delivers 2x speedup on VLMs and 8.3x on a VLA at high resolution with minimal accuracy loss via 3D coordinate matching and positional correction.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.29308","ref_index":1,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"MirrorPPR: Exemplar-Based Portrait Photo Retouching","primary_cat":"cs.CV","submitted_at":"2026-06-28T10:07:38+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"MirrorPPR extracts retouching operations from exemplar pairs via a dedicated extractor and transfers them to query images through a LoRA-adapted Diffusion Transformer, enabled by a new 47-million-pair dataset and self-augmentation for alignment.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.29213","ref_index":3,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Can OCR-VLMs Read Devanagari? A Stress-Test Benchmark and Post-Correction Study","primary_cat":"cs.CL","submitted_at":"2026-06-28T05:46:05+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Benchmark of ten OCR systems on Devanagari shows synthetic text fails to separate models while real scans cause most to collapse, with specialized VLMs fragile and English performance not predictive of Indic results.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.29069","ref_index":18,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Low-cost concept-based localized explanations: How far can we get with training-free approaches?","primary_cat":"cs.AI","submitted_at":"2026-06-27T20:12:30+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Mid-scale MLLMs reach 62-88% object-level exact-match accuracy in zero-shot localized concept naming via closed-set prompting and an embedding-based Open-CoNa strategy across datasets.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.28864","ref_index":2,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"On Test-Time Scaling for Vision-Language Models","primary_cat":"cs.CV","submitted_at":"2026-06-27T11:12:42+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Small well-performing LVLMs gain the most from test-time scaling with up to 30% improvements that can match or exceed larger models, while visual information is used mainly early in reasoning chains.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.28862","ref_index":23,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"HKVLM: Faithful Reasoning Grounding by Binding Language Queries to a Frozen Detector","primary_cat":"cs.CV","submitted_at":"2026-06-27T11:10:33+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"HKVLM trains only an alignment hook to bind frozen LM query embeddings to frozen detector proposals via contrastive retrieval and bipartite assignment, yielding 50-90x grounding gains and reduced hallucinations on RefCOCO and POPE.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.28696","ref_index":26,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"COMPASS: Grounding Composition-Intent Guidance in Unified Multimodal Models","primary_cat":"cs.AI","submitted_at":"2026-06-27T02:43:13+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"COMPASS is a unified multimodal framework using a shared expert token τ_c to ground composition-intent for both perception and generation, backed by the new Comp-11 dataset.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.27988","ref_index":3,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Latent Visual Diffusion Reasoning with Monte Carlo Tree Search","primary_cat":"cs.CV","submitted_at":"2026-06-26T11:35:01+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"LVDR integrates keypoint-guided MCTS into a latent diffusion reasoning model to deliver competitive skill assessment accuracy alongside explicit visual reasoning trajectories on four sports and surgical datasets.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.27974","ref_index":3,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"ProMSA:Progressive Multimodal Search Agents for Knowledge-Based Visual Question Answering","primary_cat":"cs.CV","submitted_at":"2026-06-26T11:23:18+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ProMSA is a progressive multimodal search agent for KB-VQA that iteratively selects search tools under budgets, trained via rejection-sampling SFT then TN-GSPO RL, reporting gains on E-VQA and InfoSeek over RAG baselines.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.27872","ref_index":49,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"S$^2$-VLA: State-Space Guided Vision-Language-Action Models for Long-Horizon Manipulation","primary_cat":"cs.RO","submitted_at":"2026-06-26T09:13:16+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"S²-VLA uses a state-space model to maintain a belief state that produces dynamic gating weights for fusing visual, language, and action features, claiming better long-horizon manipulation than 7B models with only 2B parameters.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.27828","ref_index":2,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Video-MME-Logical: A Controlled Diagnostic Benchmark for Video Temporal-Logical Reasoning","primary_cat":"cs.CV","submitted_at":"2026-06-26T08:12:04+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Introduces Video-MME-Logical benchmark for controlled diagnostic evaluation of temporal-logical reasoning in MLLMs via five operations and 25 fine-grained tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.25041","ref_index":3,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Wan-Streamer v0.1: End-to-end Real-time Interactive Foundation Models","primary_cat":"cs.CV","submitted_at":"2026-06-23T18:01:03+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Wan-Streamer is a unified end-to-end Transformer for low-latency streaming audio-visual interaction using block-causal attention on interleaved multimodal tokens.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.24253","ref_index":10,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"TuringViT: Making SOTA Vision Transformers Accessible to All","primary_cat":"cs.CV","submitted_at":"2026-06-23T07:42:22+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"TuringViT claims a new ViT design with linear attention and curated data that matches SOTA performance using 10% of typical pretraining data while supporting dynamic resolutions and improving VLM integration.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.18066","ref_index":2,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"NoiseTilt: Noise-Tilted Reverse Kernels for Diffusion Reward Alignment","primary_cat":"cs.LG","submitted_at":"2026-06-16T15:38:44+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"NTRK is a reward-guided diffusion sampler that uses a whitening operator to bias the noise term toward high-reward outcomes, outperforming baselines with up to 20x fewer sampling steps on aesthetic tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.17871","ref_index":8,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"StepGuard: Guarding Web Navigation via Single-Step Calibration","primary_cat":"cs.AI","submitted_at":"2026-06-16T12:42:09+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":3.0,"formal_verification":"none","one_line_summary":"StepGuard framework with DDPO and CANR claims SOTA navigation and answer accuracy on web benchmarks by switching policies and triggering reflection on low-confidence steps.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.17816","ref_index":4,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Conservation Laws for Modern Neural Architectures","primary_cat":"cs.LG","submitted_at":"2026-06-16T11:44:53+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Unified framework characterizes conservation laws for gradient flow in feedforward networks with GELU/SiLU/SwiGLU, multihead attention with positional encodings, and MoE models under various gating.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.16494","ref_index":1,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Lost at the End: Primacy Bias in Multimodal Retrieval-Augmented Question Answering","primary_cat":"cs.CL","submitted_at":"2026-06-15T09:57:48+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Multimodal KB-VQA exhibits a primacy bias where gold passages at prompt start outperform those at the end by 16-26 points, flipping the text-only lost-in-the-middle pattern.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.09181","ref_index":29,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Counterfactual Reasoning for Fine-Grained Evidence Disentanglement in VideoQA","primary_cat":"cs.CV","submitted_at":"2026-06-08T08:20:42+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"CREDiT applies counterfactual reasoning via structural causal models to decompose video representations into causal and non-causal parts for more reliable VideoQA on datasets like NExT-GQA and SportsQA.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.14752","ref_index":41,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"X-Tokenizer: A Multimodal Action Tokenizer for Vision-Language-Action Pretraining","primary_cat":"cs.CV","submitted_at":"2026-06-07T09:39:39+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"X-Tokenizer creates semantic action tokens via asymmetric residual quantization and contrastive pretraining on large trajectory data, outperforming prior methods like FAST on robotic tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.07706","ref_index":2,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"MLingualFC: Evaluating Jailbreak Vulnerabilities in Multilingual Vision-Language Models","primary_cat":"cs.CR","submitted_at":"2026-06-05T10:10:53+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"MLingualFC benchmark finds flowchart jailbreaks succeed at high rates for Latin-script languages but much lower rates for Punjabi in multilingual VLMs, pointing to language-dependent safety gaps.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.05778","ref_index":3,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Beyond Absolute Scores: Relative Edit-induced Difference for Generalizable Image Aesthetic Assessment","primary_cat":"cs.CV","submitted_at":"2026-06-04T07:07:01+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"RED-Aes learns aesthetic changes from edit-induced image pairs and a new RED-20k dataset via three-stage relative ranking training, claiming SOTA generalization over absolute MOS regression.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.04610","ref_index":3,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Selectivity Estimation for Semantic Filters on Image Data","primary_cat":"cs.DB","submitted_at":"2026-06-03T08:48:21+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Semantic Histograms treat semantic image filters as implicit range queries in embedding space and use two specificity estimators whose ensemble reduces end-to-end query optimization and execution overhead by up to 86%.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.28344","ref_index":64,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"PIXELRAG: Web Screenshots Beat Text for Retrieval-Augmented Generation","primary_cat":"cs.IR","submitted_at":"2026-06-01T23:20:51+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"PixelRAG shows that operating RAG entirely over web screenshots outperforms text-based retrieval on NQ, SimpleQA, MMSearch, LiveVQA, and MoNaCo, with up to 18.1% accuracy gains and 3x token savings via image compression.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.02951","ref_index":4,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"SCOPE: Real-Time Natural Language Camera Agent at the Edge","primary_cat":"cs.RO","submitted_at":"2026-06-01T23:07:44+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"SCOPE introduces an edge-deployable natural-language PTZ camera agent, a simulation benchmark, and evaluations showing that stronger small language models reduce hallucinations while perception remains the main bottleneck.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.02482","ref_index":4,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"X-Stream: Exploring MLLMs as Multiplexers for Multi-Stream Understanding","primary_cat":"cs.CV","submitted_at":"2026-06-01T16:52:11+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"X-Stream benchmark shows SOTA MLLMs score ~50% on concurrent multi-stream tasks and lack proactive ability, using a dual-verification pipeline to avoid single-stream bias.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.02171","ref_index":9,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"InsightVQA: High-Dimensional Emotion-Cognitive Visual Question Answering Benchmark","primary_cat":"cs.CV","submitted_at":"2026-06-01T12:30:12+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"The paper creates InsightVQA, a 725K QA-pair benchmark with perception, grounded-understanding, and cognition levels for emotion-cognitive visual question answering, plus a 30K-sample evaluation set and InsightNet baseline.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.01113","ref_index":19,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"R^3: Composed Video Retrieval via Reasoning-Guided Recalling and Re-ranking","primary_cat":"cs.CV","submitted_at":"2026-05-31T09:20:53+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"R^3 is a zero-shot pipeline that generates reasoning traces to augment composed video queries, fuses scores via agreement-gated residual, and re-ranks candidates for the CoVR-R challenge.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.00959","ref_index":23,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Towards Understanding Modality Interaction in Multimodal Language Models via Partial Information Decomposition","primary_cat":"cs.AI","submitted_at":"2026-05-31T02:29:36+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"PID applied to MLLMs identifies task-specific modality interaction profiles that generalize across models, extend to tri-modal cases, and yield initial performance gains via reweighting.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.00622","ref_index":30,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"MM-Snowball: Evaluating and Mitigating Hallucination Snowballing in Multimodal Multi-Turn Dialogue","primary_cat":"cs.CV","submitted_at":"2026-05-30T08:53:34+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"MM-Snowball benchmark diagnoses hallucination snowballing in multi-turn MLLM dialogues; CAVR mitigates it via dual visual rectification at representation and logit levels.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.00602","ref_index":103,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"ASAP: Advancing Medical Volumetric Representation Learning with Anatomy-aware Semantically-adaptive Pre-training","primary_cat":"cs.CV","submitted_at":"2026-05-30T07:59:21+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ASAP introduces an anatomy-aware semantically-adaptive pre-training method for medical volumetric vision-language models and reports state-of-the-art results on a new benchmark spanning 15 datasets and 22 tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.00592","ref_index":1,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Through the PRISM: Principle-Aware, Interpretable, and Multi-Scale Evaluation of Visual Designs","primary_cat":"cs.CV","submitted_at":"2026-05-30T07:44:52+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"PRISM benchmark perturbs Crello layouts into 110K samples isolating design principle violations, reveals limited sensitivity in several multimodal models, and proposes a multi-scale framework combining scorers, instruction-tuned VLMs, and prompt methods for interpretable design assessment.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.00576","ref_index":29,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Dynamic Resilient Spatio-Semantic Memory with Hybrid Localization for Mobile Manipulation","primary_cat":"cs.RO","submitted_at":"2026-05-30T06:58:03+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"DREAM is a mobile manipulation system that constructs online spatio-semantic voxel memory with redundancy-aware pruning and hybrid language-vision localization, reporting higher long-horizon success rates than DynaMem in dynamic lab scenes.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.00508","ref_index":1,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"V-LynX: Token Interface Alignment for Video+X LLMs","primary_cat":"cs.CV","submitted_at":"2026-05-30T03:54:44+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"V-LynX integrates novel modalities into frozen Video LLMs by aligning to an internalized continuous token manifold using unpaired unimodal data and attention/statistical matching.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.00435","ref_index":4,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Detect Before You Leap: Mirage Detection in Vision-Language Models","primary_cat":"cs.CV","submitted_at":"2026-05-29T23:51:35+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"TC-LIA detects mirage in VLMs via layer-wise image patch to question alignment in CLIP encoders, reaching 94.6-94.7% three-class accuracy and under 3% mirage rate across five domains and twelve backbones.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.31598","ref_index":6,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Linear Scaling Video VLMs for Long Video Understanding","primary_cat":"cs.CV","submitted_at":"2026-05-29T17:59:02+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"StateKV is an inference-time technique that replaces quadratic self-attention prefill in video VLMs with a fixed-capacity importance-based recurrent state, keeping accuracy near full attention on long-video benchmarks without retraining.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.31251","ref_index":1,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"ERGeoBench:A Comprehensive Benchmark for Embodied Reasoning and Geo-localization in Multimodal Large Language Models","primary_cat":"cs.CV","submitted_at":"2026-05-29T12:49:17+00:00","verdict":"ACCEPT","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"ERGeoBench is a new diagnostic benchmark evaluating MLLMs on four capabilities in three progressive embodied geo-localization settings, finding that models handle high-level semantics but struggle with fine-grained perception and metric localization.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.31192","ref_index":2,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"The Regularizing Power of Language-Training Deepfake Detectors","primary_cat":"cs.CV","submitted_at":"2026-05-29T12:01:17+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"A dual-encoder deepfake detector pairs a frozen specialist with a LoRA-tuned MLLM, trained first via binary alignment then via RL to reward explain-then-classify behavior, yielding improved cross-dataset performance and interpretability.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.30884","ref_index":5,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"GUI-C$^2$: Coarse-to-Fine GUI Grounding via Difficulty-Aware Reinforcement Learning","primary_cat":"cs.CV","submitted_at":"2026-05-29T06:17:53+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"GUI-C² pairs a difficulty-scoring data pipeline with an area-gated coarse-to-fine RL mechanism to improve GUI grounding accuracy and training stability.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.11232","ref_index":78,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Every Act Has Its Price: Compressed Moral Composition in Frontier LLMs","primary_cat":"cs.CL","submitted_at":"2026-05-29T02:36:10+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Moral Trolley Arena shows frontier LLMs produce composite moral preferences that are compressed rather than additive functions of calibrated component act strengths across Moral Foundations Theory.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.30557","ref_index":1,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Seeing Isn't Knowing: Do VLMs Know When Not to Answer Spatial Questions (and Why)?","primary_cat":"cs.CV","submitted_at":"2026-05-28T20:44:47+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Frontier VLMs overconfidently answer spatial questions under occlusion (~30% accuracy) and perspective ambiguity (<10% accuracy) instead of abstaining, and often fail to select helpful additional views.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.30344","ref_index":4,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Tiny but Trusted: Efficient Vision-Language Reasoning for Time-Series Anomaly Detection","primary_cat":"cs.AI","submitted_at":"2026-05-28T17:59:50+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"The authors create VisAnomBench with VLM-generated anomaly explanations and fine-tune VisAnomReasoner, reporting precision and F1 gains of at least 21 and 23 points on the new benchmark plus cross-benchmark improvements.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.30307","ref_index":64,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Grounded 3D-Aware Spatial Vision-Language Modeling","primary_cat":"cs.CV","submitted_at":"2026-05-28T17:51:38+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"GR3D is a VLM that combines explicit 2D, implicit 2D, and monocular 3D grounding mechanisms to improve performance on spatial understanding benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.30231","ref_index":3,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Beyond 3D VQAs: Injecting 3D Spatial Priors into Vision-Language Models for Enhanced Geometric Reasoning","primary_cat":"cs.CV","submitted_at":"2026-05-28T17:00:52+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"GASP injects geometric priors into VLMs via a deep-supervised correspondence head trained on video point correspondences and depth consistency, raising internal matching accuracy and delivering gains on spatial benchmarks without any 3D VQA data.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.30161","ref_index":4,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Why Far Looks Up: Probing Spatial Representation in Vision-Language Models","primary_cat":"cs.CV","submitted_at":"2026-05-28T16:18:01+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":7.0,"formal_verification":"none","one_line_summary":"VLMs exhibit consistent vertical-distance entanglement in embeddings from perspective bias in natural images, producing accuracy gaps that a new synthetic benchmark SpatialTunnel exposes as model-intrinsic.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.30062","ref_index":54,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"FakeVLM-R1: Internalizing Physical Laws via CoT for Synthetic Image Detection","primary_cat":"cs.CV","submitted_at":"2026-05-28T15:13:31+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"FakeVLM-R1 combines GRPO reinforcement learning with critical-thinking CoT and a physics-annotated FakeClue++ dataset to reach claimed SOTA synthetic image detection while reducing over-rejection of real images.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.30027","ref_index":4,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"DocRetriever: A Plug-and-Play Framework for Multimodal Document Retrieval with Comprehensive Benchmark","primary_cat":"cs.CV","submitted_at":"2026-05-28T14:50:53+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"DocRetriever introduces a framework using layout-aware sparse embeddings for hybrid encoding without OCR and a generalizable reasoning-augmented reranker for few-shot settings, plus the MultiDocR benchmark for evaluation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.30011","ref_index":34,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"VisualThink-VLA: Visual Intermediate Reasoning for Effective and Low-Latency Vision-Language-Action Policies","primary_cat":"cs.CV","submitted_at":"2026-05-28T14:36:53+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"VISUALTHINK-VLA uses visual evidence tokens and selective routing to reach top success rates on VLA benchmarks while cutting reasoning latency from multi-second to sub-second levels.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.30010","ref_index":1,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"EarlyTom: Early Token Compression Completes Fast Video Understanding","primary_cat":"cs.CV","submitted_at":"2026-05-28T14:36:44+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"EarlyTom is a training-free early token compression method inside the vision encoder with decoupled spatial selection that reduces TTFT up to 2.65x and FLOPs 61% on LLaVA-OneVision-7B while keeping accuracy comparable to full tokens.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.29879","ref_index":29,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"DGSG-Mind: Dynamic 3D Gaussian Scene Graphs for Long-Term Scene Understanding and Grounding","primary_cat":"cs.CV","submitted_at":"2026-05-28T13:04:17+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"DGSG-Mind is a hybrid 3D Gaussian dynamic scene graph system with an embodied reasoning agent for robust instance fusion, dynamic updates, and multimodal grounding in self-reconstructed maps.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.29662","ref_index":2,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"SAFE-Pruner: Semantic Attention-Guided Future-Aware Token Pruning for Efficient Vision-Language-Action Manipulation","primary_cat":"cs.CV","submitted_at":"2026-05-28T09:23:08+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"SAFE-Pruner forecasts deep-layer token saliency in VLA models via semantic attention consistency and adaptive subtask detection to achieve up to 1.89x speedup with under 1.7% success rate loss.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.29585","ref_index":1,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"World Models in Words: Auditing Physical State-Transition Commitments in Vision-Language Models","primary_cat":"cs.CL","submitted_at":"2026-05-28T08:29:32+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"WMW audits VLMs by requiring typed physical state-transition traces and using a verifier to detect inconsistencies missed by answer-only evaluation, with TraceBank as a released resource of synthetic scenarios.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.29471","ref_index":2,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"V2XCrafter: Learning to Generate Driving Scene Across Agents","primary_cat":"cs.CV","submitted_at":"2026-05-28T07:03:51+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"V2XCrafter introduces a progressive multi-agent diffusion model with cross-agent attention to generate controllable, consistent collaborative driving scenes for V2X data augmentation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.29198","ref_index":2,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Guidance Contrastive Token Credit Assignment for Discrete Policy Optimization","primary_cat":"cs.CV","submitted_at":"2026-05-28T00:17:49+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"GCPO performs per-token credit assignment in discrete policy optimization by setting token advantages proportional to the difference in model predictions under positive versus negative prompts, outperforming GRPO and DAPO on text-to-image and chain-of-thought tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.26122","ref_index":1,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"DocArena: Turning Raw Documents into Controllable Training Environments for Document Search Agents","primary_cat":"cs.CV","submitted_at":"2026-05-27T21:21:42+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"DocArena automates creation of multimodal document QA training data via MLLM-based structuring and cross-page reasoning pairs, yielding agents with top retrieval and QA performance in unified tests.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.28774","ref_index":73,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Agent Explorative Policy Optimization for Multimodal Agentic Reasoning","primary_cat":"cs.CL","submitted_at":"2026-05-27T17:36:39+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"AXPO addresses the Thinking-Acting Gap in agentic RL training by targeted resampling of tool calls in all-wrong subgroups, delivering +1.8pp gains over GRPO on nine multimodal benchmarks with an 8B model beating a 32B baseline on Pass@4.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.28615","ref_index":2,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Compositional Text-to-Image Generation Via Region-aware Bimodal Direct Preference Optimization","primary_cat":"cs.CV","submitted_at":"2026-05-27T15:27:13+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"BiDPO extends Diffusion DPO to bimodal preferences and adds region-aware guidance, improving compositional fidelity in text-to-image generation over prior methods.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.28486","ref_index":19,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Mag-VLA: Vision-Language-Action Model for Bimanual Magnetically Actuated Microrobot Manipulation","primary_cat":"cs.RO","submitted_at":"2026-05-27T13:44:00+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Mag-VLA uses a LoRA-adapted Qwen2.5-VL-7B with a phase classifier and ACT decoder on a new teleoperated dataset to reach 90% approach and 50-80% transport success in bimanual magnetic microrobot tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.28277","ref_index":41,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Do LLMs Build World Models From Text? A Multilingual Diagnostic of Spatial Reasoning","primary_cat":"cs.AI","submitted_at":"2026-05-27T10:20:53+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"MentalMap benchmark identifies a universal L3 reasoning cliff in LLMs' text-based spatial reasoning that persists across languages, scales, and prompting, and is replicated in human evaluations.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.28239","ref_index":1,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Learning to Label: A Reinforced Self-Evolving Framework for Semi-supervised Referring Expression Segmentation","primary_cat":"cs.CV","submitted_at":"2026-05-27T09:54:26+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"A reinforced self-evolving framework (L2L) for semi-supervised referring expression segmentation that jointly optimizes the segmentation model and pseudo-labels using multimodal priors and adaptive selection.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.28237","ref_index":3,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"POINav: Benchmarking and Enhancing Final-Meters Arrival in Real-World Vision-Language Navigation","primary_cat":"cs.RO","submitted_at":"2026-05-27T09:50:16+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"POINav-Bench provides the first high-fidelity real-world benchmark for POI-goal VLN using 3DGS reconstructions of 126k m² with 163 POIs, supported by a Brain-Action framework and 70K real signage-entrance dataset.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.28077","ref_index":41,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"MACReD: A Multi-Agent Collaborative Reasoning Framework for Reaction Diagram Parsing","primary_cat":"cs.AI","submitted_at":"2026-05-27T07:32:13+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"MACReD is a multi-agent collaborative reasoning framework for reaction diagram parsing that reports state-of-the-art F1 scores of 75.2% and 84.6% on the RxnScribe benchmark.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.27960","ref_index":5,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Mags-RL: Wearing Multimodal LLMs a Magnifying Glass via Agentic Reinforcement Learning For Complex Scene Reasoning","primary_cat":"cs.CV","submitted_at":"2026-05-27T04:54:07+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"Mags-RL uses agentic RL and a super-resolution agent for two-round reasoning in MLLMs, claiming gains on VSR, TallyQA, and GQA with a curriculum needing only 40 samples.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.11221","ref_index":2,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"LAST: Bridging Vision-Language and Action Manifolds via Gromov-Wasserstein Alignment","primary_cat":"cs.CV","submitted_at":"2026-05-27T03:48:10+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"LAST linearizes action manifolds with Lie-algebraic mapping and discretizes them into approximately isotropic charts to align with VL semantic geometry via Gromov-Wasserstein distance.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.00110","ref_index":224,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"General Covariant Action Modeling: Constructing Generalized Manifolds via Spatio-Temporal Decoupling","primary_cat":"cs.CV","submitted_at":"2026-05-27T03:38:15+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"GAM framework uses arc-length parameterization for temporal invariance and schema-affine factorization for geometric invariance to build a covariant action manifold integrated into VLA models for improved generalization from sparse data.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.27737","ref_index":4,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Bounded-Compute Multimodal Regression for Product-Rating Prediction","primary_cat":"cs.CV","submitted_at":"2026-05-26T22:27:58+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":3.0,"formal_verification":"none","one_line_summary":"Adapts SmolVLM2-256M-Video-Instruct for product-rating regression via MLP on decoder states, reporting 0.39 PLCC and 0.40 CES on official evaluation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.27318","ref_index":2,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Q-GeoMem: Question-Guided Geometric Memory for Video Spatial Reasoning","primary_cat":"cs.CV","submitted_at":"2026-05-26T17:26:29+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Q-GeoMem uses question-guided scoring to maintain a Fine-Grained Context Bank and Semantic-Geometric Evidence Bank, achieving SOTA on VSI-Bench and VSTI-Bench.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.27315","ref_index":1,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Real Images, Worse Judgments: Evaluating Vision-Language Models on Concreteness and Imagery","primary_cat":"cs.CL","submitted_at":"2026-05-26T17:24:59+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Real images degrade VLM alignment with human concreteness and imagery ratings relative to text-only inputs, linked to spurious visual cue sensitivity.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.27154","ref_index":2,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Touch-R1: Reinforcing Touch Reasoning in MLLMs","primary_cat":"cs.CV","submitted_at":"2026-05-26T15:14:56+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Touch-R1 applies GRPO reinforcement learning on a new 1M tactile dataset and benchmark to train a Qwen2.5-VL-7B model that outperforms baselines on tactile perception and visual-tactile conflict tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.27074","ref_index":3,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"IPIBench: Evaluating Interactive Proactive Intelligence of MLLMs under Continuous Streams","primary_cat":"cs.CV","submitted_at":"2026-05-26T14:23:25+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"IPIBench evaluates MLLMs on interactive proactive intelligence in streaming videos, identifies unstable triggering and poor coordination, and proposes the training-free IPI-Agent framework to improve performance across settings.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.26994","ref_index":1,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"ChartAct: A Benchmark for Dynamic Chart Understanding","primary_cat":"cs.CV","submitted_at":"2026-05-26T13:15:21+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ChartAct is a benchmark for dynamic chart understanding that shows most multimodal models struggle with interactive charts, with top performance at 84.5%.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.26691","ref_index":2,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Mind the Tool Failures: Achieving Synergistic Tool Gains for Medical Agents","primary_cat":"cs.AI","submitted_at":"2026-05-26T08:29:38+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"A GRPO-based RL framework with probabilistic risk minimization, disagreement-aware synergy rewards, and entropy-guided sampling enables instance-level tool selection that closes the single-oracle risk gap on medical benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.26680","ref_index":2,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"DynFrame: Adaptive Reasoning-Driven Multimodal Framework with Dynamic Frame Augmentation for Complex Video Understanding","primary_cat":"cs.CV","submitted_at":"2026-05-26T08:16:16+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"DynFrame introduces tokenized learnable span-density retrieval and Segment-Decoupled GRPO in video MLLMs, achieving competitive or SOTA results on six benchmarks with 4B and 8B models.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.26546","ref_index":2,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"MobileExplorer: Accelerating On-Device Inference for Mobile GUI Agents via Online Exploration","primary_cat":"cs.AI","submitted_at":"2026-05-26T04:53:53+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"MobileExplorer reduces on-device GUI agent reasoning steps and latency by 23% via parallel UI exploration, structured memory, and a two-level rollback while maintaining or improving task success rates.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.00101","ref_index":18,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"CoCoVideo: The High-Quality Commercial-Model-Based Contrastive Benchmark for AI-Generated Video Detection","primary_cat":"cs.CV","submitted_at":"2026-05-26T03:18:44+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Introduces a commercial-model contrastive AIGC video dataset and a hybrid contrastive-MLLM detection framework claiming SOTA performance on realistic video forgery detection.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.26104","ref_index":3,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"EVIDENT: Routing MLLM Adaptation through Entity-Grounded Visual Evidence for Cross-Domain Video Temporal Grounding","primary_cat":"cs.CV","submitted_at":"2026-05-25T17:58:11+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"EVIDENT routes MLLM adaptation for video temporal grounding through entity-grounded visual evidence using an Entity Bottleneck Adapter, Entity-Binding Distillation, and Entity-to-eVidence gating to improve cross-domain robustness.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.26102","ref_index":13,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"InstructSAM: Segment Any Instance with Any Instructions","primary_cat":"cs.CV","submitted_at":"2026-05-25T17:58:03+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"InstructSAM uses learnable queries in a VLM to condition SAM3 for single-pass multi-instance segmentation from arbitrary instructions, with a new Inst2Seg benchmark.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.26038","ref_index":3,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"DRScaffold: Boosting Dense-Scene Reasoning in Lightweight Vision Language Models","primary_cat":"cs.CV","submitted_at":"2026-05-25T17:05:52+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"New benchmark DRBench and four-stage supervision framework DRScaffold improve dense-scene reasoning in lightweight VLMs, with a 3B model surpassing a frozen 32B model on the benchmark while maintaining general performance.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.26014","ref_index":2,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"STORM: Internalized Modeling for Spatial-Temporal Reasoning in Video-Language Models","primary_cat":"cs.CV","submitted_at":"2026-05-25T16:33:00+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"STORM teaches LVLMs to internalize spatial-temporal reasoning via bounded latent trajectories trained with generated thought videos in two stages, improving accuracy on VideoMME, MVBench and similar benchmarks while lowering inference overhead.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.27382","ref_index":15,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"AI-Model Network: Concept, Current State and Future","primary_cat":"cs.AI","submitted_at":"2026-05-25T13:46:21+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"The paper introduces the concept, vision, and hierarchical architecture of a worldwide AI-model network (AI-ModelNet) for model interconnection, sharing, and collaboration, validated via a prototype.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.25829","ref_index":1,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"OASIS: Observation-Action Space Alignment via SE(3) Trajectory Prediction for Robotic Manipulation","primary_cat":"cs.RO","submitted_at":"2026-05-25T13:28:33+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"OASIS improves robotic manipulation success and generalization by predicting camera-frame SE(3) end-effector trajectories to condition the action decoder on pose-supervised states.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.00096","ref_index":50,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Diversity Over Frequency: Rethinking Tool Use in Visual Chain-of-Thought Agents","primary_cat":"cs.CV","submitted_at":"2026-05-25T13:06:59+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Visual CoT agents exhibit tool-use collapse where tool usage declines but task accuracy rises, and adding entropy regularization for rollout diversity produces the strongest performance.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.25706","ref_index":4,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Towards Open-World Referring Expression Comprehension: A Benchmark with Training-free Multi-task Consistency Checker","primary_cat":"cs.CV","submitted_at":"2026-05-25T11:05:37+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"OpenRef benchmark for open-world REC with F1 and N3R metrics and training-free MCC to improve existing models in complex scenarios.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.25661","ref_index":3,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"DRM: Diffusion-based Reward Model With Step-wise Guidance","primary_cat":"cs.CV","submitted_at":"2026-05-25T10:11:32+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"DRM turns a pre-trained diffusion model into a step-wise reward model and uses it for dense RL training (Step-wise GRPO) and guided sampling to improve final image quality.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":100,"offset":0}}