{"total":10,"items":[{"citing_arxiv_id":"2605.18413","ref_index":5,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Cracks in the Foundation: A Civil Infrastructure Dataset to Challenge Vision Foundation Models","primary_cat":"cs.CV","submitted_at":"2026-05-18T13:51:47+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":8.0,"formal_verification":"none","one_line_summary":"CiF is a large new civil infrastructure segmentation dataset that shows zero-shot foundation models and domain-supervised models plateau at roughly 25% mAP, establishing infrastructure inspection as an open challenge for current visual AI.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.16999","ref_index":5,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Ranking-Aware Calibration for Reliable Multimodal Reinforcement Learning","primary_cat":"cs.LG","submitted_at":"2026-05-16T13:51:29+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"RAC adds ranking-aware group loss and clean-corrupted pairwise loss to RL post-training to boost both accuracy and calibration in multimodal reasoning without extra annotations.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.14475","ref_index":12,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"GeoVista: Visually Grounded Active Perception for Ultra-High-Resolution Remote Sensing Understanding","primary_cat":"cs.CV","submitted_at":"2026-05-14T07:15:46+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"GeoVista introduces a planning-driven active perception framework with global exploration plans, branch-wise local inspection, and explicit evidence tracking to achieve state-of-the-art results on ultra-high-resolution remote sensing benchmarks.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"The higher average-turn curve reflects that the model spends additional observations when adaptive multi-region verification is needed, rather than following a fixed single-path process. 2 Related Works Remote Sensing Vision-Language Models.Recent multimodal large language models (MLLMs), including proprietary systems such as GPT-4 [ 7] and Gemini [ 8], as well as open-source model families such as LLaV A [9, 10, 11] and Qwen-VL [12, 13, 14], have shown strong visual-language understanding capabilities. To adapt such models to aerial and satellite imagery, remote sensing VLMs align RS-specific visual encoders with large language models, leading to systems such as RSGPT [15], SkyEyeGPT [16], GeoChat [17], EarthMind [18], and EarthVL [ 19]. Recent high- resolution extensions, such as GeoLLaV A-8K [5], further enlarge the visual context for remote"},{"citing_arxiv_id":"2605.10833","ref_index":3,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"MMVIAD: Multi-view Multi-task Video Understanding for Industrial Anomaly Detection","primary_cat":"cs.CV","submitted_at":"2026-05-11T16:49:38+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"MMVIAD is the first multi-view continuous video dataset for industrial anomaly detection with four supported tasks, and the VISTA model improves average benchmark scores from 45.0 to 57.5 on unseen data while surpassing GPT-5.4.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning. volume 35, pages 23716-23736, 2022. [2] Lisa Anne Hendricks, Oliver Wang, Eli Shechtman, Josef Sivic, Trevor Darrell, and Bryan Russell. Localizing moments in video with natural language. InProceedings of the IEEE international conference on computer vision, pages 5803-5812, 2017. [3] Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-vl: A versatile vision-language model for understanding, localization, text reading, and beyond, 2023. [4] Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al."},{"citing_arxiv_id":"2605.08985","ref_index":2,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"LLaVA-UHD v4: What Makes Efficient Visual Encoding in MLLMs?","primary_cat":"cs.CV","submitted_at":"2026-05-09T15:10:26+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"LLaVA-UHD v4 reduces visual-encoding FLOPs by 55.8% for high-resolution images in MLLMs via slice-based encoding plus intra-ViT early compression while matching or exceeding baseline performance on document, OCR, and VQA benchmarks.","context_count":1,"top_context_role":"method","top_context_polarity":"use_method","context_text":"We address two questions about this scheme. First, which connector design performs best? Second, is this post-ViT compression sufficient enough at high resolution? Table 3:Connector comparison. Downsampling Scale Resampler MLP 4× 4M 65.51 69.10 8M 64.80 71.73 16× 4M 65.87 66.64 8M 67.66 68.84 16M 70.39 70.81 Setup.Two families dominate the connector de- signs. Query-based resamplers [ 2, 1, 20] attend a small set of learnable queries to the ViT output via cross-attention. Spatial-merging MLPs [23, 8] fold neighboring patch tokens via pixel-unshuffle and project them through a lightweight feed- forward network. We first compare both under matched conditions, sharing the ViT backbone, LLM, training recipe, slice-based encoding, and"},{"citing_arxiv_id":"2605.06234","ref_index":7,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"RobotEQ: Transitioning from Passive Intelligence to Active Intelligence in Embodied AI","primary_cat":"cs.RO","submitted_at":"2026-05-07T13:22:26+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"RobotEQ is a new benchmark dataset and evaluation suite showing that current embodied AI models fall short on active social-norm compliance, especially spatial grounding, though RAG with external knowledge helps.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.06777","ref_index":26,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Walk the Talk: Bridging the Reasoning-Action Gap for Thinking with Images via Multimodal Agentic Policy Optimization","primary_cat":"cs.CV","submitted_at":"2026-04-08T07:48:07+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"MAPO improves multimodal chain-of-thought reasoning by requiring explicit textual descriptions of visual tool results and using a novel advantage estimator that combines semantic alignment with task rewards.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2509.18154","ref_index":5,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"MiniCPM-V 4.5: Cooking Efficient MLLMs via Architecture, Data, and Training Recipe","primary_cat":"cs.LG","submitted_at":"2025-09-16T19:41:48+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"An 8B MLLM reaches state-of-the-art efficiency and performance under 30B by combining a unified 3D resampler, joint document-text training, and hybrid RL for reasoning modes.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2407.01284","ref_index":13,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"We-Math: Does Your Large Multimodal Model Achieve Human-like Mathematical Reasoning?","primary_cat":"cs.AI","submitted_at":"2024-07-01T13:39:08+00:00","verdict":"ACCEPT","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"WE-MATH benchmark reveals most LMMs rely on rote memorization for visual math while GPT-4o has shifted toward knowledge generalization.","context_count":1,"top_context_role":"baseline","top_context_polarity":"baseline","context_text":"3, we utilize the four-dimensional metric described in section 2.2 for assessment. To avoid LMMs deduce answers from options, we introduce an extra uncertain option to mitigate this issue. Evaluation Models. We examine the performance of foundation models across two distinct cate- gories on WE-M ATH: (a) Closed-source LMMs: GPT-4o [38], GPT-4V [26], Gemini 1.5 Pro [40], Qwen-VL-Max [13], (b) Open-source LMMs: LLaV A-NeXT-110B, LLaV A-NeXT-70B [39], LLaV A- 1.6-13B, LLaV A-1.6-7B [41], DeepSeek-VL-1.3B, DeepSeek-VL-7B [42], Phi3-Vision-4.2B [43], MiniCPM-Llama3-V 2.5 [44], InternLM-XComposer2-VL-7B [45], InternVL-Chat-V1.5 [46], GLM- 4V-9B [47], LongV A [48], G-LLaV A-13B [29]. 3.1 Main Result Table 1 shows the overall performance of different LMMs on One-Step / Two-Step / Three-Step"},{"citing_arxiv_id":"2305.07895","ref_index":78,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"OCRBench: On the Hidden Mystery of OCR in Large Multimodal Models","primary_cat":"cs.CV","submitted_at":"2023-05-13T11:28:37+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":6.0,"formal_verification":"none","one_line_summary":"OCRBench provides the largest evaluation suite yet for OCR capabilities in large multimodal models, revealing gaps in multilingual, handwritten, and mathematical text handling.","context_count":1,"top_context_role":"baseline","top_context_polarity":"baseline","context_text":"InternVL2-2B [64] Shanghai AI Lab 781 InternLM-XComposer2 [73] Shanghai AI Lab 532 GLM-4v-9B [67] Zhipu AI 776 LLaV A-Next-Vicuna-7B [71] UW-Madison 532 CogVLM2-19B-Chat [67] Zhipu AI 757 LLaV A-Next-Mistral-7B [71] UW-Madison 531 InternVL2-1B [64] Shanghai AI Lab 755 RekaEdge [74] Reka AI 506 Gemini-1.5-Pro [75] Google 754 XVERSE-V-13B [76] XVERSE 489 Ovis1.5-Llama3-8B [77] Alibaba 744 Qwen-VL-Chat [78] Alibaba 488 Qwen-VL-Plus [78] Alibaba 726 InternLM-XComposer2-1.8B [73] Shanghai AI Lab 447 MiniCPM-Llama3-V2.5 [79] OpenBMB 725 Emu2_chat [80] BAAI 436 InternVL-Chat-V1.5 [64] Shanghai AI Lab 720 DeepSeek-VL-7B [81] DeepSeek 435 Claude3-Opus [82] Anthropic 694 OmniLMM-12B [83] OpenBMB 420 RekaFlash [74] Reka AI 692 TransCore-M [84] PCI Research 405"}],"limit":50,"offset":0}