{"total":17,"items":[{"citing_arxiv_id":"2606.07861","ref_index":31,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"The Last Visible Pixel: Probing Fine-Scale Perception in Vision-Language Models","primary_cat":"cs.CV","submitted_at":"2026-06-05T21:49:34+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"FineSightBench reveals VLMs perceive patterns down to 12px but show persistent failures in fine-scale reasoning such as numeracy and sequencing.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.07779","ref_index":13,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Do Vision-Language Models See Dwarf Galaxies the Way We Do?","primary_cat":"astro-ph.IM","submitted_at":"2026-06-05T18:45:25+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Zero-shot VLMs reproduce aggregate human annotations on dwarf galaxy detection but exhibit high per-example variability and unreliable self-reported confidence.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.04046","ref_index":57,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Dive into the Scene: Breaking the Perceptual Bottleneck in Vision-Language Decision Making via Focus Plan Generation","primary_cat":"cs.CV","submitted_at":"2026-06-02T07:50:56+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SceneDiver introduces a coarse-to-fine focus plan generation approach for VLMs that constructs holistic scene graphs then iteratively decomposes tasks, plus a distillation adapter for VLAs, to reduce visual hallucinations in embodied AI 
benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.23189","ref_index":19,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Empirical Bayes Conformal Prediction for Vision and Language Models","primary_cat":"cs.LG","submitted_at":"2026-05-22T03:17:14+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Empirical Bayes conformal prediction converts score variability into r-value nonconformity scores that preserve target coverage while reducing inclusion of high-variance false candidates in image classification, CLIP VLMs, and LLMs.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.21479","ref_index":122,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"WikiVQABench: A Knowledge-Grounded Visual Question Answering Benchmark from Wikipedia and Wikidata","primary_cat":"cs.CV","submitted_at":"2026-05-20T17:58:24+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":7.0,"formal_verification":"none","one_line_summary":"WikiVQABench is a human-curated collection of Wikipedia-based VQA items that require both visual evidence and external knowledge from Wikidata to answer correctly.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.20837","ref_index":20,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"ArchSIBench: Benchmarking the Architectural Spatial Intelligence of Vision-Language Models","primary_cat":"cs.CV","submitted_at":"2026-05-20T07:27:08+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"ArchSIBench is a new benchmark dataset and evaluation suite that measures vision-language models on architectural spatial 
intelligence across 17 subtasks, showing most models lag human baselines especially in transformation and configuration.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.17187","ref_index":141,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"PluRule: A Benchmark for Moderating Pluralistic Communities on Social Media","primary_cat":"cs.CL","submitted_at":"2026-05-16T22:52:11+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"PluRule is a new multimodal multilingual benchmark showing that state-of-the-art vision-language models perform only marginally better than a trivial baseline at detecting specific rule violations in pluralistic online communities.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.09090","ref_index":16,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Investigating Anisotropy in Visual Grounding under Controlled Counterfactual Perturbations","primary_cat":"cs.CV","submitted_at":"2026-05-09T17:54:31+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Controlled counterfactual perturbations reveal no correlation between embedding cosine similarity and approximation behavior in two visual grounding models.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Visual Grounding (VG), also referred to as Referring Expression Comprehension (REC), requires a model to localize an object in an image given a text description. Recent Transformer-based architectures have achieved strong performance on standard benchmarks such as RefCOCO [38], RefCOCO+ [38], RefCOCOg [20], and Flickr30K Entities [27]. 
Modern vision-language models [16] typically rely on large-scale pretrained models both for vision [1, 25] and language [4, 23] to operate in a shared multimodal embedding space. Despite this progress, current evaluation protocols rely on a strong assumption: the object described in the referring expression is always present in the image. As a consequence, the ability of models to handle counterfactual or"},{"citing_arxiv_id":"2604.19083","ref_index":152,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"ProjLens: Unveiling the Role of Projectors in Multimodal Model Safety","primary_cat":"cs.CR","submitted_at":"2026-04-21T04:52:38+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"ProjLens shows that backdoor parameters in MLLMs are encoded in low-rank subspaces of the projector and that embeddings shift toward the target direction with magnitude linear in input norm, activating only on poisoned samples.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.11589","ref_index":32,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"MLLM-as-a-Judge Exhibits Model Preference Bias","primary_cat":"cs.CV","submitted_at":"2026-04-13T15:04:40+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"MLLMs show self-preference bias and family-level mutual bias when judging captions; Philautia-Eval quantifies it and Pomms ensemble reduces it.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.09532","ref_index":31,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Seeing is Believing: Robust Vision-Guided Cross-Modal Prompt Learning under Label 
Noise","primary_cat":"cs.CV","submitted_at":"2026-04-10T17:48:56+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"VisPrompt improves prompt learning robustness under label noise by injecting instance-level visual semantics via attention and adaptive modulation while freezing the VLM backbone.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.03066","ref_index":163,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Redefining End-of-Life: Intelligent Automation for Electronics Remanufacturing Systems","primary_cat":"eess.SY","submitted_at":"2026-04-03T14:40:24+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":2.0,"formal_verification":"none","one_line_summary":"A literature review of intelligent automation approaches using robotics, AI, and control for disassembly, inspection, sorting, and reprocessing of end-of-life electronics.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"as interfaces that connect human intent, process knowledge, and automated planning in remanufacturing scenarios. 2) Vision-Language Model (VLM): VLMs are a class of multimodal foundation models designed to jointly process visual and textual information [161], [162]. With the emergence of numerous studies in recent years, the architecture of VLMs has undergone multiple evolutions [163]. Early works, such as CLIP [164] and BLIP [165], employ a vision encoder and a text encoder for pretraining via contrastive learning, which pulls paired images and texts closer while pushing unpaired examples farther apart in the embedding space. 
Later approaches, such as LLaVA [166] and Flamingo [167], replace the text encoder with an LLM for text understanding and"},{"citing_arxiv_id":"2511.23253","ref_index":24,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"AgroCoT: A Chain-of-Thought Benchmark for Evaluating Reasoning in Vision-Language Models for Agriculture","primary_cat":"cs.AI","submitted_at":"2025-11-28T15:02:19+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"AgroCoT is a new Chain-of-Thought VQA benchmark with 4759 samples to evaluate reasoning capabilities of vision-language models in agriculture.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Vision-Language Models (VLMs) that couple visual encoders with large language models have rapidly advanced general-purpose multimodal understanding [22], with CLIP [32] establishing scalable vision-language representations for strong zero-shot transfer. Proprietary models like GPT-4o [1] and Gemini [37] push multimodal reasoning at scale, while open-weight models such as DeepSeek [24] and Qwen-VL [3] lower the barrier for do- 2 Table 1. Comparison of AgriCoT with existing agriculture datasets or benchmark. CoT shows whether the dataset contains Chain-of-Thought reasoning. 
Abbreviations adopted: #Q for the number of questions; MCQ for Multiple Choice Questions; TFQ for True-or-False Questions; OEQ for Open-Ended Questions; LFQ for Long-"},{"citing_arxiv_id":"2511.18373","ref_index":31,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"MASS: Motion-Aware Spatial-Temporal Grounding for Physics Reasoning and Comprehension in Vision-Language Models","primary_cat":"cs.CV","submitted_at":"2025-11-23T09:43:44+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"MASS adds spatiotemporal motion signals and 3D grounding to VLMs and releases MASS-Bench, yielding physics-reasoning performance within 2% of Gemini-2.5-Flash after reinforcement fine-tuning.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2508.13073","ref_index":54,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Large VLM-based Vision-Language-Action Models for Robotic Manipulation: A Survey","primary_cat":"cs.RO","submitted_at":"2025-08-18T16:45:48+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"This survey organizes large VLM-based VLA models for robotic manipulation into monolithic and hierarchical paradigms, reviews their integrations and datasets, and outlines future directions.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Li, and W. X. Zhao, \"A survey of vision-language pre-trained models,\" in IJCAI, 2022. [52] J. Zhang, J. Huang, S. Jin, and S. Lu, \"Vision-language models for vision tasks: A survey,\" TPAMI, vol. 46, pp. 5625-5644, 2024. [53] J. Wu, W. Gan, Z. Chen, S. Wan, and P. S. Yu, \"Multimodal large language models: A survey,\" in BigData, 2023, pp. 2247-2256. [54] Z. Li, X. Wu, H. Du, F. Liu, H. Nghiem, and G. 
Shi, \"A survey of state of the art large vision language models: Alignment, benchmark, evaluations and challenges,\" arXiv:2501.02189, 2025. [55] D. Shu, H. Zhao, J. Hu, W. Liu, A. Payani, L. Cheng, and M. Du, \"Large vision-language model alignment and misalignment: A survey through the lens of explainability,\" arXiv:2501."},{"citing_arxiv_id":"2503.21460","ref_index":153,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Large Language Model Agent: A Survey on Methodology, Applications and Challenges","primary_cat":"cs.CL","submitted_at":"2025-03-27T12:50:17+00:00","verdict":"ACCEPT","verdict_confidence":"LOW","novelty_score":3.0,"formal_verification":"none","one_line_summary":"A survey that deconstructs LLM agent systems via a methodology-centered taxonomy linking design principles to emergent behaviors, applications, and challenges.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Multi-Agent System Benchmarking. TheAgentCompany [151] pioneered enterprise-level assessments using simulated software company environments to test web interaction and code collaboration capabilities. Comparative analysis like AutoGen and CrewAI [152] establishes methodological standards through ML code generation challenges. Large Visual Language Model Survey [153] systematizes over 200 multimodal benchmarks. For multi-agent collaboration, MLRB [154] designs 7 competition-level ML research tasks, and MLE-Bench [144] evaluates Kaggle-style model engineering through 71 real-world competitions. These efforts collectively establish rigorous evaluation protocols for emergent agent coordination capabilities. 
3.2 Tools"},{"citing_arxiv_id":"2411.18279","ref_index":58,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Large Language Model-Brained GUI Agents: A Survey","primary_cat":"cs.AI","submitted_at":"2024-11-27T12:13:39+00:00","verdict":"UNVERDICTED","verdict_confidence":"MODERATE","novelty_score":4.0,"formal_verification":"none","one_line_summary":"A survey consolidating frameworks, data practices, large action models, benchmarks, applications, and research gaps in LLM-brained GUI agents.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Zhang et al., [54] A survey on the memory of LLM-based agents. ✓ Shen [13] A survey of the tool usage in LLM agents. ✓ Chang et al., [55] A survey on evaluation of LLMs. ✓ Li et al., [56] A survey on benchmarks multimodal applications. ✓ Li et al., [57] A survey on benchmarking evaluations, applications, and challenges of visual LLMs. ✓ Huang and Zhang [58] A survey on evaluation of multimodal LLMs. ✓ Xie et al., [59] A survey on LLM based multimodal agent. ✓ ⃝ Durante et al., [60] A survey of multimodal interaction with AI agents. ✓ ⃝ Wu et al., [61] A survey of foundations and trend on multimodal mobile agents. ✓ ✓ Wang et al., [62] A survey on the integration of foundation models with GUI agents. ✓ ✓"}],"limit":50,"offset":0}