{"total":13,"items":[{"citing_arxiv_id":"2605.20942","ref_index":15,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Bridging Structure and Language: Graph-Based Visual Reasoning for Autonomous Road Understanding","primary_cat":"cs.CV","submitted_at":"2026-05-20T09:28:06+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A graph-grounded Combined Road Substrate framework generates traceable QA pairs from road maps to improve small VLMs on compositional road reasoning tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.21541","ref_index":2,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Frequency-Domain Regularized Adversarial Alignment for Transferable Attacks against Closed-Source MLLMs","primary_cat":"cs.CR","submitted_at":"2026-05-20T08:15:56+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"FRA-Attack uses high-pass DCT feature alignment and frequency-domain gradient regularization to boost adversarial transferability across 15 MLLMs from 7 vendors.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.20525","ref_index":32,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"NeuroQA: A Large-Scale Image-Grounded Benchmark for 3D Brain MRI Understanding","primary_cat":"cs.CV","submitted_at":"2026-05-19T21:54:12+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":8.0,"formal_verification":"none","one_line_summary":"NeuroQA is a large-scale 3D brain MRI visual question answering benchmark with verified image-grounded QA pairs, multi-domain coverage, and baseline evaluations showing current models lag behind text-only 
performance.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.19506","ref_index":2,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"EventPrune: Cascaded Event-Assisted Token Pruning for Efficient First-Person Dynamic Spatial Reasoning","primary_cat":"cs.CV","submitted_at":"2026-05-19T08:01:38+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"EventPrune prunes 80% of visual tokens in Video-LLMs using event camera motion cues, yielding 1.89x speedup, 52% fewer GFLOPs, and slightly higher accuracy than full-token baselines on first-person dynamic spatial reasoning.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.12960","ref_index":38,"ref_count":2,"confidence":0.55,"is_internal_anchor":false,"paper_title":"DiM\\textsuperscript{3}: Bridging Multilingual and Multimodal Models via Direction- and Magnitude-Aware Merging","primary_cat":"cs.CL","submitted_at":"2026-05-13T03:50:54+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":6.0,"formal_verification":"none","one_line_summary":"DiM3 is a direction- and magnitude-aware merging method that composes heterogeneous multilingual and multimodal updates in LLM backbones, outperforming baselines on 57-language benchmarks while retaining multimodal performance.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.09948","ref_index":7,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"LoopVLA: Learning Sufficiency in Recurrent Refinement for Vision-Language-Action Models","primary_cat":"cs.AI","submitted_at":"2026-05-11T03:51:22+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"LoopVLA adds recurrent 
refinement and learned sufficiency estimation to VLA models, cutting parameters 45% and raising throughput 1.7x while matching baseline task success on LIBERO and VLA-Arena.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"models from natural language supervision. In International conference on machine learning, pages 8748-8763. PMLR, 2021. [6] Xi Chen, Xiao Wang, Soravit Changpinyo, Anthony J Piergiovanni, Piotr Padlewski, Daniel Salz, Sebastian Goodman, Adam Grycner, Basil Mustafa, Lucas Beyer, et al. Pali: A jointly-scaled multilingual language-image model. arXiv preprint arXiv:2209.06794, 2022. [7] Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning. Advances in neural information processing systems, 35:23716-23736, 2022. [8] Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image"},{"citing_arxiv_id":"2605.08985","ref_index":1,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"LLaVA-UHD v4: What Makes Efficient Visual Encoding in MLLMs?","primary_cat":"cs.CV","submitted_at":"2026-05-09T15:10:26+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"LLaVA-UHD v4 reduces visual-encoding FLOPs by 55.8% for high-resolution images in MLLMs via slice-based encoding plus intra-ViT early compression while matching or exceeding baseline performance on document, OCR, and VQA benchmarks.","context_count":1,"top_context_role":"method","top_context_polarity":"use_method","context_text":"compression rate, exploring dynamic, content-aware token reduction mechanisms within the encoder remains an exciting direction for future research.
Together, these results suggest that aggressive token reduction can be performed inside the vision encoder without sacrificing fine-grained perception, offering a practical path toward more scalable multimodal foundation models. References [1] Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning. Advances in neural information processing systems, 35:23716-23736, 2022. [2] Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and"},{"citing_arxiv_id":"2605.07308","ref_index":2,"ref_count":2,"confidence":0.55,"is_internal_anchor":false,"paper_title":"AT-VLA: Adaptive Tactile Injection for Enhanced Feedback Reaction in Vision-Language-Action Models","primary_cat":"cs.RO","submitted_at":"2026-05-08T06:17:08+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"AT-VLA proposes adaptive tactile injection and a dual-stream tactile reaction mechanism to enhance VLA models for contact-rich robotic manipulation with real-time responses.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"This project was supported by National Youth Talent Support Program (8200800081) and National Natural Science Foundation of China (62376006). References [1] Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2023. [2] Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al.
Flamingo: a visual language model for few-shot learning. Advances in neural information processing systems, 35:23716-23736, 2022. [3] Lucas Beyer, Andreas Steiner, André Susano Pinto, Alexan-"},{"citing_arxiv_id":"2604.08846","ref_index":2,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Dictionary-Aligned Concept Control for Safeguarding Multimodal LLMs","primary_cat":"cs.LG","submitted_at":"2026-04-10T01:01:56+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"DACO curates a 15,000-concept dictionary from 400K image-caption pairs and uses it to initialize an SAE that enables granular, concept-specific steering of MLLM activations, raising safety scores on MM-SafetyBench and JailBreakV while preserving general capabilities.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.01833","ref_index":3,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Language-Pretraining-Induced Bias: A Strong Foundation for General Vision Tasks","primary_cat":"cs.CV","submitted_at":"2026-04-02T09:46:20+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Random label bridge training aligns LLM parameters with vision tasks, and partial training of certain layers often suffices due to their foundational properties.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"alignment [40, 60], test-time training [48, 51], and self-supervised learning [18, 15, 8]. While effective for tasks such as natural-to-synthetic image transfer, these methods are less suited for bridging different modalities. Cross-modality adaptation, such as between language and vision, introduces challenges due to structural and distributional differences.
Vision-language models (VLMs) like CLIP [44] and Flamingo [3], and Multimodal LLMs [9, 29], leverage paired datasets to learn joint representations but rely heavily on aligned supervision and task-specific designs. In contrast, our work explores inherent cross-modality capabilities of LLMs, leveraging pretraining-induced biases for annotation-free, flexible adaptation without architectural changes. Modality Difference in Pretraining."},{"citing_arxiv_id":"2601.06803","ref_index":8,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Forest Before Trees: Latent Superposition for Efficient Visual Reasoning","primary_cat":"cs.CL","submitted_at":"2026-01-11T08:30:49+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Laser reformulates visual reasoning via Dynamic Windowed Alignment Learning to maintain latent superposition of global features, delivering 5.03% average gains over Monet and over 97% fewer inference tokens on six benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2507.01925","ref_index":118,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"A Survey on Vision-Language-Action Models: An Action Tokenization Perspective","primary_cat":"cs.RO","submitted_at":"2025-07-02T17:34:52+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"The survey frames VLA models as pipelines that generate progressively grounded action tokens and classifies those tokens into eight types to guide future development.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"inherently compatible with interleaved visual and textual sequences, thereby enabling strong few-shot learning capabilities. 
LLaVA [83] represents a milestone in the development of VLM architecture, which simply links a CLIP vision encoder to the Vicuna LLM [117] via a linear projection and is trained on visual instruction-tuning data synthesized by GPT-4. LLaVA-1.5 [118] improves upon LLaVA by adopting a stronger vision encoder, replacing the linear projection with an MLP, and training on a larger dataset. The Qwen-VL family represents another prominent line of work. The initial Qwen-VL [119] combines the Qwen-7B LLM [120] with a ViT through a position-aware cross-attention adaptor. Its specially designed input-output"},{"citing_arxiv_id":"2502.04326","ref_index":1,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"WorldSense: Evaluating Real-world Omnimodal Understanding for Multimodal LLMs","primary_cat":"cs.CV","submitted_at":"2025-02-06T18:59:40+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"WorldSense provides the first benchmark requiring synergistic audio-video-text understanding on 1,662 real-world videos and 3,172 QA pairs, where the best current multimodal LLM reaches only 65.1% accuracy.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}