{"total":44,"items":[{"citing_arxiv_id":"2606.30288","ref_index":4,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"VisReflect: Latent Visual Reflection for Fine-Grained Perception in Long Visual Context","primary_cat":"cs.CV","submitted_at":"2026-06-29T13:30:17+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"VisReflect generates continuous latent visual reflections to emphasize relevant visual features and guide attention in LVLMs, yielding 4.1% gains on image benchmarks and 1.8% on video benchmarks with 44% less inference time than zooming methods.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.03345","ref_index":33,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Beyond Semantics: Modeling Factual and Affective Perceptual Experiences from Vision-Language Data","primary_cat":"cs.CV","submitted_at":"2026-06-02T08:54:59+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"PercepT discovers perceptual topic clusters from vision-language data via unsupervised training and maps images to them with attention pooling, reporting silhouette 0.97 and AUC 0.94 on ArtELingo.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.25194","ref_index":3,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Localization then Neutralization: Gradient-guided Token Suppression against Visual Prompt Injection Attack","primary_cat":"cs.LG","submitted_at":"2026-05-24T17:51:34+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Gradient Token Masking localizes critical adversarial image tokens via hidden-state gradient norms and masks them to neutralize prompt 
injection attacks in multimodal LLMs with one forward-backward pass.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.13113","ref_index":17,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Context Matters: Auditing Gender Bias in T2I Generation through Risk-Tiered Use-Case Profiles","primary_cat":"cs.CY","submitted_at":"2026-05-13T07:25:04+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"A new framework called THUMB cards organizes gender bias metrics for T2I models by risk-tiered use cases, measurement categories, and harm typologies aligned with the EU AI Act.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"[111] face gender prediction gender Encoder [111] total variation distance between gender distributions Shenet al. [93] face gender prediction gender classification femalev.s.male on proportions Liet al. [58] face gender prediction gender classification [82] largest relative deviation of genders from the uniform distribution D'Incàet al. [24] scene gender prediction gender VQA [62, 63] entropy of two gender probability distributions Chinchureet al. [17] face gender prediction gender VQA [16] MAD between the images from original and counterfactual prompts Wuet al. [109] scene embedding similar- ity embeddingsT2I model [86], en- coder [14, 42, 82, 97] similarity between neutral and genders on intermediate embed- dings during generation and image embeddings Wuet al. 
[109] scene downstream task objects T2I model [86], visual"},{"citing_arxiv_id":"2605.11716","ref_index":43,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"SafeSteer: A Decoding-level Defense Mechanism for Multimodal Large Language Models","primary_cat":"cs.AI","submitted_at":"2026-05-12T08:05:10+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SafeSteer improves safety in multimodal large language models by up to 33.4% via a decoding probe and modal alignment vector without any fine-tuning.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.10582","ref_index":24,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Guaranteed Jailbreaking Defense via Disrupt-and-Rectify Smoothing","primary_cat":"cs.CR","submitted_at":"2026-05-11T13:54:26+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"DR-Smoothing introduces a disrupt-then-rectify prompt processing scheme into smoothing defenses, delivering tight theoretical bounds on success probability against both token- and prompt-level jailbreaks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.10989","ref_index":69,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"SURGE: Surrogate Gradient Adaptation in Binary Neural Networks","primary_cat":"cs.LG","submitted_at":"2026-05-09T09:52:38+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SURGE proposes a dual-path gradient compensator and adaptive gradient scaler to mitigate gradient mismatch in binary neural network training via auxiliary 
backpropagation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.03927","ref_index":8,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"StateVLM: A State-Aware Vision-Language Model for Robotic Affordance Reasoning","primary_cat":"cs.CV","submitted_at":"2026-05-05T16:19:02+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"StateVLM uses an Auxiliary Regression Loss on box decoder outputs to boost VLMs' numerical accuracy for object and state localization, yielding 1.6% average gains on RefCOCO variants and 5.2% on the new OSAR benchmark.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.01391","ref_index":12,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"VISTA: Video Interaction Spatio-Temporal Analysis Benchmark","primary_cat":"cs.CV","submitted_at":"2026-05-02T11:28:20+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"VISTA is a new ~12K-pair benchmark and taxonomy for open-set multi-entity spatio-temporal understanding in VLMs that decomposes videos into entities, actions, and relational dynamics for multi-axis diagnostics.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.20366","ref_index":135,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Mitigating Hallucinations in Large Vision-Language Models without Performance Degradation","primary_cat":"cs.CV","submitted_at":"2026-04-22T09:02:17+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"MPD reduces hallucinations in LVLMs by 23.4% while retaining 97.4% of general capability through semantic disentanglement and 
selective parameter updates.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.11627","ref_index":8,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"POINTS-Long: Adaptive Dual-Mode Visual Reasoning in MLLMs","primary_cat":"cs.CV","submitted_at":"2026-04-13T15:38:22+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"POINTS-Long is a dual-mode multimodal large language model that uses dynamic visual token scaling to retain 97.7-99.7% accuracy on long-form tasks with 1/40 to 1/10th the tokens and supports streaming via detachable KV-cache.","context_count":1,"top_context_role":"method","top_context_polarity":"use_method","context_text":"theoretical efficiency doesn't translate to real-world performance, severely limiting their practical use. Visual Token Reduction in MLLMs. Some preliminary studies mainly focus on Vision Transformers [4, 36, 65] and KV cache compression [44, 76, 118] for LLMs. In the context of MLLMs, common methods like Q-Former [39], resampler [13] and pooling [8] are widely used during the training phase to reduce visual tokens. Recently, some studies tried to handle the token reduction problem in more delicate ways [1, 28-30, 68, 83, 91, 104]. 
In particular, training-free methods mainly leverage task-orientated attention importance [11, 49, 94, 116], or inherent visual redundancy [34, 84, 101], compromising efficiency"},{"citing_arxiv_id":"2604.10527","ref_index":14,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"STORM: End-to-End Referring Multi-Object Tracking in Videos","primary_cat":"cs.CV","submitted_at":"2026-04-12T08:43:28+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"STORM is an end-to-end MLLM for referring multi-object tracking that uses task-composition learning to leverage sub-task data and introduces the STORM-Bench dataset, achieving SOTA results.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.08014","ref_index":6,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Bridging Time and Space: Decoupled Spatio-Temporal Alignment for Video Grounding","primary_cat":"cs.CV","submitted_at":"2026-04-09T09:14:00+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Bridge-STG decouples spatio-temporal alignment via semantic bridging and query-guided localization modules to achieve state-of-the-art m_vIoU of 34.3 on VidSTG among MLLM methods.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"In light of this, we explore MLLMs to inject stronger semantic understanding into the STVG task, since their extensive pretrained knowledge allows for better interpretation of complex language and adaptation to open-world scenarios. 2.2 MLLMs for Grounding Recent advances in MLLMs [1, 3, 16, 17, 35, 63, 69] have yielded notable progress in visual grounding tasks. 
MiniGPT [6], LLaVA [65], Qwen3-VL [3] and InternVL [69] concentrate on spatial grounding within static images, wherein the model identifies objects refer- enced in textual input-typically by generating bounding box co- ordinates or selecting from region proposals. For video grounding, certain works based on the aforementioned MLLM [5, 7, 18, 19, 47] incorporate temporal grounding abilities, linking textual descrip-"},{"citing_arxiv_id":"2604.05649","ref_index":58,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Analogical Reasoning as a Doctor: A Foundation Model for Gastrointestinal Endoscopy Diagnosis","primary_cat":"cs.CV","submitted_at":"2026-04-07T09:54:10+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"RATNet applies analogical reasoning via a cyclic pre-training strategy to outperform prior foundation models in GI endoscopy diagnosis across diagnosis, few-shot, zero-shot, robustness, adaptation, and federated scenarios.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.05079","ref_index":3,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"SVAgent: Storyline-Guided Long Video Understanding via Cross-Modal Multi-Agent Collaboration","primary_cat":"cs.CV","submitted_at":"2026-04-06T18:30:50+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"SVAgent improves long video question answering by constructing storylines via multi-agent collaboration and aligning cross-modal predictions for more robust, human-like reasoning.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"checking, and targeted refinement to improve the reliabil- ity of long-video question answering. 
• Experiments on four long-video benchmarks show con- sistent 5.5%-11.5% gains over baselines, demonstrating the effectiveness and robustness of SV Agent. 2. Related Work 2.1. Video Multimodal Large Language Models Multimodal Large Language Models (MLLMs) [3, 18, 19, 37, 54, 63] have been extended from images to videos, giv- ing rise to Video MLLMs [7, 14, 15, 48]. Most approaches sample frames and encode them as visual tokens interleaved with text, providing a unified interface for image and video inputs [29, 58, 60]. However, in long videos, sparse evi- dence and long temporal spans make it difficult to preserve"},{"citing_arxiv_id":"2604.04579","ref_index":5,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Firebolt-VL: Efficient Vision-Language Understanding with Cross-Modality Modulation","primary_cat":"cs.CV","submitted_at":"2026-04-06T10:25:16+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Firebolt-VL introduces an LFM-based decoder and token-grid correlation to achieve linear-time vision-language inference with improved fine-grained grounding.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2512.03454","ref_index":5,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Think Before You Drive: World Model-Inspired Multimodal Grounding for Autonomous Vehicles","primary_cat":"cs.CV","submitted_at":"2025-12-03T05:14:16+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ThinkDeeper introduces a world-model-based reasoning step that predicts future spatial states to improve multimodal visual grounding for autonomous vehicles, achieving top results on Talk2Car and other 
benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2509.21976","ref_index":2,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Geo-R1: Improving Few-Shot Geospatial Referring Expression Understanding with Reinforcement Fine-Tuning","primary_cat":"cs.CV","submitted_at":"2025-09-26T07:01:12+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Geo-R1 uses reasoning-centric reinforcement fine-tuning to improve few-shot performance and generalization in geospatial referring expression understanding over supervised baselines.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2508.20325","ref_index":50,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"GUARD: Guideline Upholding Test through Adaptive Role-play and Jailbreak Diagnostics for LLMs","primary_cat":"cs.CL","submitted_at":"2025-08-28T00:07:10+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"GUARD automates generation of guideline-violating questions and jailbreak diagnostics to test LLM compliance with government ethics guidelines, validated empirically on eight models and extended to vision-language models.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2508.11011","ref_index":5,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Are Large Pre-trained Vision Language Models Effective Construction Safety 
Inspectors?","primary_cat":"cs.CV","submitted_at":"2025-08-14T18:23:09+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2503.21210","ref_index":28,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Toward Generalizable Forgery Detection and Reasoning","primary_cat":"cs.CV","submitted_at":"2025-03-27T06:54:06+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"FakeReasoning is an MLLM-based framework for unified forgery detection and reasoning on AI-generated images, supported by the new MMFR-Dataset of 120K images and 378K annotations across 10 generators.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2503.16549","ref_index":10,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"MathFlow: Enhancing the Perceptual Flow of MLLMs for Visual Mathematical Problems","primary_cat":"cs.CV","submitted_at":"2025-03-19T11:46:19+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"MathFlow decouples perception and inference stages in MLLMs for visual math, with a dedicated perception model delivering gains on the FlowVerse benchmark when paired with existing reasoners.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2501.05067","ref_index":12,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"LLaVA-Octopus: Unlocking Instruction-Driven Adaptive Projector Fusion for Video 
Understanding","primary_cat":"cs.CV","submitted_at":"2025-01-09T08:43:57+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"LLaVA-Octopus introduces instruction-driven adaptive fusion of multiple visual projectors in a multimodal LLM to improve video understanding performance.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2412.06224","ref_index":14,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Uni-NaVid: A Video-based Vision-Language-Action Model for Unifying Embodied Navigation Tasks","primary_cat":"cs.RO","submitted_at":"2024-12-09T05:55:55+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Uni-NaVid unifies diverse embodied navigation tasks into one video-based vision-language-action model trained on 3.6 million samples from four sub-tasks, achieving state-of-the-art performance on benchmarks and real-world tests.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"TABLE X: Ablation study on training strategy and archi- tecture. For each ablation type, we retrain the entire model and evaluate its performance across four navigation tasks. strategy and token merging designs (Tab. X). Our results indicate that the absence of <NAV> and VQA data leads to a performance decline across all tasks, similar findings can be found in [ 14, 111]. Notably, the performance drop is most obviously in EQA, as the lack of <NAV> special token makes the model misinterpret whether it should answer questions or output actions. Additionally, without VQA data, the agent's ability to answer questions drops significantly, almost rendering it incapable of correctly answering questions. 
We believe this"},{"citing_arxiv_id":"2410.17434","ref_index":6,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding","primary_cat":"cs.CV","submitted_at":"2024-10-22T21:21:37+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"LongVU adaptively compresses long video tokens using DINOv2-based frame deduplication, text-guided cross-modal selection, and temporal spatial reduction to improve video-language understanding in MLLMs with minimal detail loss.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2409.12514","ref_index":15,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"TinyVLA: Towards Fast, Data-Efficient Vision-Language-Action Models for Robotic Manipulation","primary_cat":"cs.RO","submitted_at":"2024-09-19T07:10:18+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"TinyVLA achieves faster inference and higher data efficiency than OpenVLA on robotic manipulation tasks by initializing from high-speed multimodal models and adding a diffusion policy decoder, without any pre-training phase.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2408.16213","ref_index":10,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"M4CXR: Exploring Multi-task Potentials of Multi-modal Large Language Models for Chest X-ray Interpretation","primary_cat":"cs.CV","submitted_at":"2024-08-29T02:12:58+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"M4CXR is a multi-modal large language model that performs multiple tasks in chest X-ray analysis including report generation with claimed SOTA 
clinical accuracy using chain-of-thought prompting.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2408.13257","ref_index":9,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"MME-RealWorld: Could Your Multimodal LLM Challenge High-Resolution Real-World Scenarios that are Difficult for Humans?","primary_cat":"cs.CV","submitted_at":"2024-08-23T17:59:51+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":8.0,"formal_verification":"none","one_line_summary":"MME-RealWorld is the largest manually annotated high-resolution benchmark for MLLMs, where even the best models achieve less than 60% accuracy on challenging real-world tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2406.14194","ref_index":31,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"VLBiasBench: A Comprehensive Benchmark for Evaluating Bias in Large Vision-Language Model","primary_cat":"cs.CV","submitted_at":"2024-06-20T10:56:59+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"VLBiasBench is a new large-scale benchmark with 128,342 samples covering nine social bias categories plus two intersectional ones to evaluate biases in LVLMs.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2406.09411","ref_index":8,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"MuirBench: A Comprehensive Benchmark for Robust Multi-image Understanding","primary_cat":"cs.CV","submitted_at":"2024-06-13T17:59:52+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":7.0,"formal_verification":"none","one_line_summary":"MuirBench is a new benchmark showing that top multimodal LLMs struggle with robust multi-image understanding, with GPT-4o at 68% and 
open-source models below 33% accuracy.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2406.07353","ref_index":82,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Toxic Memes: A Survey of Computational Perspectives on the Detection and Explanation of Meme Toxicities","primary_cat":"cs.CL","submitted_at":"2024-06-11T15:22:48+00:00","verdict":"ACCEPT","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A PRISMA-based survey of 158 computational works on toxic meme detection introduces a new toxicity taxonomy and a framework linking target, intent, and conveyance tactics while noting trends in LLMs and cross-modal methods.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"As such, several emerging trends and areas of research have not been thoroughly explored in existing surveys. These include the utilization of background knowledge and the emphasis on explainability in computational approaches [79, 80], the increasing use of LLMs for various tasks such as detecting hatefulness, misogyny, offensiveness, sarcasm, harmfulness, and specific harmful memes [79, 81], shifts towards more sophisticated evaluation methodologies [82, 83, 84, 85], novel approaches for generating toxic memes from benign prompts [86], and the emergence of new datasets, including GOAT-Bench and datasets in multiple languages beyond English [13, 87, 88]. 4. Methodology Figure 2: PRISMA 2020 flow diagram for systematic reviews on *SCOPUS and Web of Science (WOS) databases. 
Registers refers to SCOPUS preprints."},{"citing_arxiv_id":"2404.12390","ref_index":17,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"BLINK: Multimodal Large Language Models Can See but Not Perceive","primary_cat":"cs.CV","submitted_at":"2024-04-18T17:59:54+00:00","verdict":"ACCEPT","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"BLINK benchmark shows multimodal LLMs reach only 45-51 percent accuracy on core visual perception tasks where humans achieve 95 percent, indicating these abilities have not emerged.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2403.14624","ref_index":11,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"MathVerse: Does Your Multi-modal LLM Truly See the Diagrams in Visual Math Problems?","primary_cat":"cs.CV","submitted_at":"2024-03-21T17:59:50+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":7.0,"formal_verification":"none","one_line_summary":"MathVerse is a benchmark that tests multi-modal LLMs on visual math by providing each problem in six versions with progressively less diagram and text information to measure true visual understanding.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2402.03766","ref_index":8,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"MobileVLM V2: Faster and Stronger Baseline for Vision Language Model","primary_cat":"cs.CV","submitted_at":"2024-02-06T07:16:36+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"MobileVLM V2 shows that 1.7B and 3B parameter vision-language models can reach or exceed the performance of 3B and 7B+ models on common VLM benchmarks via targeted design and data 
improvements.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2402.00253","ref_index":4,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"A Survey on Hallucination in Large Vision-Language Models","primary_cat":"cs.CV","submitted_at":"2024-02-01T00:33:21+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":3.0,"formal_verification":"none","one_line_summary":"This survey reviews the definition, symptoms, evaluation benchmarks, root causes, and mitigation methods for hallucinations in large vision-language models.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2401.16420","ref_index":9,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"InternLM-XComposer2: Mastering Free-form Text-Image Composition and Comprehension in Vision-Language Large Model","primary_cat":"cs.CV","submitted_at":"2024-01-29T18:59:02+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"InternLM-XComposer2 introduces Partial LoRA on InternLM2-7B to enable high-quality free-form text-image composition while matching or exceeding GPT-4V on select vision-language benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2401.10935","ref_index":70,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"SeeClick: Harnessing GUI Grounding for Advanced Visual GUI Agents","primary_cat":"cs.HC","submitted_at":"2024-01-17T08:10:35+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SeeClick improves visual GUI agents via GUI grounding pre-training on automatically curated data and introduces the ScreenSpot benchmark, with results indicating that stronger grounding boosts downstream task 
performance.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2312.16886","ref_index":14,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"MobileVLM : A Fast, Strong and Open Vision Language Assistant for Mobile Devices","primary_cat":"cs.CV","submitted_at":"2023-12-28T08:21:24+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"MobileVLM achieves on-par performance with much larger vision-language models on standard benchmarks while delivering state-of-the-art inference speeds of 21.5 tokens per second on Snapdragon 888 CPU and 65.3 on Jetson Orin GPU.","context_count":1,"top_context_role":"baseline","top_context_polarity":"baseline","context_text":"InstructBLIP [30] Vicuna-13B 224 129M 1.2M 49.5 63.1 50.7 78.9 1212.8 - Shikra [15] Vicuna-13B 224 600K 5.5M - - - - - 58.8 mPLUG-Owl [126] LLaMA-7B 224 2.1M 102K - - - - 967.3 49.4 IDEFICS-9B [64] LLaMA-7B 224 353M 1M 38.4 - 25.9 - - 48.2 IDEFICS-80B [64] LLaMA-65B 224 353M 1M 45.2 - 30.9 - - 54.5 Qwen-VL [5] Qwen-7B 448 1.4B 50M 59.3 67.1 63.8 - 1487.6 38.2 MiniGPT-v2 [14] LLaMA-7B 448 23M 1M 60.3 - - - - 12.2 LLaV A-1.5 [74] Vicuna-7B 336 558K 665K 62.0 66.8 58.2 85.9 1510.7 64.3 MobileVLM 1.7B MobileLLaMA 1.4B 336 558K 665K 56.1 54.7 41.5 84.5 1196.2 53.2 MobileVLM 1.7B w/ LoRA MobileLLaMA 1.4B 336 558K 665K 57.0 53.1 42.3 86.0 1143.7 50.4 MobileVLM 3B MobileLLaMA 2.7B 336 558K 665K 59.0 61.0 47.5 84.9 1288.9 59.6"},{"citing_arxiv_id":"2311.17005","ref_index":8,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"MVBench: A Comprehensive Multi-modal Video Understanding Benchmark","primary_cat":"cs.CV","submitted_at":"2023-11-28T17:59:04+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":6.0,"formal_verification":"none","one_line_summary":"MVBench is a benchmark of 20 temporal video understanding tasks built by 
transforming static tasks into dynamic ones, with VideoChat2 outperforming prior MLLMs by over 15%.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2311.12793","ref_index":5,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"ShareGPT4V: Improving Large Multi-Modal Models with Better Captions","primary_cat":"cs.CV","submitted_at":"2023-11-21T18:58:11+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A new 1.2M-caption dataset generated via GPT-4V improves LMMs on MME and MMBench by 222.8/22.0/22.3 and 2.7/1.3/1.5 points respectively when used for supervised fine-tuning.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"notable instance is CLIP [45], which exemplifies the alignment of visual and textual modalities through contrastive learning on extensive image-text pairs. A series of works [26, 27] were improved upon CLIP by employing refined data strategies for more diverse data, they have been effective for basic visual tasks [28, 32, 59] but less so for complex tasks like visual question answering. MiniGPT-4 [5], leveraging an LLM [8] and a visual encoder [14], has shown proficiency in image-text dialogues through pre-training alignment and instruction fine-tuning. Subsequent research [3, 6, 10, 25, 31, 43, 57] has further enhanced LMMs by focusing on the quality and diversity of pretraining and fine-tuning data. 
For instance, LLaV A [31] and InstructBLIP"},{"citing_arxiv_id":"2311.07575","ref_index":3,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"SPHINX: The Joint Mixing of Weights, Tasks, and Visual Embeddings for Multi-modal Large Language Models","primary_cat":"cs.CV","submitted_at":"2023-11-13T18:59:47+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"SPHINX improves multi-modal LLMs through joint mixing of weights, tasks, and visual embeddings from varied sources to achieve stronger alignment and multi-purpose capabilities.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2311.07397","ref_index":2,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"AMBER: An LLM-free Multi-dimensional Benchmark for MLLMs Hallucination Evaluation","primary_cat":"cs.CL","submitted_at":"2023-11-13T15:25:42+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"AMBER is an LLM-free multi-dimensional benchmark for evaluating hallucinations in MLLMs across generative and discriminative tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2309.15112","ref_index":9,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"InternLM-XComposer: A Vision-Language Large Model for Advanced Text-image Comprehension and Composition","primary_cat":"cs.CV","submitted_at":"2023-09-26T17:58:20+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":4.0,"formal_verification":"none","one_line_summary":"InternLM-XComposer generates articles with seamlessly integrated images and achieves state-of-the-art results on vision-language benchmarks including MME, MMBench, and 
Seed-Bench.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"ral Information Processing Systems (NeurIPS), 33:1877-1901, 2020. 2, 3 [8] Soravit Changpinyo, Piyush Sharma, Nan Ding, and Radu Soricut. Conceptual 12m: Pushing web-scale image-text pre-training to recognize long-tail visual concepts. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3558-3568, 2021. 2, 4 [9] Jun Chen, Deyao Zhu, Xiaoqian Shen, Xiang Li, Zechu Liu, Pengchuan Zhang, Raghuraman Krishnamoorthi, Vikas Chandra, Yunyang Xiong, and Mohamed Elhoseiny. Minigpt-v2: large language model as a unified interface for vision-language multi-task learning. arXiv preprint arXiv:2310.09478, 2023. 3 [10] Keqin Chen, Zhao Zhang, Weili Zeng, Richong Zhang, Feng Zhu, and Rui Zhao."},{"citing_arxiv_id":"2306.14565","ref_index":4,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning","primary_cat":"cs.CV","submitted_at":"2023-06-26T10:26:33+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A new dataset of 400k visual instructions including negative examples at three semantic levels reduces hallucinations in models like MiniGPT-4 when used for fine-tuning while improving benchmark performance.","context_count":1,"top_context_role":"baseline","top_context_polarity":"baseline","context_text":"knowledge level hallucination or images that are not from the Visual Genome dataset, we use the groundtruth answers as a reference and compare them with predictions (Fig. 7 in the appendix). 6 EXPERIMENT 6.1 IMPLEMENTATION SETUP Baselines. We evaluate the zero-shot performance of 5 recently released LMMs: (1) MiniGPT4; (2) MiniGPTv2; (3) InstructBLIP; (4) Multimodal-GPT (MMGPT); (5) mPLUG-Owl; (6) LLaVA; (7) LLaVA 1.5. 
All models above have been tuned on their collected visual instruction data. Training Details. As for MiniGPT4, we initialize from its checkpoint of the first pretraining stage. Then we instruct-tune the model on LRV-Instruction with the linear projection layer as the only learnable module."}],"limit":50,"offset":0}