MME-RealWorld is the largest manually annotated high-resolution benchmark for MLLMs, where even the best models achieve less than 60% accuracy on challenging real-world tasks.
MiniGPT-v2: large language model as a unified interface for vision-language multi-task learning
Canonical reference. 70% of citing Pith papers cite this work as background.
abstract
Large language models have shown their remarkable capabilities as a general interface for various language-related applications. Motivated by this, we target to build a unified interface for completing many vision-language tasks including image description, visual question answering, and visual grounding, among others. The challenge is to use a single model for performing diverse vision-language tasks effectively with simple multi-modal instructions. Towards this objective, we introduce MiniGPT-v2, a model that can be treated as a unified interface for better handling various vision-language tasks. We propose using unique identifiers for different tasks when training the model. These identifiers enable our model to better distinguish each task instruction effortlessly and also improve the model learning efficiency for each task. After the three-stage training, the experimental results show that MiniGPT-v2 achieves strong performance on many visual question-answering and visual grounding benchmarks compared to other vision-language generalist models. Our model and codes are available at https://minigpt-v2.github.io/
representative citing papers
GeMoE adaptively sets the number of experts per token via gating entropy, retaining 99.5% of static-routing performance while raising average sparsity by 36.5%.
A new framework called THUMB cards organizes gender bias metrics for T2I models by risk-tiered use cases, measurement categories, and harm typologies aligned with the EU AI Act.
STORM is an end-to-end MLLM for referring multi-object tracking that uses task-composition learning to leverage sub-task data and introduces the STORM-Bench dataset, achieving SOTA results.
Bridge-STG decouples spatio-temporal alignment via semantic bridging and query-guided localization modules to achieve state-of-the-art m_vIoU of 34.3 on VidSTG among MLLM methods.
SVAgent improves long video question answering by constructing storylines via multi-agent collaboration and aligning cross-modal predictions for more robust, human-like reasoning.
FakeReasoning is an MLLM-based framework for unified forgery detection and reasoning on AI-generated images, supported by the new MMFR-Dataset of 120K images and 378K annotations across 10 generators.
MuirBench is a new benchmark showing that top multimodal LLMs struggle with robust multi-image understanding, with GPT-4o at 68% and open-source models below 33% accuracy.
MathVerse is a benchmark that tests multi-modal LLMs on visual math by providing each problem in six versions with progressively less diagram and text information to measure true visual understanding.
GeoSearcher introduces anchor-centric reasoning supervised fine-tuning and process-faithful group relative policy optimization to improve MLLM-based remote sensing visual grounding.
ReShift is a reasoning-level backdoor framework for VLMs that uses poisoned data construction and joint optimization to shift CoT trajectories on trigger while preserving surface coherence.
VisReflect generates continuous latent visual reflections to emphasize relevant visual features and guide attention in LVLMs, yielding 4.1% gains on image benchmarks and 1.8% on video benchmarks with 44% less inference time than zooming methods.
GAVEL introduces a joint task, dataset, and benchmark for verifying, explaining, and localizing caption-image misalignments, with a supervised baseline that improves grounding and explanation metrics over strong closed-source models.
OPPO applies RL with an Omni-Perception Reward and masked-input KL loss to boost cue utilization and suppress hallucinations in emotion reasoning MLLMs, claiming SOTA results on MER-UniBench, MME-Emotion, and MEP-Bench.
GPS framework adds self-guided reasoning modules to lightweight VLMs for fine-grained action understanding, claiming performance near GPT-4o with better factual accuracy on a custom CAP-based dataset.
SafeSteer improves safety in multimodal large language models by up to 33.4% via a decoding probe and modal alignment vector without any fine-tuning.
DR-Smoothing introduces a disrupt-then-rectify prompt processing scheme into smoothing defenses, delivering tight theoretical bounds on success probability against both token- and prompt-level jailbreaks.
SURGE proposes a dual-path gradient compensator and adaptive gradient scaler to mitigate gradient mismatch in binary neural network training via auxiliary backpropagation.
VISTA is a new ~12K-pair benchmark and taxonomy for open-set multi-entity spatio-temporal understanding in VLMs that decomposes videos into entities, actions, and relational dynamics for multi-axis diagnostics.
POINTS-Long is a dual-mode multimodal large language model that uses dynamic visual token scaling to retain 97.7-99.7% accuracy on long-form tasks with 1/40 to 1/10 of the tokens, and supports streaming via a detachable KV-cache.
ThinkDeeper introduces a world-model-based reasoning step that predicts future spatial states to improve multimodal visual grounding for autonomous vehicles, achieving top results on Talk2Car and other benchmarks.
MathFlow decouples perception and inference stages in MLLMs for visual math, with a dedicated perception model delivering gains on the FlowVerse benchmark when paired with existing reasoners.
Uni-NaVid unifies diverse embodied navigation tasks into one video-based vision-language-action model trained on 3.6 million samples from four sub-tasks, achieving state-of-the-art performance on benchmarks and real-world tests.
LongVU adaptively compresses long video tokens using DINOv2-based frame deduplication, text-guided cross-modal selection, and temporal spatial reduction to improve video-language understanding in MLLMs with minimal detail loss.
citing papers explorer
- GAVEL: Grounded Caption Error Verification and Localization
  GAVEL introduces a joint task, dataset, and benchmark for verifying, explaining, and localizing caption-image misalignments, with a supervised baseline that improves grounding and explanation metrics over strong closed-source models.
- Toxic Memes: A Survey of Computational Perspectives on the Detection and Explanation of Meme Toxicities
  A PRISMA-based survey of 158 computational works on toxic meme detection introduces a new toxicity taxonomy and a framework linking target, intent, and conveyance tactics while noting trends in LLMs and cross-modal methods.
- AMBER: An LLM-free Multi-dimensional Benchmark for MLLMs Hallucination Evaluation
  AMBER is an LLM-free multi-dimensional benchmark for evaluating hallucinations in MLLMs across generative and discriminative tasks.
- GUARD: Guideline Upholding Test through Adaptive Role-play and Jailbreak Diagnostics for LLMs
  GUARD automates generation of guideline-violating questions and jailbreak diagnostics to test LLM compliance with government ethics guidelines, validated empirically on eight models and extended to vision-language models.