hub Canonical reference

MiniGPT-v2: large language model as a unified interface for vision-language multi-task learning

Jun Chen, Deyao Zhu, Xiaoqian Shen, Xiang Li, Zechun Liu, Pengchuan Zhang · 2023 · cs.CV · arXiv 2310.09478

Canonical reference. 70% of citing Pith papers cite this work as background.

50 Pith papers citing it

Background 70% of classified citations

open full Pith review browse 50 citing papers arXiv PDF

abstract

Large language models have shown their remarkable capabilities as a general interface for various language-related applications. Motivated by this, we target to build a unified interface for completing many vision-language tasks including image description, visual question answering, and visual grounding, among others. The challenge is to use a single model for performing diverse vision-language tasks effectively with simple multi-modal instructions. Towards this objective, we introduce MiniGPT-v2, a model that can be treated as a unified interface for better handling various vision-language tasks. We propose using unique identifiers for different tasks when training the model. These identifiers enable our model to better distinguish each task instruction effortlessly and also improve the model learning efficiency for each task. After the three-stage training, the experimental results show that MiniGPT-v2 achieves strong performance on many visual question-answering and visual grounding benchmarks compared to other vision-language generalist models. Our model and codes are available at https://minigpt-v2.github.io/

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 7 baseline 2 method 1

citation-polarity summary

background 7 baseline 2 use method 1

representative citing papers

MME-RealWorld: Could Your Multimodal LLM Challenge High-Resolution Real-World Scenarios that are Difficult for Humans?

cs.CV · 2024-08-23 · conditional · novelty 8.0

MME-RealWorld is the largest manually annotated high-resolution benchmark for MLLMs, where even the best models achieve less than 60% accuracy on challenging real-world tasks.

GeMoE: Gating Entropy is All You Need for Uncertainty-aware Adaptive Routing in MoE-based Large Vision-Language Models

cs.CV · 2026-06-24 · unverdicted · novelty 7.0

GeMoE adaptively sets the number of experts per token via gating entropy, retaining 99.5% of static-routing performance while raising average sparsity by 36.5%.

Context Matters: Auditing Gender Bias in T2I Generation through Risk-Tiered Use-Case Profiles

cs.CY · 2026-05-13 · unverdicted · novelty 7.0

A new framework called THUMB cards organizes gender bias metrics for T2I models by risk-tiered use cases, measurement categories, and harm typologies aligned with the EU AI Act.

STORM: End-to-End Referring Multi-Object Tracking in Videos

cs.CV · 2026-04-12 · unverdicted · novelty 7.0

STORM is an end-to-end MLLM for referring multi-object tracking that uses task-composition learning to leverage sub-task data and introduces the STORM-Bench dataset, achieving SOTA results.

Bridging Time and Space: Decoupled Spatio-Temporal Alignment for Video Grounding

cs.CV · 2026-04-09 · unverdicted · novelty 7.0

Bridge-STG decouples spatio-temporal alignment via semantic bridging and query-guided localization modules to achieve state-of-the-art m_vIoU of 34.3 on VidSTG among MLLM methods.

SVAgent: Storyline-Guided Long Video Understanding via Cross-Modal Multi-Agent Collaboration

cs.CV · 2026-04-06 · unverdicted · novelty 7.0

SVAgent improves long video question answering by constructing storylines via multi-agent collaboration and aligning cross-modal predictions for more robust, human-like reasoning.

Toward Generalizable Forgery Detection and Reasoning

cs.CV · 2025-03-27 · unverdicted · novelty 7.0

FakeReasoning is an MLLM-based framework for unified forgery detection and reasoning on AI-generated images, supported by the new MMFR-Dataset of 120K images and 378K annotations across 10 generators.

MuirBench: A Comprehensive Benchmark for Robust Multi-image Understanding

cs.CV · 2024-06-13 · conditional · novelty 7.0

MuirBench is a new benchmark showing that top multimodal LLMs struggle with robust multi-image understanding, with GPT-4o at 68% and open-source models below 33% accuracy.

MathVerse: Does Your Multi-modal LLM Truly See the Diagrams in Visual Math Problems?

cs.CV · 2024-03-21 · conditional · novelty 7.0

MathVerse is a benchmark that tests multi-modal LLMs on visual math by providing each problem in six versions with progressively less diagram and text information to measure true visual understanding.

GeoSearcher: Anchor-Guided Progressive Reasoning for Remote Sensing Visual Grounding with Process Supervision

cs.CV · 2026-07-01 · unverdicted · novelty 6.0

GeoSearcher introduces anchor-centric reasoning supervised fine-tuning and process-faithful group relative policy optimization to improve MLLM-based remote sensing visual grounding.

ReShift: Aha-Moment-Driven Reasoning-Level Backdoor Attacks on Vision-Language Models

cs.CR · 2026-07-01 · unverdicted · novelty 6.0

ReShift is a reasoning-level backdoor framework for VLMs that uses poisoned data construction and joint optimization to shift CoT trajectories on trigger while preserving surface coherence.

VisReflect: Latent Visual Reflection for Fine-Grained Perception in Long Visual Context

cs.CV · 2026-06-29 · unverdicted · novelty 6.0

VisReflect generates continuous latent visual reflections to emphasize relevant visual features and guide attention in LVLMs, yielding 4.1% gains on image benchmarks and 1.8% on video benchmarks with 44% less inference time than zooming methods.

GAVEL: Grounded Caption Error Verification and Localization

cs.CL · 2026-06-25 · unverdicted · novelty 6.0

GAVEL introduces a joint task, dataset, and benchmark for verifying, explaining, and localizing caption-image misalignments, with a supervised baseline that improves grounding and explanation metrics over strong closed-source models.

Omni-Perception Policy Optimization for Multimodal Emotion Reasoning

cs.AI · 2026-06-24 · unverdicted · novelty 6.0

OPPO applies RL with an Omni-Perception Reward and masked-input KL loss to boost cue utilization and suppress hallucinations in emotion reasoning MLLMs, claiming SOTA results on MER-UniBench, MME-Emotion, and MEP-Bench.

Gold Points Sniper: Self-guided Visual Reasoning in VLM for Fine-grained Action Understanding

cs.CV · 2026-06-21 · unverdicted · novelty 6.0

GPS framework adds self-guided reasoning modules to lightweight VLMs for fine-grained action understanding, claiming performance near GPT-4o with better factual accuracy on a custom CAP-based dataset.

SafeSteer: A Decoding-level Defense Mechanism for Multimodal Large Language Models

cs.AI · 2026-05-12 · unverdicted · novelty 6.0

SafeSteer improves safety in multimodal large language models by up to 33.4% via a decoding probe and modal alignment vector without any fine-tuning.

Guaranteed Jailbreaking Defense via Disrupt-and-Rectify Smoothing

cs.CR · 2026-05-11 · unverdicted · novelty 6.0

DR-Smoothing introduces a disrupt-then-rectify prompt processing scheme into smoothing defenses, delivering tight theoretical bounds on success probability against both token- and prompt-level jailbreaks.

SURGE: Surrogate Gradient Adaptation in Binary Neural Networks

cs.LG · 2026-05-09 · unverdicted · novelty 6.0 · 2 refs

SURGE proposes a dual-path gradient compensator and adaptive gradient scaler to mitigate gradient mismatch in binary neural network training via auxiliary backpropagation.

VISTA: Video Interaction Spatio-Temporal Analysis Benchmark

cs.CV · 2026-05-02 · unverdicted · novelty 6.0 · 2 refs

VISTA is a new ~12K-pair benchmark and taxonomy for open-set multi-entity spatio-temporal understanding in VLMs that decomposes videos into entities, actions, and relational dynamics for multi-axis diagnostics.

POINTS-Long: Adaptive Dual-Mode Visual Reasoning in MLLMs

cs.CV · 2026-04-13 · unverdicted · novelty 6.0

POINTS-Long is a dual-mode multimodal large language model that uses dynamic visual token scaling to retain 97.7-99.7% accuracy on long-form tasks with 1/40 to 1/10th the tokens and supports streaming via detachable KV-cache.

Think Before You Drive: World Model-Inspired Multimodal Grounding for Autonomous Vehicles

cs.CV · 2025-12-03 · unverdicted · novelty 6.0

ThinkDeeper introduces a world-model-based reasoning step that predicts future spatial states to improve multimodal visual grounding for autonomous vehicles, achieving top results on Talk2Car and other benchmarks.

MathFlow: Enhancing the Perceptual Flow of MLLMs for Visual Mathematical Problems

cs.CV · 2025-03-19 · unverdicted · novelty 6.0

MathFlow decouples perception and inference stages in MLLMs for visual math, with a dedicated perception model delivering gains on the FlowVerse benchmark when paired with existing reasoners.

Uni-NaVid: A Video-based Vision-Language-Action Model for Unifying Embodied Navigation Tasks

cs.RO · 2024-12-09 · unverdicted · novelty 6.0

Uni-NaVid unifies diverse embodied navigation tasks into one video-based vision-language-action model trained on 3.6 million samples from four sub-tasks, achieving state-of-the-art performance on benchmarks and real-world tests.

LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding

cs.CV · 2024-10-22 · unverdicted · novelty 6.0

LongVU adaptively compresses long video tokens using DINOv2-based frame deduplication, text-guided cross-modal selection, and temporal spatial reduction to improve video-language understanding in MLLMs with minimal detail loss.

citing papers explorer

Showing 4 of 4 citing papers after filters.

GAVEL: Grounded Caption Error Verification and Localization cs.CL · 2026-06-25 · unverdicted · none · ref 79 · internal anchor
GAVEL introduces a joint task, dataset, and benchmark for verifying, explaining, and localizing caption-image misalignments, with a supervised baseline that improves grounding and explanation metrics over strong closed-source models.
Toxic Memes: A Survey of Computational Perspectives on the Detection and Explanation of Meme Toxicities cs.CL · 2024-06-11 · accept · none · ref 82 · internal anchor
A PRISMA-based survey of 158 computational works on toxic meme detection introduces a new toxicity taxonomy and a framework linking target, intent, and conveyance tactics while noting trends in LLMs and cross-modal methods.
AMBER: An LLM-free Multi-dimensional Benchmark for MLLMs Hallucination Evaluation cs.CL · 2023-11-13 · unverdicted · none · ref 2 · internal anchor
AMBER is an LLM-free multi-dimensional benchmark for evaluating hallucinations in MLLMs across generative and discriminative tasks.
GUARD: Guideline Upholding Test through Adaptive Role-play and Jailbreak Diagnostics for LLMs cs.CL · 2025-08-28 · unverdicted · none · ref 50 · internal anchor
GUARD automates generation of guideline-violating questions and jailbreak diagnostics to test LLM compliance with government ethics guidelines, validated empirically on eight models and extended to vision-language models.

MiniGPT-v2: large language model as a unified interface for vision-language multi-task learning

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer