hub

Minigpt-v2: large language model as a unified interface for vision-language multi-task learning

Jun Chen, Deyao Zhu, Xiaoqian Shen, Xiang Li, Zechun Liu, Pengchuan Zhang, Raghuraman Krishnamoorthi, Vikas Chandra, Yunyang Xiong, Mohamed Elhoseiny · 2023 · arXiv 2310.09478

16 Pith papers cite this work. Polarity classification is still indexing.

16 Pith papers citing it

read on arXiv browse 16 citing papers

hub tools

JSON dossier citing papers JSON arXiv source

representative citing papers

VISTA: Video Interaction Spatio-Temporal Analysis Benchmark

cs.CV · 2026-05-02 · unverdicted · novelty 8.0

VISTA is the first large-scale interaction-aware benchmark that decomposes videos into entities, actions, and relations to diagnose spatio-temporal biases in vision-language models.

Context Matters: Auditing Gender Bias in T2I Generation through Risk-Tiered Use-Case Profiles

cs.CY · 2026-05-13 · unverdicted · novelty 7.0

A new framework called THUMB cards organizes gender bias metrics for T2I models by risk-tiered use cases, measurement categories, and harm typologies aligned with the EU AI Act.

STORM: End-to-End Referring Multi-Object Tracking in Videos

cs.CV · 2026-04-12 · unverdicted · novelty 7.0

STORM is an end-to-end MLLM for referring multi-object tracking that uses task-composition learning to leverage sub-task data and introduces the STORM-Bench dataset, achieving SOTA results.

Bridging Time and Space: Decoupled Spatio-Temporal Alignment for Video Grounding

cs.CV · 2026-04-09 · unverdicted · novelty 7.0

Bridge-STG decouples spatio-temporal alignment via semantic bridging and query-guided localization modules to achieve state-of-the-art m_vIoU of 34.3 on VidSTG among MLLM methods.

SVAgent: Storyline-Guided Long Video Understanding via Cross-Modal Multi-Agent Collaboration

cs.CV · 2026-04-06 · unverdicted · novelty 7.0

SVAgent improves long video question answering by constructing storylines via multi-agent collaboration and aligning cross-modal predictions for more robust, human-like reasoning.

SafeSteer: A Decoding-level Defense Mechanism for Multimodal Large Language Models

cs.AI · 2026-05-12 · unverdicted · novelty 6.0

SafeSteer improves safety in multimodal large language models by up to 33.4% via a decoding probe and modal alignment vector without any fine-tuning.

Guaranteed Jailbreaking Defense via Disrupt-and-Rectify Smoothing

cs.CR · 2026-05-11 · unverdicted · novelty 6.0

DR-Smoothing introduces a disrupt-then-rectify prompt processing scheme into smoothing defenses, delivering tight theoretical bounds on success probability against both token- and prompt-level jailbreaks.

SURGE: Surrogate Gradient Adaptation in Binary Neural Networks

cs.LG · 2026-05-09 · unverdicted · novelty 6.0

SURGE proposes a dual-path gradient compensator and adaptive scaler to learn better surrogate gradients for binary neural network training, outperforming prior methods on classification, detection, and language tasks.

POINTS-Long: Adaptive Dual-Mode Visual Reasoning in MLLMs

cs.CV · 2026-04-13 · unverdicted · novelty 6.0

POINTS-Long is a dual-mode multimodal large language model that uses dynamic visual token scaling to retain 97.7-99.7% accuracy on long-form tasks with 1/40 to 1/10th the tokens and supports streaming via detachable KV-cache.

ShareGPT4V: Improving Large Multi-Modal Models with Better Captions

cs.CV · 2023-11-21 · conditional · novelty 6.0

A new 1.2M-caption dataset generated via GPT-4V improves LMMs on MME and MMBench by 222.8/22.0/22.3 and 2.7/1.3/1.5 points respectively when used for supervised fine-tuning.

Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning

cs.CV · 2023-06-26 · accept · novelty 6.0

A new dataset of 400k visual instructions including negative examples at three semantic levels reduces hallucinations in models like MiniGPT-4 when used for fine-tuning while improving benchmark performance.

StateVLM: A State-Aware Vision-Language Model for Robotic Affordance Reasoning

cs.CV · 2026-05-05 · unverdicted · novelty 5.0

StateVLM uses an Auxiliary Regression Loss on box decoder outputs to boost VLMs' accuracy on object and state localization for robotic affordance reasoning, with gains of 1.6% on RefCOCO variants and 5.2% on the new OSAR benchmark.

Mitigating Hallucinations in Large Vision-Language Models without Performance Degradation

cs.CV · 2026-04-22 · unverdicted · novelty 5.0

MPD reduces hallucinations in LVLMs by 23.4% while retaining 97.4% of general capability through semantic disentanglement and selective parameter updates.

Analogical Reasoning as a Doctor: A Foundation Model for Gastrointestinal Endoscopy Diagnosis

cs.CV · 2026-04-07 · unverdicted · novelty 5.0

RATNet applies analogical reasoning via a cyclic pre-training strategy to outperform prior foundation models in GI endoscopy diagnosis across diagnosis, few-shot, zero-shot, robustness, adaptation, and federated scenarios.

Firebolt-VL: Efficient Vision-Language Understanding with Cross-Modality Modulation

cs.CV · 2026-04-06 · unverdicted · novelty 5.0

Firebolt-VL introduces an LFM-based decoder and token-grid correlation to achieve linear-time vision-language inference with improved fine-grained grounding.

A Survey on Hallucination in Large Vision-Language Models

cs.CV · 2024-02-01 · unverdicted · novelty 3.0

This survey reviews the definition, symptoms, evaluation benchmarks, root causes, and mitigation methods for hallucinations in large vision-language models.

citing papers explorer

Showing 16 of 16 citing papers.

VISTA: Video Interaction Spatio-Temporal Analysis Benchmark cs.CV · 2026-05-02 · unverdicted · none · ref 12
VISTA is the first large-scale interaction-aware benchmark that decomposes videos into entities, actions, and relations to diagnose spatio-temporal biases in vision-language models.
Context Matters: Auditing Gender Bias in T2I Generation through Risk-Tiered Use-Case Profiles cs.CY · 2026-05-13 · unverdicted · none · ref 17
A new framework called THUMB cards organizes gender bias metrics for T2I models by risk-tiered use cases, measurement categories, and harm typologies aligned with the EU AI Act.
STORM: End-to-End Referring Multi-Object Tracking in Videos cs.CV · 2026-04-12 · unverdicted · none · ref 14
STORM is an end-to-end MLLM for referring multi-object tracking that uses task-composition learning to leverage sub-task data and introduces the STORM-Bench dataset, achieving SOTA results.
Bridging Time and Space: Decoupled Spatio-Temporal Alignment for Video Grounding cs.CV · 2026-04-09 · unverdicted · none · ref 6
Bridge-STG decouples spatio-temporal alignment via semantic bridging and query-guided localization modules to achieve state-of-the-art m_vIoU of 34.3 on VidSTG among MLLM methods.
SVAgent: Storyline-Guided Long Video Understanding via Cross-Modal Multi-Agent Collaboration cs.CV · 2026-04-06 · unverdicted · none · ref 3
SVAgent improves long video question answering by constructing storylines via multi-agent collaboration and aligning cross-modal predictions for more robust, human-like reasoning.
SafeSteer: A Decoding-level Defense Mechanism for Multimodal Large Language Models cs.AI · 2026-05-12 · unverdicted · none · ref 43
SafeSteer improves safety in multimodal large language models by up to 33.4% via a decoding probe and modal alignment vector without any fine-tuning.
Guaranteed Jailbreaking Defense via Disrupt-and-Rectify Smoothing cs.CR · 2026-05-11 · unverdicted · none · ref 24
DR-Smoothing introduces a disrupt-then-rectify prompt processing scheme into smoothing defenses, delivering tight theoretical bounds on success probability against both token- and prompt-level jailbreaks.
SURGE: Surrogate Gradient Adaptation in Binary Neural Networks cs.LG · 2026-05-09 · unverdicted · none · ref 69
SURGE proposes a dual-path gradient compensator and adaptive scaler to learn better surrogate gradients for binary neural network training, outperforming prior methods on classification, detection, and language tasks.
POINTS-Long: Adaptive Dual-Mode Visual Reasoning in MLLMs cs.CV · 2026-04-13 · unverdicted · none · ref 8
POINTS-Long is a dual-mode multimodal large language model that uses dynamic visual token scaling to retain 97.7-99.7% accuracy on long-form tasks with 1/40 to 1/10th the tokens and supports streaming via detachable KV-cache.
ShareGPT4V: Improving Large Multi-Modal Models with Better Captions cs.CV · 2023-11-21 · conditional · none · ref 5
A new 1.2M-caption dataset generated via GPT-4V improves LMMs on MME and MMBench by 222.8/22.0/22.3 and 2.7/1.3/1.5 points respectively when used for supervised fine-tuning.
Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning cs.CV · 2023-06-26 · accept · none · ref 4
A new dataset of 400k visual instructions including negative examples at three semantic levels reduces hallucinations in models like MiniGPT-4 when used for fine-tuning while improving benchmark performance.
StateVLM: A State-Aware Vision-Language Model for Robotic Affordance Reasoning cs.CV · 2026-05-05 · unverdicted · none · ref 8
StateVLM uses an Auxiliary Regression Loss on box decoder outputs to boost VLMs' accuracy on object and state localization for robotic affordance reasoning, with gains of 1.6% on RefCOCO variants and 5.2% on the new OSAR benchmark.
Mitigating Hallucinations in Large Vision-Language Models without Performance Degradation cs.CV · 2026-04-22 · unverdicted · none · ref 135
MPD reduces hallucinations in LVLMs by 23.4% while retaining 97.4% of general capability through semantic disentanglement and selective parameter updates.
Analogical Reasoning as a Doctor: A Foundation Model for Gastrointestinal Endoscopy Diagnosis cs.CV · 2026-04-07 · unverdicted · none · ref 58
RATNet applies analogical reasoning via a cyclic pre-training strategy to outperform prior foundation models in GI endoscopy diagnosis across diagnosis, few-shot, zero-shot, robustness, adaptation, and federated scenarios.
Firebolt-VL: Efficient Vision-Language Understanding with Cross-Modality Modulation cs.CV · 2026-04-06 · unverdicted · none · ref 5
Firebolt-VL introduces an LFM-based decoder and token-grid correlation to achieve linear-time vision-language inference with improved fine-grained grounding.
A Survey on Hallucination in Large Vision-Language Models cs.CV · 2024-02-01 · unverdicted · none · ref 4
This survey reviews the definition, symptoms, evaluation benchmarks, root causes, and mitigation methods for hallucinations in large vision-language models.

Minigpt-v2: large language model as a unified interface for vision-language multi-task learning

hub tools

fields

years

verdicts

representative citing papers

citing papers explorer