pith. machine review for the scientific record.

arxiv: 2306.13394 · v5 · submitted 2023-06-23 · 💻 cs.CV


MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models

Caifeng Shan, Chaoyou Fu, Jinrui Yang, Ke Li, Mengdan Zhang, Peixian Chen, Ran He, Rongrong Ji, Xiawu Zheng, Xing Sun, Xu Lin, Yulei Qin, Yunhang Shen, Yunsheng Wu

Pith reviewed 2026-05-10 20:20 UTC · model grok-4.3

classification 💻 cs.CV
keywords multimodal large language models · evaluation benchmark · perception abilities · cognition abilities · instruction-answer pairs · model comparison

The pith

A new benchmark evaluates multimodal large language models on 14 perception and cognition subtasks using hand-designed questions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents a benchmark to test how well multimodal large language models handle both perception tasks such as recognizing objects in images and cognition tasks such as reasoning from visual input. It creates 14 subtasks with manually written instruction-answer pairs to avoid models simply remembering data from public sources. The short, fixed instructions let different models be compared directly without extra prompt tuning. When 30 current models are run through the benchmark, the results show clear shortfalls in many areas and suggest specific places where future models could be strengthened.
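To make the scoring scheme concrete, here is a minimal sketch of an evaluation loop in the style the paper describes, assuming the yes/no answer format MME uses (each image is paired with concise questions, and a subtask is scored by per-question accuracy plus a stricter per-image accuracy). The item fields, model callable, and file names are illustrative, not the authors' released code.

```python
from collections import defaultdict

# Hand-designed MME-style items: an image, a concise instruction, and a yes/no answer.
# Field names and file names here are illustrative, not the released data format.
ITEMS = [
    {"subtask": "existence", "image": "img_001.jpg",
     "instruction": "Is there a dog in this image? Please answer yes or no.", "answer": "yes"},
    {"subtask": "existence", "image": "img_001.jpg",
     "instruction": "Is there a cat in this image? Please answer yes or no.", "answer": "no"},
]

def parse_yes_no(response: str) -> str:
    """Map a free-form model response onto yes/no; anything else counts as wrong."""
    text = response.strip().lower()
    if text.startswith("yes"):
        return "yes"
    if text.startswith("no"):
        return "no"
    return "other"

def score_subtasks(items, model):
    """Per-subtask score = question-level accuracy + image-level 'both right' accuracy (each in %)."""
    per_question = defaultdict(list)                      # subtask -> [0/1 per question]
    per_image = defaultdict(lambda: defaultdict(list))    # subtask -> image -> [0/1 per question]
    for item in items:
        pred = parse_yes_no(model(item["image"], item["instruction"]))
        correct = int(pred == item["answer"])
        per_question[item["subtask"]].append(correct)
        per_image[item["subtask"]][item["image"]].append(correct)
    scores = {}
    for subtask, hits in per_question.items():
        acc = 100.0 * sum(hits) / len(hits)
        images = per_image[subtask].values()
        acc_plus = 100.0 * sum(all(h) for h in images) / len(images)
        scores[subtask] = acc + acc_plus                  # 0..200 per subtask
    return scores

# Usage: wrap any MLLM in a simple callable; a stub that always answers "yes" works for a dry run.
print(score_subtasks(ITEMS, lambda image, instruction: "yes"))
```

Because the instruction text is fixed and the answer space is binary, the same loop applies to every model with no per-model prompt tuning.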

Core claim

The central claim is that a benchmark built from 14 subtasks can measure both perception and cognition abilities in multimodal large language models, that manually designed instruction-answer pairs prevent data leakage while keeping comparisons fair, and that evaluations of 30 existing models demonstrate substantial remaining gaps along with concrete directions for improvement.

What carries the argument

The MME benchmark: 14 subtasks split between perception and cognition, each using concise, manually crafted instruction-answer pairs that support direct scoring without prompt engineering.

If this is right

  • Models can be ranked on specific perception and cognition skills without the results depending on how prompts are worded.
  • Weaknesses in particular subtasks become visible so optimization can target those gaps directly.
  • Quantitative scores across many models become possible, revealing patterns that case studies alone do not show.
  • Future model releases can be checked against the same fixed set of tasks for consistent progress tracking.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The benchmark could become a standard test set that new models are required to report on before publication.
  • Training pipelines might incorporate the 14 subtasks as additional supervision signals to close the observed gaps.
  • Similar hand-designed evaluation sets could be created for other multimodal domains such as video or audio.

Load-bearing premise

The hand-designed instruction-answer pairs are sufficient to block data leakage from existing public datasets, and the short, fixed instructions produce fair comparisons across models without any prompt tuning.
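If one wanted to probe the first half of this premise directly, a natural first pass is a text-overlap audit of the benchmark's questions against public VQA-style datasets, the kind of n-gram analysis the referee report below requests. A minimal sketch under the assumption that both question sets are available as plain strings; the function names, inputs, and threshold are illustrative and not from the paper.

```python
def ngrams(text: str, n: int = 3) -> set:
    """Lower-cased word n-grams of a question string."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def overlap_report(benchmark_questions, public_questions, n: int = 3, threshold: float = 0.25):
    """Flag benchmark questions whose n-grams substantially overlap any public-dataset question."""
    public_grams = [ngrams(q, n) for q in public_questions]
    flagged = []
    for bq in benchmark_questions:
        bg = ngrams(bq, n)
        if not bg:
            continue
        best = max((len(bg & pg) / len(bg) for pg in public_grams), default=0.0)
        if best >= threshold:
            flagged.append((bq, round(best, 2)))
    return flagged

# Hypothetical inputs: a hand-written MME-style question vs. questions from a public VQA split.
mme_questions = ["Is there a dog in this image? Please answer yes or no."]
vqa_questions = ["Is there a dog in the picture?", "What color is the bus?"]
print(overlap_report(mme_questions, vqa_questions))
```

A low flag rate would support the premise for text leakage; image-level overlap (the pictures themselves appearing in pretraining corpora) would need a separate check, and no text audit can settle that part.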

What would settle it

A model achieving significantly higher scores on the same subtasks when given different or longer instructions, or evidence that the test pairs appear in the training data of evaluated models.
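The instruction-sensitivity half of this test is straightforward to script: re-run the same items under paraphrased or longer instructions and check whether the model ordering moves. A minimal sketch using Spearman rank correlation over per-model scores; the score values below are invented placeholders, and scipy is assumed to be available.

```python
from scipy.stats import spearmanr

# Per-model subtask scores under the original concise instructions and under a
# paraphrased, longer variant. These numbers are placeholders for illustration only.
original  = {"model_a": 160.0, "model_b": 135.0, "model_c": 120.0, "model_d": 90.0}
rephrased = {"model_a": 150.0, "model_b": 140.0, "model_c": 110.0, "model_d": 95.0}

models = sorted(original)
rho, p_value = spearmanr([original[m] for m in models],
                         [rephrased[m] for m in models])

# A rank correlation near 1.0 supports the no-prompt-tuning fairness claim; a sharp drop in rho,
# or large per-model score jumps, would be exactly the kind of evidence that unsettles it.
print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")
print("per-model score change:", {m: rephrased[m] - original[m] for m in models})
```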

read the original abstract

Multimodal Large Language Model (MLLM) relies on the powerful LLM to perform multimodal tasks, showing amazing emergent abilities in recent studies, such as writing poems based on an image. However, it is difficult for these case studies to fully reflect the performance of MLLM, lacking a comprehensive evaluation. In this paper, we fill in this blank, presenting the first comprehensive MLLM Evaluation benchmark MME. It measures both perception and cognition abilities on a total of 14 subtasks. In order to avoid data leakage that may arise from direct use of public datasets for evaluation, the annotations of instruction-answer pairs are all manually designed. The concise instruction design allows us to fairly compare MLLMs, instead of struggling in prompt engineering. Besides, with such an instruction, we can also easily carry out quantitative statistics. A total of 30 advanced MLLMs are comprehensively evaluated on our MME, which not only suggests that existing MLLMs still have a large room for improvement, but also reveals the potential directions for the subsequent model optimization. The data are released at the project page https://github.com/BradyFU/Awesome-Multimodal-Large-Language-Models/tree/Evaluation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 3 minor

Summary. The paper introduces MME, the first comprehensive benchmark for Multimodal Large Language Models (MLLMs), comprising 14 subtasks that separately assess perception and cognition abilities. To mitigate data leakage, all instruction-answer pairs are manually designed rather than drawn from public datasets; concise, fixed instructions are used to enable direct, prompt-engineering-free comparisons across models. The authors evaluate 30 advanced MLLMs on the benchmark and conclude that substantial headroom remains for improvement in both perception and cognition.

Significance. If the no-leakage and instruction-invariance properties can be demonstrated, MME would supply a much-needed standardized yardstick for MLLM progress, analogous to GLUE or ImageNet in their respective domains. The public release of the data and the separation of perception versus cognition subtasks are concrete strengths that would allow the community to track targeted improvements.

major comments (3)
  1. [§3] Benchmark Construction: The claim that manually designed instruction-answer pairs eliminate data leakage is unsupported by any reported overlap audit, n-gram analysis, or membership inference check against the training corpora of the 30 evaluated MLLMs (e.g., LAION-5B, COCO, or VQAv2). Because every quantitative result rests on the assumption that the test pairs are unseen, this omission is load-bearing for the central validity claim.
  2. [§4.2] Model Evaluation: No ablation is presented that varies instruction phrasing while holding the underlying image-question pairs fixed. Without such evidence, the assertion that the chosen concise instructions remove prompt-engineering variance cannot be verified, directly affecting the fairness of the cross-model ranking.
  3. [§3.2] Annotation Process: Inter-annotator agreement statistics (e.g., Cohen's κ or percentage agreement) are not reported for the manually created answer labels across the 14 subtasks. This is required to establish that the ground-truth answers are reliable rather than idiosyncratic to the annotators.
minor comments (3)
  1. [Table 1] The column headers for perception versus cognition subtasks would be clearer if an explicit grouping line or background shading were added.
  2. [§5] Discussion: A few citations to contemporaneous MLLM evaluation efforts (e.g., recent works on LLaVA or InstructBLIP) appear to be missing from the related-work section.
  3. [Figure 2] Axis labels on the radar charts are occasionally truncated; ensure all subtask names remain fully legible at print resolution.

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for the positive summary and for highlighting areas where additional evidence can strengthen the paper. We address each of the major comments in detail below and outline the revisions we plan to make.

read point-by-point responses
  1. Referee: [§3] Benchmark Construction: The claim that manually designed instruction-answer pairs eliminate data leakage is unsupported by any reported overlap audit, n-gram analysis, or membership inference check against the training corpora of the 30 evaluated MLLMs (e.g., LAION-5B, COCO, or VQAv2). Because every quantitative result rests on the assumption that the test pairs are unseen, this omission is load-bearing for the central validity claim.

    Authors: We agree that providing evidence for the lack of data leakage is important to validate the benchmark. Our instruction-answer pairs were entirely manually crafted by the authors, deliberately avoiding any direct extraction from public datasets to prevent leakage. To address this concern, we will add an n-gram overlap analysis with widely used datasets such as COCO, VQAv2, and others in the revised manuscript. A full membership inference check against the proprietary training data of all 30 MLLMs is not possible due to lack of public access to those corpora; however, the manual design process ensures that the pairs are original and not copied from known sources. revision: partial

  2. Referee: [§4.2] Model Evaluation: No ablation is presented that varies instruction phrasing while holding the underlying image-question pairs fixed. Without such evidence, the assertion that the chosen concise instructions remove prompt-engineering variance cannot be verified, directly affecting the fairness of the cross-model ranking.

    Authors: We thank the referee for this suggestion. While our concise instructions were designed to minimize prompt engineering effects and enable consistent comparisons, we recognize the value of empirical validation. In the revised manuscript, we will include an ablation study where we vary the instruction phrasing for a selection of subtasks and models, demonstrating that the performance rankings remain largely consistent. revision: yes

  3. Referee: [§3.2] Annotation Process: Inter-annotator agreement statistics (e.g., Cohen's κ or percentage agreement) are not reported for the manually created answer labels across the 14 subtasks. This is required to establish that the ground-truth answers are reliable rather than idiosyncratic to the annotators.

    Authors: We acknowledge the importance of demonstrating label reliability. The annotations were manually designed by the authors with careful consideration to make answers objective and unambiguous. We did not collect formal inter-annotator agreement statistics during the process. In the revision, we will expand the description of the annotation procedure to better convey how subjectivity was minimized. revision: partial
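On the third point, even a small double-annotation pass would yield the number the referee asks for. A minimal sketch of Cohen's kappa for two annotators labelling the same items; the label arrays are invented for illustration, not drawn from the MME annotation process.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators over the same items (any hashable labels)."""
    assert labels_a and len(labels_a) == len(labels_b)
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:  # degenerate case: both annotators always use the same single label
        return 1.0
    return (observed - expected) / (1.0 - expected)

# Hypothetical double annotation of ten yes/no ground-truth answers.
annotator_1 = ["yes", "no", "yes", "yes", "no", "no", "yes", "no", "yes", "no"]
annotator_2 = ["yes", "no", "yes", "no", "no", "no", "yes", "no", "yes", "yes"]
print(f"kappa = {cohens_kappa(annotator_1, annotator_2):.2f}")  # 0.60 for these labels
```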

Circularity Check

0 steps flagged

No circularity: benchmark is manually constructed and externally evaluated

full rationale

The paper introduces the MME benchmark by manually designing instruction-answer pairs for 14 subtasks to measure perception and cognition in MLLMs. It then directly evaluates 30 external models on these fixed pairs and reports aggregate scores. No parameters are fitted to the benchmark data, no predictions are generated from the benchmark outputs that loop back to its construction, and no uniqueness theorems or ansatzes are invoked via self-citation. The central claims rest on the external model evaluations and the manual design process itself, which is presented as an independent methodological choice rather than a derived result. This satisfies the criteria for a self-contained benchmark paper with no load-bearing circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical benchmark paper. No free parameters, mathematical axioms, or invented entities are introduced; the central claim rests on the manual annotation process and subtask selection.

pith-pipeline@v0.9.0 · 5557 in / 1042 out tokens · 47919 ms · 2026-05-10T20:20:20.783210+00:00 · methodology

discussion (0)


Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. SpikeMLLM: Spike-based Multimodal Large Language Models via Modality-Specific Temporal Scales and Temporal Compression

    cs.NE 2026-04 unverdicted novelty 8.0

    SpikeMLLM is the first spike-based MLLM framework that maintains near-lossless performance under aggressive timestep compression and delivers 9x throughput and 25x power efficiency gains via a custom RTL accelerator.

  2. HM-Bench: A Comprehensive Benchmark for Multimodal Large Language Models in Hyperspectral Remote Sensing

    cs.CV 2026-04 accept novelty 8.0

    HM-Bench is the first benchmark for MLLMs on hyperspectral images, showing models struggle with complex spatial-spectral reasoning and perform better with visual PCA images than textual reports.

  3. OxyEcomBench: Benchmarking Multimodal Foundation Models across E-Commerce Ecosystems

    cs.DB 2026-05 conditional novelty 7.0

    OxyEcomBench is a unified multimodal benchmark covering 6 capability areas and 29 tasks with authentic e-commerce data to measure how well foundation models handle real platform, merchant, and customer challenges.

  4. UniVLR: Unifying Text and Vision in Visual Latent Reasoning for Multimodal LLMs

    cs.CV 2026-05 unverdicted novelty 7.0

    UniVLR unifies textual and visual reasoning in multimodal LLMs by compressing reasoning traces and auxiliary images into visual latent tokens for direct inference without interleaved text CoT.

  5. Allegory of the Cave: Measurement-Grounded Vision-Language Learning

    cs.AI 2026-05 unverdicted novelty 7.0

    PRISM-VL improves VLM performance by grounding on RAW-derived Meas.-XYZ inputs and exposure-bracketed supervision, gaining +0.1074 BLEU and +4.46% LLM-Judge accuracy over an RGB baseline on a held-out benchmark.

  6. Human-Grounded Multimodal Benchmark with 900K-Scale Aggregated Student Response Distributions from Japan's National Assessment of Academic Ability

    cs.CL 2026-05 unverdicted novelty 7.0

    A new benchmark dataset drawn from Japan's National Assessment of Academic Ability supplies real exam layouts, diagrams, Japanese text, and nationwide student response distributions for evaluating multimodal LLMs.

  7. Beyond Accuracy: Benchmarking Cross-Task Consistency in Unified Multimodal Models

    cs.CV 2026-04 unverdicted novelty 7.0

    XTC-Bench reveals that strong performance on generation or understanding tasks in unified multimodal models does not guarantee cross-task semantic consistency, which instead depends on how tightly coupled the learning...

  8. LearnPruner: Rethinking Attention-based Token Pruning in Vision Language Models

    cs.CV 2026-04 unverdicted novelty 7.0

    LearnPruner prunes vision tokens to 5.5% of the original count while retaining about 95% of VLM performance and delivering 3.2 times faster inference by fixing attention sink in encoders and using unbiased middle-laye...

  9. Do MLLMs Understand Pointing? Benchmarking and Enhancing Referential Reasoning in Egocentric Vision

    cs.CV 2026-04 unverdicted novelty 7.0

    EgoPoint-Bench reveals that MLLMs suffer from referential hallucination on egocentric pointing and shows that fine-tuning on its synthetic data produces measurable gains with sim-to-real transfer.

  10. DO-Bench: An Attributable Benchmark for Diagnosing Object Hallucination in Vision-Language Models

    cs.CV 2026-04 unverdicted novelty 7.0

    DO-Bench is a controlled benchmark that attributes VLM object hallucination errors to textual prior pressure, perceptual limits, or their interaction via two diagnostic dimensions and metrics.

  11. S-GRPO: Unified Post-Training for Large Vision-Language Models

    cs.LG 2026-04 unverdicted novelty 7.0

    S-GRPO unifies SFT and RL for LVLMs via conditional ground-truth injection that supplies a maximal-reward anchor when group exploration fails completely.

  12. DSCA: Dynamic Subspace Concept Alignment for Lifelong VLM Editing

    cs.CV 2026-04 unverdicted novelty 7.0

    DSCA turns concept isolation into an architectural property by dynamically creating orthogonal subspaces for non-interfering lifelong edits in vision-language models, sustaining over 95% success after 1000 sequential edits.

  13. ID-Selection: Importance-Diversity Based Visual Token Selection for Efficient LVLM Inference

    cs.CV 2026-04 unverdicted novelty 7.0

    ID-Selection combines importance scoring with iterative diversity suppression to prune 97.2% of visual tokens in LVLMs while retaining 91.8% performance and cutting FLOPs by over 97% without retraining.

  14. Unified Reward Model for Multimodal Understanding and Generation

    cs.CV 2025-03 unverdicted novelty 7.0

    UnifiedReward is the first unified reward model that jointly assesses multimodal understanding and generation to provide better preference signals for aligning vision models via DPO.

  15. MLVU: Benchmarking Multi-task Long Video Understanding

    cs.CV 2024-06 conditional novelty 7.0

    MLVU is a new benchmark for long video understanding that uses extended videos across diverse genres and multi-task evaluations, revealing that current MLLMs struggle significantly and degrade sharply with longer durations.

  16. SEED-Bench: Benchmarking Multimodal LLMs with Generative Comprehension

    cs.CL 2023-07 unverdicted novelty 7.0

    SEED-Bench is a new benchmark of 19K multiple-choice questions for evaluating generative comprehension in multimodal LLMs across 12 image and video dimensions.

  17. GRIP-VLM: Group-Relative Importance Pruning for Efficient Vision-Language Models

    cs.CV 2026-05 unverdicted novelty 6.0

    GRIP-VLM applies group-relative policy optimization via reinforcement learning to prune visual tokens in VLMs, yielding up to 15% inference speedup at matched accuracy over prior methods.

  18. Learning to See What You Need: Gaze Attention for Multimodal Large Language Models

    cs.CV 2026-05 unverdicted novelty 6.0

    Gaze Attention groups visual embeddings into selectable regions and dynamically restricts attention to task-relevant ones, matching dense baselines with up to 90% fewer visual KV entries via added context tokens.

  19. Mitigating Action-Relation Hallucinations in LVLMs via Relation-aware Visual Enhancement

    cs.CV 2026-05 unverdicted novelty 6.0

    A new attention-enhancement method using ARS scores and RVE reduces action-relation hallucinations in LVLMs while generalizing to spatial and object hallucinations.

  20. When Looking Is Not Enough: Visual Attention Structure Reveals Hallucination in MLLMs

    cs.CV 2026-05 unverdicted novelty 6.0

    Layer-wise Laplacian energy of visual attention reveals hallucination emergence in MLLMs and enables LaSCD, a closed-form logit remapping strategy that mitigates hallucinations while preserving general performance.

  21. 20/20 Vision Language Models: A Prescription for Better VLMs through Data Curation Alone

    cs.LG 2026-05 unverdicted novelty 6.0

    Data curation alone raises VLM accuracy by 11+ points on average, improves reliability and OOD generalization, and achieves near-frontier results at far lower training and inference cost.

  22. 20/20 Vision Language Models: A Prescription for Better VLMs through Data Curation Alone

    cs.LG 2026-05 conditional novelty 6.0

    Data curation alone raises VLM accuracy by more than 11 points on average across many benchmarks while cutting required training compute by up to 87 times.

  23. LLaVA-CKD: Bottom-Up Cascaded Knowledge Distillation for Vision-Language Models

    cs.CV 2026-05 unverdicted novelty 6.0

    A cascaded knowledge distillation method with intermediate teachers improves efficiency of vision-language models like LLaVA while achieving state-of-the-art results on seven VQA benchmarks.

  24. Vocabulary Hijacking in LVLMs: Unveiling Critical Attention Heads by Excluding Inert Tokens to Mitigate Hallucination

    cs.MM 2026-05 unverdicted novelty 6.0

    LVLMs show vocabulary hijacking by inert tokens that decode to hijacking anchors; HABI locates them, NHAR finds resilient heads, and HAVAE boosts those heads to cut hallucinations.

  25. Through the Lens of Character: Resolving Modality-Role Interference in Multimodal Role-Playing Agent

    cs.CV 2026-05 unverdicted novelty 6.0

    CAVI framework uses character-guided token pruning, orthogonal feature modulation, and modality-adaptive role steering to resolve modality-role interference in multimodal RPAs.

  26. Evading Visual Aphasia: Contrastive Adaptive Semantic Token Pruning for Vision-Language Models

    cs.CV 2026-05 unverdicted novelty 6.0

    COAST prunes 77.8% of visual tokens in LVLMs with a 2.15x speedup while keeping 98.64% of original performance by adaptively routing semantic and spatial context via contrastive scores.

  27. Anisotropic Modality Align

    cs.MM 2026-05 unverdicted novelty 6.0

    Modality representations share dominant semantic geometry but have an anisotropic residual gap; AnisoAlign corrects source representations boundedly using target geometry for unpaired alignment.

  28. When Language Overwrites Vision: Over-Alignment and Geometric Debiasing in Vision-Language Models

    cs.CV 2026-05 unverdicted novelty 6.0

    Decoder-based VLMs hallucinate due to geometric over-alignment of visual embeddings with the text manifold in a universal dataset-agnostic subspace, mitigated by projecting out the linguistic bias.

  29. When Language Overwrites Vision: Over-Alignment and Geometric Debiasing in Vision-Language Models

    cs.CV 2026-05 unverdicted novelty 6.0

    Decoder-based VLMs over-align visual features to a universal text subspace, injecting linguistic bias; projecting out its top principal components reduces hallucinations on POPE, CHAIR, AMBER and improves long-form ca...

  30. VisMMOE: Exploiting Visual-Expert Affinity for Efficient Visual-Language MoE Offloading

    cs.LG 2026-05 unverdicted novelty 6.0

    VisMMoE exploits visual-expert affinity via token pruning to achieve up to 2.68x faster VL-MoE inference on memory-constrained hardware while keeping accuracy competitive.

  31. Where Reliability Lives in Vision-Language Models: A Mechanistic Study of Attention, Hidden States, and Causal Circuits

    cs.AI 2026-05 unverdicted novelty 6.0

    Attention sharpness barely predicts VLM correctness while hidden-state probes and self-consistency strongly do, with late-fusion models showing fragile reliability bottlenecks unlike early-fusion ones.

  32. Persistent Visual Memory: Sustaining Perception for Deep Generation in LVLMs

    cs.CV 2026-05 unverdicted novelty 6.0

    PVM adds a parallel branch to LVLMs that directly supplies visual embeddings to prevent attention decay over long generated sequences, yielding accuracy gains on reasoning tasks with minimal overhead.

  33. SMoES: Soft Modality-Guided Expert Specialization in MoE-VLMs

    cs.CV 2026-04 unverdicted novelty 6.0

    SMoES improves MoE-VLM performance and efficiency via soft modality-guided expert routing and inter-bin mutual information regularization, yielding 0.9-4.2% task gains and 56% communication reduction.

  34. When Prompts Override Vision: Prompt-Induced Hallucinations in LVLMs

    cs.CV 2026-04 unverdicted novelty 6.0

    Hallucinations in LVLMs largely arise from textual priors in prompts, and can be reduced by fine-tuning with preference optimization on grounded vs. hallucinated response pairs.

  35. Latent Denoising Improves Visual Alignment in Large Multimodal Models

    cs.CV 2026-04 unverdicted novelty 6.0

    A latent denoising objective with saliency-aware corruption and contrastive distillation improves visual alignment and corruption robustness in large multimodal models.

  36. LLaDA2.0-Uni: Unifying Multimodal Understanding and Generation with Diffusion Large Language Model

    cs.CV 2026-04 unverdicted novelty 6.0

    LLaDA2.0-Uni unifies multimodal understanding and generation inside one discrete diffusion large language model with a semantic tokenizer, MoE backbone, and diffusion decoder.

  37. R-CoV: Region-Aware Chain-of-Verification for Alleviating Object Hallucinations in LVLMs

    cs.CV 2026-04 conditional novelty 6.0

    R-CoV is a six-step region-aware chain-of-verification technique that elicits coordinate and description outputs from LVLMs themselves to detect and reduce object hallucinations without external models or retraining.

  38. Mitigating Multimodal Hallucination via Phase-wise Self-reward

    cs.CV 2026-04 unverdicted novelty 6.0

    PSRD mitigates visual hallucinations in LVLMs via phase-wise self-reward decoding, cutting rates by 50% on LLaVA-1.5-7B and outperforming prior methods on five benchmarks.

  39. Towards Joint Quantization and Token Pruning of Vision-Language Models

    cs.CV 2026-04 unverdicted novelty 6.0

    QUOTA jointly optimizes low-bit quantization and visual token pruning for VLMs by deriving pruning decisions from quantized operators, achieving 95.65% average performance retention with only 30% of visual tokens vers...

  40. MACS: Modality-Aware Capacity Scaling for Efficient Multimodal MoE Inference

    cs.LG 2026-04 unverdicted novelty 6.0

    MACS improves inference speed in multimodal MoE models by entropy-weighted balancing of visual tokens and real-time modality-adaptive expert capacity allocation.

  41. MACS: Modality-Aware Capacity Scaling for Efficient Multimodal MoE Inference

    cs.LG 2026-04 unverdicted novelty 6.0

    MACS improves MoE MLLM inference efficiency via entropy-weighted token loads and dynamic modality-adaptive expert capacity allocation.

  42. D-QRELO: Training- and Data-Free Delta Compression for Large Language Models via Quantization and Residual Low-Rank Approximation

    cs.LG 2026-04 unverdicted novelty 6.0

    D-QRELO compresses LLM delta weights via one-bit quantization followed by compensated residual low-rank approximation and outperforms prior methods on dense and MoE models with large SFT datasets.

  43. PivotMerge: Bridging Heterogeneous Multimodal Pre-training via Post-Alignment Model Merging

    cs.CV 2026-04 unverdicted novelty 6.0

    PivotMerge merges heterogeneous multimodal pre-trained models via shared-space decomposition to filter conflicts and layer-wise weights based on alignment contributions, outperforming baselines on multimodal benchmarks.

  44. HTDC: Hesitation-Triggered Differential Calibration for Mitigating Hallucination in Large Vision-Language Models

    cs.CV 2026-04 unverdicted novelty 6.0

    HTDC mitigates hallucinations in LVLMs by triggering calibration only at hesitation-prone decoding steps via contrasts with visual-nullification and semantic-nullification probes.

  45. Decoupled Similarity for Task-Aware Token Pruning in Large Vision-Language Models

    cs.CV 2026-04 unverdicted novelty 6.0

    DeSAP uses decoupled cross-modal similarity plus visual saliency to prune visual tokens in LVLMs, retaining 11.1% tokens for 10x FLOPs reduction and 98.1% performance on LLaVA-1.5-7B.

  46. Symbiotic-MoE: Unlocking the Synergy between Generation and Understanding

    cs.CV 2026-04 unverdicted novelty 6.0

    Symbiotic-MoE introduces modality-aware expert disentanglement and progressive training in a multimodal MoE to achieve synergistic generation and understanding without task interference or extra parameters.

  47. Q-Zoom: Query-Aware Adaptive Perception for Efficient Multimodal Large Language Models

    cs.CV 2026-04 unverdicted novelty 6.0

    Q-Zoom achieves up to 4.39x inference speedup in high-resolution MLLM scenarios via query-aware gating and region localization, matching or exceeding baseline accuracy on document and high-res benchmarks.

  48. AICA-Bench: Holistically Examining the Capabilities of VLMs in Affective Image Content Analysis

    cs.CV 2026-04 unverdicted novelty 6.0

    AICA-Bench evaluates 23 VLMs on affective image analysis, identifies weak intensity calibration and shallow descriptions as limitations, and proposes training-free Grounded Affective Tree Prompting to improve performance.

  49. Video-MME-v2: Towards the Next Stage in Benchmarks for Comprehensive Video Understanding

    cs.CV 2026-04 unverdicted novelty 6.0

    Video-MME-v2 is a new benchmark that applies progressive visual-to-reasoning levels and non-linear group scoring to expose gaps in video MLLM capabilities.

  50. CLEAR: Unlocking Generative Potential for Degraded Image Understanding in Unified Multimodal Models

    cs.CV 2026-04 unverdicted novelty 6.0

    CLEAR uses degradation-aware fine-tuning, a latent representation bridge, and interleaved reinforcement learning to connect generative and reasoning capabilities in multimodal models for better degraded image understanding.

  51. Discovering Failure Modes in Vision-Language Models using RL

    cs.CV 2026-04 unverdicted novelty 6.0

    An RL-based questioner agent adaptively generates queries to discover novel failure modes in VLMs without human intervention.

  52. Saliency-R1: Enforcing Interpretable and Faithful Vision-language Reasoning via Saliency-map Alignment Reward

    cs.CV 2026-04 unverdicted novelty 6.0

    Saliency-R1 uses a novel saliency map technique and GRPO with human bounding-box overlap as reward to improve VLM reasoning faithfulness and interpretability.

  53. InternVLA-M1: A Spatially Guided Vision-Language-Action Framework for Generalist Robot Policy

    cs.RO 2025-10 unverdicted novelty 6.0

    InternVLA-M1 uses spatially guided pre-training on 2.3M examples followed by action post-training to deliver up to 17% gains on robot manipulation benchmarks and 20.6% on unseen objects.

  54. InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency

    cs.CV 2025-08 unverdicted novelty 6.0

    InternVL3.5 advances open-source multimodal models with Cascade RL for +16% reasoning gains and ViR for 4x inference speedup, with the 241B model reaching SOTA among open-source MLLMs on multimodal, reasoning, and age...

  55. InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models

    cs.CV 2025-04 conditional novelty 6.0

    InternVL3-78B sets a new open-source SOTA of 72.2 on MMMU via native joint multimodal pre-training, V2PE, MPO, and test-time scaling while remaining competitive with proprietary models.

  56. Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling

    cs.CV 2024-12 unverdicted novelty 6.0

    InternVL 2.5 is the first open-source MLLM to surpass 70% on the MMMU benchmark via model, data, and test-time scaling, with a 3.7-point gain from chain-of-thought reasoning.

  57. Are We on the Right Way for Evaluating Large Vision-Language Models?

    cs.CV 2024-03 conditional novelty 6.0

    Current LVLM benchmarks overestimate capabilities because many questions can be answered without images due to design flaws or data leakage; MMStar is a human-curated set of 1,500 vision-indispensable samples across 6...

  58. ShareGPT4V: Improving Large Multi-Modal Models with Better Captions

    cs.CV 2023-11 conditional novelty 6.0

    A new 1.2M-caption dataset generated via GPT-4V improves LMMs on MME and MMBench by 222.8/22.0/22.3 and 2.7/1.3/1.5 points respectively when used for supervised fine-tuning.

  59. Video-LLaVA: Learning United Visual Representation by Alignment Before Projection

    cs.CV 2023-11 unverdicted novelty 6.0

    Video-LLaVA creates a unified visual representation for images and videos via pre-projection alignment, enabling mutual enhancement from joint training and strong results on image and video benchmarks.

  60. MMBench: Is Your Multi-modal Model an All-around Player?

    cs.CV 2023-07 accept novelty 6.0

    MMBench is a new bilingual benchmark that uses curated questions, CircularEval, and LLM-assisted answer conversion to provide objective, fine-grained evaluation of vision-language models.

Reference graph

Works this paper leans on

66 extracted references · 66 canonical work pages · cited by 79 Pith papers · 15 internal anchors

  1. [1]

    InfMLLM. https://github.com/mightyzau/InfMLLM, 2023

  2. [2]

    Lion. https://github.com/mynameischaos/Lion, 2023

  3. [3]

    Octopus. https://github.com/gray311/UnifiedMultimodalInstructionTuning, 2023

  4. [4]

    Skywork-MM. https://github.com/will-singularity/Skywork-MM, 2023

  5. [5]

    VisualGLM-6B. https://github.com/THUDM/VisualGLM-6B, 2023

  6. [6]

    WeMM. https://github.com/scenarios/WeMM, 2023

  7. [7]

    XComposer-VL. https://github.com/InternLM/InternLM-XComposer, 2023

  8. [8]

    Flamingo: a visual language model for few-shot learning.NeurIPS, 2022

    Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning.NeurIPS, 2022

  9. [9]

    Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond

    Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-vl: A frontier large vision-language model with versatile abilities.arXiv preprint:2308.12966, 2023

  10. [10]

    Language models are few-shot learners

    Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. NeurIPS, 2020

  11. [11]

    Microsoft COCO Captions: Data Collection and Evaluation Server

    Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakrishna Vedantam, Saurabh Gupta, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco captions: Data collection and evaluation server.arXiv preprint:1504.00325, 2015

  12. [12]

    InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning

    Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, and Steven Hoi. Instructblip: Towards general-purpose vision-language models with instruction tuning.arXiv preprint:2305.06500, 2023

  13. [13]

    PaLM-E: An Embodied Multimodal Language Model

    Danny Driess, Fei Xia, Mehdi SM Sajjadi, Corey Lynch, Aakanksha Chowdhery, Brian Ichter, Ayzaan Wahid, Jonathan Tompson, Quan Vuong, Tianhe Yu, et al. Palm-e: An embodied multimodal language model.arXiv preprint:2303.03378, 2023

  14. [14]

    Mmbench-video: A long-form multi-shot benchmark for holistic video understanding.NeurIPS, 2024

    Xinyu Fang, Kangrui Mao, Haodong Duan, Xiangyu Zhao, Yining Li, Dahua Lin, and Kai Chen. Mmbench-video: A long-form multi-shot benchmark for holistic video understanding.NeurIPS, 2024

  15. [15]

    LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model

    Peng Gao, Jiaming Han, Renrui Zhang, Ziyi Lin, Shijie Geng, Aojun Zhou, Wei Zhang, Pan Lu, Conghui He, Xiangyu Yue, et al. Llama-adapter v2: Parameter-efficient visual instruction model.arXiv preprint:2304.15010, 2023

  16. [16]

    MultiModal-GPT: A Vision and Language Model for Dialogue with Humans

    Tao Gong, Chengqi Lyu, Shilong Zhang, Yudong Wang, Miao Zheng, Qian Zhao, Kuikun Liu, Wenwei Zhang, Ping Luo, and Kai Chen. Multimodal-gpt: A vision and language model for dialogue with humans. arXiv preprint:2305.04790, 2023

  17. [17]

    Making the v in vqa matter: Elevating the role of image understanding in visual question answering

    Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. Making the v in vqa matter: Elevating the role of image understanding in visual question answering. InCVPR, 2017

  18. [18]

    ImageBind-LLM: Multi-modality Instruction Tuning

    Jiaming Han, Renrui Zhang, Wenqi Shao, Peng Gao, Peng Xu, Han Xiao, Kaipeng Zhang, Chris Liu, Song Wen, Ziyu Guo, et al. Imagebind-llm: Multi-modality instruction tuning.arXiv preprint:2309.03905, 2023

  19. [19]

    Bliva: A simple multimodal llm for better handling of text-rich visual questions.arXiv preprint:2308.09936, 2023

    Wenbo Hu, Yifan Xu, Y Li, W Li, Z Chen, and Z Tu. Bliva: A simple multimodal llm for better handling of text-rich visual questions.arXiv preprint:2308.09936, 2023

  20. [20]

    Movienet: A holistic dataset for movie understanding

    Qingqiu Huang, Yu Xiong, Anyi Rao, Jiaze Wang, and Dahua Lin. Movienet: A holistic dataset for movie understanding. InECCV, 2020

  21. [21]

    Language is not all you need: Aligning perception with language models

    Shaohan Huang, Li Dong, Wenhui Wang, Yaru Hao, Saksham Singhal, Shuming Ma, Tengchao Lv, Lei Cui, Owais Khan Mohammed, Qiang Liu, et al. Language is not all you need: Aligning perception with language models.arXiv preprint:2302.14045, 2023

  22. [22]

    MIMIC-IT: Multi-Modal In-Context Instruction Tuning

    Bo Li, Yuanhan Zhang, Liangyu Chen, Jinghao Wang, Fanyi Pu, Jingkang Yang, Chunyuan Li, and Ziwei Liu. Mimic-it: Multi-modal in-context instruction tuning.arXiv preprint:2306.05425, 2023

  23. [23]

    Otter: A multi-modal model with in-context instruction tuning

    Bo Li, Yuanhan Zhang, Liangyu Chen, Jinghao Wang, Jingkang Yang, and Ziwei Liu. Otter: A multi-modal model with in-context instruction tuning.arXiv preprint:2305.03726, 2023

  24. [24]

    Fine-tuning multimodal llms to follow zero-shot demonstrative instructions.arXiv preprint:2308.04152, 2023

    Juncheng Li, Kaihang Pan, Zhiqi Ge, Minghe Gao, Hanwang Zhang, Wei Ji, Wenqiao Zhang, Tat-Seng Chua, Siliang Tang, and Yueting Zhuang. Fine-tuning multimodal llms to follow zero-shot demonstrative instructions.arXiv preprint:2308.04152, 2023

  25. [25]

    BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models

    Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models.arXiv preprint:2301.12597, 2023

  26. [26]

    Evaluating Object Hallucination in Large Vision-Language Models

    Yifan Li, Yifan Du, Kun Zhou, Jinpeng Wang, Wayne Xin Zhao, and Ji-Rong Wen. Evaluating object hallucination in large vision-language models.arXiv preprint:2305.10355, 2023

  27. [27]

    Microsoft coco: Common objects in context

    Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. InECCV, 2014

  28. [28]

    Sphinx: The joint mixing of weights, tasks, and visual embeddings for multi-modal large language models

    Ziyi Lin, Chris Liu, Renrui Zhang, Peng Gao, Longtian Qiu, Han Xiao, Han Qiu, Chen Lin, Wenqi Shao, Keqin Chen, et al. Sphinx: The joint mixing of weights, tasks, and visual embeddings for multi-modal large language models.arXiv preprint:2311.07575, 2023

  29. [29]

    Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning

    Fuxiao Liu, Kevin Lin, Linjie Li, Jianfeng Wang, Yaser Yacoob, and Lijuan Wang. Aligning large multi-modal model with robust instruction tuning.arXiv preprint:2306.14565, 2023

  30. [30]

    Visual Instruction Tuning

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning.arXiv preprint:2304.08485, 2023

  31. [31]

    Geometry-guided dense perspective network for speech-driven facial animation.IEEE TVCG, 2021

    Jingying Liu, Binyuan Hui, Kun Li, Yunke Liu, Yu-Kun Lai, Yuxiang Zhang, Yebin Liu, and Jingyu Yang. Geometry-guided dense perspective network for speech-driven facial animation.IEEE TVCG, 2021

  32. [32]

    MMBench: Is Your Multi-modal Model an All-around Player?

    Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, et al. Mmbench: Is your multi-modal model an all-around player?arXiv preprint:2307.06281, 2023

  33. [33]

    Curved scene text detection via transverse and longitudinal sequence connection.PR, 2019

    Yuliang Liu, Lianwen Jin, Shuaitao Zhang, Canjie Luo, and Sheng Zhang. Curved scene text detection via transverse and longitudinal sequence connection.PR, 2019

  34. [34]

    Learn to explain: Multimodal reasoning via thought chains for science question answering.NeurIPS, 2022

    Pan Lu, Swaroop Mishra, Tanglin Xia, Liang Qiu, Kai-Wei Chang, Song-Chun Zhu, Oyvind Tafjord, Peter Clark, and Ashwin Kalyan. Learn to explain: Multimodal reasoning via thought chains for science question answering.NeurIPS, 2022

  35. [35]

    Cheap and quick: Efficient vision-language instruction tuning for large language models.arXiv preprint arXiv:2305.15023, 2023

    Gen Luo, Yiyi Zhou, Tianhe Ren, Shengxin Chen, Xiaoshuai Sun, and Rongrong Ji. Cheap and quick: Efficient vision-language instruction tuning for large language models.arXiv preprint:2305.15023, 2023

  36. [36]

    Deepart: Learning joint representations of visual arts

    Hui Mao, Ming Cheung, and James She. Deepart: Learning joint representations of visual arts. InICM, 2017

  37. [37]

    Visual arts search on mobile devices.TOMM, 2019

    Hui Mao, James She, and Ming Cheung. Visual arts search on mobile devices.TOMM, 2019

  38. [38]

    Ok-vqa: A visual question answering benchmark requiring external knowledge

    Kenneth Marino, Mohammad Rastegari, Ali Farhadi, and Roozbeh Mottaghi. Ok-vqa: A visual question answering benchmark requiring external knowledge. InCVPR, 2019

  39. [39]

    GPT-4 Technical Report

    OpenAI. Gpt-4 technical report.arXiv preprint:2303.08774, 2023

  40. [40]

    HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in Hugging Face

    Yongliang Shen, Kaitao Song, Xu Tan, Dongsheng Li, Weiming Lu, and Yueting Zhuang. Hugginggpt: Solving ai tasks with chatgpt and its friends in huggingface.arXiv preprint:2303.17580, 2023

  41. [41]

    Pandagpt: One model to instruction-follow them all.arXiv preprint arXiv:2305.16355, 2023

    Yixuan Su, Tian Lan, Huayang Li, Jialu Xu, Yan Wang, and Deng Cai. Pandagpt: One model to instruction-follow them all.arXiv preprint:2305.16355, 2023

  42. [42]

    Kimi-VL Technical Report

    Kimi Team, Angang Du, Bohong Yin, Bowei Xing, Bowen Qu, Bowen Wang, Cheng Chen, Chenlin Zhang, Chenzhuang Du, Chu Wei, et al. Kimi-vl technical report.arXiv preprint arXiv:2504.07491, 2025

  43. [43]

    LLaMA: Open and Efficient Foundation Language Models

    Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models.arXiv preprint:2302.13971, 2023

  44. [44]

    An overview of large ai models and their applications.Visual Intelligence, 2024

    Xiaoguang Tu, Zhi He, Yi Huang, Zhi-Hao Zhang, Ming Yang, and Jian Zhao. An overview of large ai models and their applications.Visual Intelligence, 2024

  45. [45]

    Git: A generative image-to-text transformer for vision and language

    Jianfeng Wang, Zhengyuan Yang, Xiaowei Hu, Linjie Li, Kevin Lin, Zhe Gan, Zicheng Liu, Ce Liu, and Lijuan Wang. Git: A generative image-to-text transformer for vision and language.arXiv preprint:2205.14100, 2022

  46. [46]

    Visionllm: Large language model is also an open-ended decoder for vision-centric tasks

    Wenhai Wang, Zhe Chen, Xiaokang Chen, Jiannan Wu, Xizhou Zhu, Gang Zeng, Ping Luo, Tong Lu, Jie Zhou, Yu Qiao, et al. Visionllm: Large language model is also an open-ended decoder for vision-centric tasks.arXiv preprint:2305.11175, 2023

  47. [47]

    Chain-of-Thought Prompting Elicits Reasoning in Large Language Models

    Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Ed Chi, Quoc Le, and Denny Zhou. Chain of thought prompting elicits reasoning in large language models.arXiv preprint:2201.11903, 2022

  48. [48]

    Dycrowd: Towards dynamic crowd reconstruction from a large-scene video.IEEE TPAMI, 2025

    Hao Wen, Hongbo Kang, Jian Ma, Jing Huang, Yuanwang Yang, Haozhe Lin, Yu-Kun Lai, and Kun Li. Dycrowd: Towards dynamic crowd reconstruction from a large-scene video.IEEE TPAMI, 2025

  49. [49]

    Google landmarks dataset v2-a large-scale benchmark for instance-level recognition and retrieval

    Tobias Weyand, Andre Araujo, Bingyi Cao, and Jack Sim. Google landmarks dataset v2-a large-scale benchmark for instance-level recognition and retrieval. InCVPR, 2020

  50. [50]

    An early evaluation of gpt-4v(ision).arXiv preprint:2310.16534, 2023

    Yang Wu, Shilong Wang, Hao Yang, Tian Zheng, Hongbo Zhang, Yanyan Zhao, and Bing Qin. An early evaluation of gpt-4v(ision).arXiv preprint:2310.16534, 2023

  51. [51]

    Multiinstruct: Improving multi-modal zero-shot learning via instruction tuning

    Zhiyang Xu, Ying Shen, and Lifu Huang. Multiinstruct: Improving multi-modal zero-shot learning via instruction tuning.arXiv preprint:2212.10773, 2022

  52. [52]

    mplug-owl2: Revolutionizing multi-modal large language model with modality collaboration

    Qinghao Ye, Haiyang Xu, Jiabo Ye, Ming Yan, Haowei Liu, Qi Qian, Ji Zhang, Fei Huang, and Jingren Zhou. mplug-owl2: Revolutionizing multi-modal large language model with modality collaboration.arXiv preprint:2311.04257, 2023

  53. [53]

    A Survey on Multimodal Large Language Models

    Shukang Yin, Chaoyou Fu, Sirui Zhao, Ke Li, Xing Sun, Tong Xu, and Enhong Chen. A survey on multimodal large language models.arXiv preprint:2306.13549, 2023

  54. [54]

    Woodpecker: Hallucination correction for multimodal large language models

    Shukang Yin, Chaoyou Fu, Sirui Zhao, Tong Xu, Hao Wang, Dianbo Sui, Yunhang Shen, Ke Li, Xing Sun, and Enhong Chen. Woodpecker: Hallucination correction for multimodal large language models.arXiv preprint:2310.16045, 2023

  55. [55]

    MMT-Bench: A Comprehensive Multimodal Benchmark for Evaluating Large Vision-Language Models Towards Multitask AGI

    Kaining Ying, Fanqing Meng, Jin Wang, Zhiqian Li, Han Lin, Yue Yang, Hao Zhang, Wenbo Zhang, Yuqi Lin, Shuo Liu, et al. Mmt-bench: A comprehensive multimodal benchmark for evaluating large vision-language models towards multitask agi.arXiv preprint arXiv:2404.16006, 2024

  56. [56]

    Reformulating vision-language foundation models and datasets towards universal multimodal assistants

    Tianyu Yu, Jinyi Hu, Yuan Yao, Haoye Zhang, Yue Zhao, Chongyi Wang, Shan Wang, Yinxv Pan, Jiao Xue, Dahai Li, et al. Reformulating vision-language foundation models and datasets towards universal multimodal assistants.arXiv preprint:2310.00653, 2023

  57. [57]

    What Matters in Training a GPT4-Style Language Model with Multimodal Inputs?

    Yan Zeng, Hanbo Zhang, Jiani Zheng, Jiangnan Xia, Guoqiang Wei, Yang Wei, Yuchen Zhang, and Tao Kong. What matters in training a gpt4-style language model with multimodal inputs?arXiv preprint:2307.02469, 2023

  58. [58]

    Transfer visual prompt generator across llms.arXiv preprint:2305.01278, 2023

    Ao Zhang, Hao Fei, Yuan Yao, Wei Ji, Li Li, Zhiyuan Liu, and Tat-Seng Chua. Transfer visual prompt generator across llms.arXiv preprint:2305.01278, 2023

  59. [59]

    Logavatar: Local gaussian splatting for human avatar modeling from monocular video.CAD, 2025

    Jinsong Zhang, Xiongzheng Li, Hailong Jia, Jin Li, Zhuo Su, Guidong Wang, and Kun Li. Logavatar: Local gaussian splatting for human avatar modeling from monocular video.CAD, 2025

  60. [60]

    Speechact: Towards generating whole-body motion from speech.IEEE TVCG, 2025

    Jinsong Zhang, Minjie Zhu, Yuxiang Zhang, Zerong Zheng, Yebin Liu, and Kun Li. Speechact: Towards generating whole-body motion from speech.IEEE TVCG, 2025

  61. [61]

    Mmicl: Empowering vision-language model with multi-modal in-context learning.arXiv preprint:2309.07915, 2023

    Haozhe Zhao, Zefan Cai, Shuzheng Si, Xiaojian Ma, Kaikai An, Liang Chen, Zixuan Liu, Sheng Wang, Wenjuan Han, and Baobao Chang. Mmicl: Empowering vision-language model with multi-modal in-context learning.arXiv preprint:2309.07915, 2023

  62. [62]

    A Survey of Large Language Models

    Wayne Xin Zhao, Kun Zhou, Junyi Li, Tianyi Tang, Xiaolei Wang, Yupeng Hou, Yingqian Min, Beichen Zhang, Junjie Zhang, Zican Dong, et al. A survey of large language models.arXiv preprint:2303.18223, 2023

  63. [63]

    On evaluating adversarial robustness of large vision-language models

    Yunqing Zhao, Tianyu Pang, Chao Du, Xiao Yang, Chongxuan Li, Ngai-Man Cheung, and Min Lin. On evaluating adversarial robustness of large vision-language models.arXiv preprint:2305.16934, 2023

  64. [64]

    ChatBridge: Bridging Modalities with Large Language Model as a Language Catalyst.arXiv:2305.16103, 2023

    Zijia Zhao, Longteng Guo, Tongtian Yue, Sihan Chen, Shuai Shao, Xinxin Zhu, Zehuan Yuan, and Jing Liu. Chatbridge: Bridging modalities with large language model as a language catalyst.arXiv preprint:2305.16103, 2023

  65. [65]

    Learning deep features for scene recognition using places database.NeurIPS, 2014

    Bolei Zhou, Agata Lapedriza, Jianxiong Xiao, Antonio Torralba, and Aude Oliva. Learning deep features for scene recognition using places database.NeurIPS, 2014

  66. [66]

    MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models

    Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny. Minigpt-4: Enhancing vision-language understanding with advanced large language models.arXiv preprint:2304.10592, 2023