SEED-Bench: Benchmarking Multimodal LLMs with Generative Comprehension
Pith reviewed 2026-05-12 16:55 UTC · model grok-4.3
The pith
SEED-Bench supplies 19K human-verified multiple-choice questions to measure multimodal LLMs on image and video comprehension across 12 dimensions.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
SEED-Bench consists of 19K multiple-choice questions with accurate human annotations, spanning 12 evaluation dimensions that cover comprehension of both the image and video modalities, and enabling an objective and efficient assessment of model performance without human or GPT intervention during evaluation.
What carries the argument
A pipeline that generates multiple-choice questions targeting specific evaluation dimensions, combining automatic filtering with manual verification.
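A minimal sketch of that shape of pipeline, under assumed interfaces (a question generator, a text-only filter model, and a manual-verification callback); the paper's actual Section 3 pipeline differs in its specifics:

```python
# Hypothetical sketch of a generate -> auto-filter -> manually-verify pipeline.
# All interfaces (generator, text_only_model, verify_fn) are assumptions for
# illustration, not the paper's actual implementation.
from dataclasses import dataclass

@dataclass
class MCQ:
    question: str
    options: list[str]      # candidate answer options
    answer_idx: int         # index of the human-annotated ground-truth option
    dimension: str          # one of the 12 evaluation dimensions

def auto_filter(candidates, text_only_model, max_blind_conf=0.5):
    """Drop questions that a text-only model answers too confidently,
    since those are likely solvable from language priors alone."""
    kept = []
    for q in candidates:
        blind_conf = text_only_model.option_confidence(q.question, q.options)
        if max(blind_conf) < max_blind_conf:
            kept.append(q)
    return kept

def build_benchmark(visual_annotations, generator, text_only_model, verify_fn):
    candidates = [generator.make_mcq(ann) for ann in visual_annotations]
    filtered = auto_filter(candidates, text_only_model)
    # Manual verification: annotators confirm the question and ground-truth option.
    return [q for q in filtered if verify_fn(q)]
```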
If this is right
- Evaluating 18 models across all 12 dimensions reveals concrete limitations in current MLLMs for both spatial and temporal understanding.
- The benchmark supports consistent leaderboard tracking that lets the community compare progress without repeated human judgment.
- Insights from the results can directly motivate targeted improvements in models that handle image and video modalities together.
Where Pith is reading between the lines
- Widespread use of this benchmark could make cross-model comparisons more reliable by fixing the question set and scoring method.
- The scale and verification process may encourage development of models that maintain performance when questions shift from multiple choice to free-form generation.
- Extending similar pipelines to new modalities could help identify whether comprehension gaps are modality-specific or general.
Load-bearing premise
The questions produced by automatic generation plus manual verification actually test genuine generative comprehension instead of artifacts from the creation process.
What would settle it
An experiment showing that models scoring highest on SEED-Bench still fail to produce accurate open-ended descriptions or answers on the same image and video content.
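A hedged sketch of how that comparison could be scored, assuming a hypothetical model interface with both an MCQ and an open-ended answering mode, plus an external judge for the free-form answers:

```python
# Sketch of the proposed experiment: does high MCQ accuracy on SEED-Bench imply
# accurate open-ended answers on the same images/videos? `model` and
# `judge_open_ended` are hypothetical interfaces, not part of the benchmark.
def mcq_vs_open_ended(model, items, judge_open_ended):
    mcq_right, open_right, both_right = 0, 0, 0
    for item in items:
        pred_idx = model.answer_mcq(item["visual"], item["question"], item["options"])
        mcq_ok = pred_idx == item["answer_idx"]
        free_text = model.answer_open_ended(item["visual"], item["question"])
        open_ok = judge_open_ended(free_text, item)  # e.g., human rating
        mcq_right += mcq_ok
        open_right += open_ok
        both_right += mcq_ok and open_ok
    n = len(items)
    return {
        "mcq_acc": mcq_right / n,
        "open_ended_acc": open_right / n,
        # low conditional agreement would mean MCQ scores overstate comprehension
        "open_ended_acc_given_mcq_correct": both_right / max(mcq_right, 1),
    }
```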
Original abstract
Based on powerful Large Language Models (LLMs), recent generative Multimodal Large Language Models (MLLMs) have gained prominence as a pivotal research area, exhibiting remarkable capability for both comprehension and generation. In this work, we address the evaluation of generative comprehension in MLLMs as a preliminary step towards a comprehensive assessment of generative models, by introducing a benchmark named SEED-Bench. SEED-Bench consists of 19K multiple choice questions with accurate human annotations (x 6 larger than existing benchmarks), which spans 12 evaluation dimensions including the comprehension of both the image and video modality. We develop an advanced pipeline for generating multiple-choice questions that target specific evaluation dimensions, integrating both automatic filtering and manual verification processes. Multiple-choice questions with groundtruth options derived from human annotation enables an objective and efficient assessment of model performance, eliminating the need for human or GPT intervention during evaluation. We further evaluate the performance of 18 models across all 12 dimensions, covering both the spatial and temporal understanding. By revealing the limitations of existing MLLMs through evaluation results, we aim for SEED-Bench to provide insights for motivating future research. We will launch and consistently maintain a leaderboard to provide a platform for the community to assess and investigate model capability.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces SEED-Bench, a benchmark of 19K multiple-choice questions with human annotations for evaluating generative comprehension in Multimodal LLMs (MLLMs). It spans 12 dimensions covering spatial and temporal understanding of both image and video modalities, constructed via an automatic question-generation pipeline with filtering and manual verification. The authors evaluate 18 existing MLLMs on the benchmark, reveal their limitations, and announce a public leaderboard.
Significance. If validated to require genuine multimodal input, SEED-Bench would be a meaningful contribution due to its scale (six times larger than prior benchmarks) and broad coverage of 12 dimensions. A well-controlled benchmark of this size could standardize evaluation of MLLM comprehension and guide improvements in visual-language integration.
major comments (3)
- [Section 3] Benchmark construction (Section 3): The pipeline description provides no quantitative evidence that questions cannot be solved from question text and options alone (e.g., no text-only baseline accuracy reported, no ablation removing images/videos). This directly undermines the central claim that performance measures multimodal comprehension rather than language priors.
- [Section 3.2] Annotation process (Section 3.2): No inter-annotator agreement statistics or details on how the 12 evaluation dimensions were selected and operationalized are reported, weakening confidence that the 19K questions reliably target the intended spatial/temporal capabilities.
- [Section 4] Evaluation results (Section 4): The reported model scores lack analysis of whether errors correlate with visual content (e.g., via attention maps or controlled perturbations); without this, it is unclear whether the benchmark isolates the claimed generative comprehension limitations.
minor comments (2)
- [Abstract] The abstract and introduction repeat the 'x6 larger' claim without citing the exact sizes of the compared benchmarks.
- [Figure 1] Figure 1 caption could more explicitly label the 12 dimensions and their image/video split for quick reference.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed comments, which help strengthen the manuscript. We address each major comment point by point below, agreeing where revisions are warranted and providing clarifications where the existing work already supports our claims. We will update the paper accordingly in the revised version.
Point-by-point responses
-
Referee: [Section 3] Benchmark construction (Section 3): The pipeline description provides no quantitative evidence that questions cannot be solved from question text and options alone (e.g., no text-only baseline accuracy reported, no ablation removing images/videos). This directly undermines the central claim that performance measures multimodal comprehension rather than language priors.
Authors: We agree that explicit quantitative validation is important to confirm the benchmark requires multimodal input. Although the questions are generated from visual content with human-annotated ground truth and filtered to target specific visual dimensions, we did not report a text-only baseline in the original submission. In the revised manuscript, we will add evaluations of multiple models on the text-only version of SEED-Bench, demonstrating substantially lower accuracy without images or videos. This will directly support that the benchmark measures generative multimodal comprehension rather than language priors alone. revision: yes
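A minimal sketch of such a text-only ablation, assuming generic item fields and a log-likelihood scoring interface rather than the authors' evaluation code:

```python
# Sketch of the text-only ablation: score each option from the question text
# alone, with no image or video. The item fields and `loglik` interface are
# assumptions for illustration.
import random

def text_only_accuracy(language_model, items):
    correct = 0
    for it in items:
        # rank options by the LM's log-likelihood of the option given the question
        scores = [language_model.loglik(it["question"], opt) for opt in it["options"]]
        correct += scores.index(max(scores)) == it["answer_idx"]
    return correct / len(items)

def chance_level(items, seed=0):
    rng = random.Random(seed)
    hits = sum(rng.randrange(len(it["options"])) == it["answer_idx"] for it in items)
    return hits / len(items)  # ~0.25 for four-option questions

# Text-only accuracy near chance would support the claim that the benchmark
# requires visual input; accuracy well above chance would point to language priors.
```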
-
Referee: [Section 3.2] Annotation process (Section 3.2): No inter-annotator agreement statistics or details on how the 12 evaluation dimensions were selected and operationalized are reported, weakening confidence that the 19K questions reliably target the intended spatial/temporal capabilities.
Authors: We acknowledge the value of reporting inter-annotator agreement to increase confidence in the annotations. We will add these statistics (e.g., agreement rates across the manual verification step) to the revised Section 3.2. The 12 dimensions were selected to comprehensively cover spatial and temporal understanding for both images and videos, drawing from established categories in visual reasoning and video comprehension literature. We will expand the description of how each dimension is operationalized through targeted question templates and examples in the updated manuscript. revision: yes
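As an illustration of the kind of statistic that could be reported, the sketch below computes Cohen's kappa over two annotators' keep/reject decisions; the label format is an assumption, not the paper's annotation protocol.

```python
# Cohen's kappa over two annotators' keep/reject labels from manual verification.
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    return 1.0 if expected == 1 else (observed - expected) / (1 - expected)

# Example: two annotators labelling six candidate questions as keep (1) / reject (0).
print(cohens_kappa([1, 1, 0, 1, 0, 1], [1, 1, 0, 0, 0, 1]))  # ≈ 0.67
```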
-
Referee: [Section 4] Evaluation results (Section 4): The reported model scores lack analysis of whether errors correlate with visual content (e.g., via attention maps or controlled perturbations); without this, it is unclear whether the benchmark isolates the claimed generative comprehension limitations.
Authors: This is a fair point for deeper validation of error sources. The current results already show systematic weaknesses across models on specific dimensions (e.g., temporal reasoning), which we attribute to multimodal integration challenges based on the question design. However, attention map analysis or systematic perturbations would require additional experiments not included in this benchmark-focused work. In the revision, we will incorporate a qualitative error analysis with example cases linking failures to visual elements, along with a discussion of how such analyses could be pursued in future work. revision: partial
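A sketch of one such controlled perturbation, assuming a hypothetical model interface: re-pair each question with a mismatched image or clip and measure the accuracy drop.

```python
# Re-pair each question with a randomly shuffled (mismatched) visual input and
# compare accuracy against the original pairing. The `answer_mcq` interface and
# item fields are hypothetical.
import random

def perturbation_gap(model, items, seed=0):
    rng = random.Random(seed)
    visuals = [it["visual"] for it in items]

    def accuracy(paired_visuals):
        hits = sum(
            model.answer_mcq(v, it["question"], it["options"]) == it["answer_idx"]
            for v, it in zip(paired_visuals, items)
        )
        return hits / len(items)

    shuffled = visuals[:]
    rng.shuffle(shuffled)
    original, mismatched = accuracy(visuals), accuracy(shuffled)
    # A large drop suggests answers depend on the visual content; a small drop
    # suggests they are driven by language priors or option artifacts.
    return {"original_acc": original, "mismatched_acc": mismatched,
            "gap": original - mismatched}
```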
Circularity Check
No circularity: benchmark construction is descriptive and externally verifiable
Full rationale
The paper introduces SEED-Bench via an explicit pipeline of automatic question generation, filtering, and human annotation/verification to produce 19K MCQs across 12 dimensions. No equations, fitted parameters, predictions, or derivations are claimed. The central claim (that the resulting questions enable objective evaluation of MLLM comprehension) rests on the described human-verified ground truth rather than reducing to self-definition or self-citation. Evaluation of 18 external models occurs after benchmark creation, providing an independent test. This matches the default expectation of a self-contained benchmark paper with no load-bearing circular steps.
Forward citations
Cited by 38 Pith papers
-
MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI
MMMU provides 11.5K heterogeneous college-level multimodal questions that current models solve at 56-59% accuracy, establishing a new standard for expert multimodal evaluation.
-
CXR-ContraBench: Benchmarking Negated-Option Attraction in Medical VLMs
Medical VLMs frequently select negated options that contradict visible chest X-ray findings, achieving only ~30% accuracy on direct presence probes, but a post-hoc consistency verifier raises accuracy above 95%.
-
Rethinking Model Selection in VLM Through the Lens of Gromov-Wasserstein Distance
Gromov-Wasserstein distance between modalities provides a stronger, inference-only predictor of final VLM performance than conventional encoder metrics, backed by theory linking it to cross-modal learnability and veri...
-
COHERENCE: Benchmarking Fine-Grained Image-Text Alignment in Interleaved Multimodal Contexts
COHERENCE is a new benchmark for measuring MLLMs' ability to recover fine-grained image-text correspondences in interleaved multimodal contexts.
-
COHERENCE: Benchmarking Fine-Grained Image-Text Alignment in Interleaved Multimodal Contexts
COHERENCE is a benchmark for MLLMs' fine-grained image-text alignment in interleaved multimodal contexts across four domains, with 6161 questions and six-type error analysis.
-
Beyond Accuracy: Benchmarking Cross-Task Consistency in Unified Multimodal Models
XTC-Bench reveals that strong performance on generation or understanding tasks in unified multimodal models does not guarantee cross-task semantic consistency, which instead depends on how tightly coupled the learning...
-
ShredBench: Evaluating the Semantic Reasoning Capabilities of Multimodal LLMs in Document Reconstruction
ShredBench shows state-of-the-art MLLMs perform well on intact documents but suffer sharp drops in restoration accuracy as fragmentation increases to 8-16 pieces, indicating insufficient cross-modal semantic reasoning...
-
Culture-Aware Humorous Captioning: Multimodal Humor Generation across Cultural Contexts
Introduces culture-aware humorous captioning task and staged alignment framework that improves contextual fit and balances image relevance with humor in multimodal LLMs.
-
Mind's Eye: A Benchmark of Visual Abstraction, Transformation and Composition for Multimodal LLMs
Mind's Eye benchmark shows top multimodal LLMs score below 50% on visual abstraction, relation, and transformation tasks while humans reach 80%.
-
VisPCO: Visual Token Pruning Configuration Optimization via Budget-Aware Pareto-Frontier Learning for Vision-Language Models
VisPCO uses continuous relaxation, straight-through estimators, and budget-aware Pareto-frontier learning to automatically discover optimal visual token pruning configurations that approximate grid-search results acro...
-
GeoMMBench and GeoMMAgent: Toward Expert-Level Multimodal Intelligence in Geoscience and Remote Sensing
GeoMMBench reveals deficiencies in current multimodal LLMs for geoscience tasks while GeoMMAgent demonstrates that tool-integrated agents achieve significantly higher performance.
-
Seeing the Scene Matters: Revealing Forgetting in Video Understanding Models with a Scene-Aware Long-Video Benchmark
SceneBench shows VLMs lose accuracy on scene-level questions in long videos due to forgetting, and Scene-RAG retrieval improves performance by 2.5%.
-
LLMind: Bio-inspired Training-free Adaptive Visual Representations for Vision-Language Models
LLMind uses bio-inspired non-uniform sampling via a Mobius module and closed-loop semantic feedback to retain 82-97% of full-resolution VLM performance with only 1-5% of pixels on VQA benchmarks.
-
MLVU: Benchmarking Multi-task Long Video Understanding
MLVU is a new benchmark for long video understanding that uses extended videos across diverse genres and multi-task evaluations, revealing that current MLLMs struggle significantly and degrade sharply with longer durations.
-
20/20 Vision Language Models: A Prescription for Better VLMs through Data Curation Alone
Data curation alone raises VLM accuracy by more than 11 points on average across many benchmarks while cutting required training compute by up to 87 times.
-
20/20 Vision Language Models: A Prescription for Better VLMs through Data Curation Alone
Data curation alone raises VLM accuracy by 11+ points on average, improves reliability and OOD generalization, and achieves near-frontier results at far lower training and inference cost.
-
dFlowGRPO: Rate-Aware Policy Optimization for Discrete Flow Models
dFlowGRPO is a new rate-aware RL method for discrete flow models that outperforms prior GRPO approaches on image generation and matches continuous flow models while supporting broad probability paths.
-
Where Reliability Lives in Vision-Language Models: A Mechanistic Study of Attention, Hidden States, and Causal Circuits
Attention sharpness barely predicts VLM correctness while hidden-state probes and self-consistency strongly do, with late-fusion models showing fragile reliability bottlenecks unlike early-fusion ones.
-
Latent Denoising Improves Visual Alignment in Large Multimodal Models
A latent denoising objective with saliency-aware corruption and contrastive distillation improves visual alignment and corruption robustness in large multimodal models.
-
Towards Joint Quantization and Token Pruning of Vision-Language Models
QUOTA jointly optimizes low-bit quantization and visual token pruning for VLMs by deriving pruning decisions from quantized operators, achieving 95.65% average performance retention with only 30% of visual tokens vers...
-
PivotMerge: Bridging Heterogeneous Multimodal Pre-training via Post-Alignment Model Merging
PivotMerge merges heterogeneous multimodal pre-trained models via shared-space decomposition to filter conflicts and layer-wise weights based on alignment contributions, outperforming baselines on multimodal benchmarks.
-
From Perception to Planning: Evolving Ego-Centric Task-Oriented Spatiotemporal Reasoning via Curriculum Learning
EgoTSR applies a three-stage curriculum on a 46-million-sample dataset to build egocentric spatiotemporal reasoning, reaching 92.4% accuracy on long-horizon tasks and reducing chronological biases.
-
Emu3: Next-Token Prediction is All You Need
Emu3 shows that next-token prediction on a unified discrete token space for text, images, and video lets a single transformer outperform task-specific models such as SDXL and LLaVA-1.6 in multimodal generation and perception.
-
Are We on the Right Way for Evaluating Large Vision-Language Models?
Current LVLM benchmarks overestimate capabilities because many questions can be answered without images due to design flaws or data leakage; MMStar is a human-curated set of 1,500 vision-indispensable samples across 6...
-
DriveVLM: The Convergence of Autonomous Driving and Large Vision-Language Models
DriveVLM adds vision-language models with scene description, analysis, and hierarchical planning modules to autonomous driving, paired with a hybrid DriveVLM-Dual system tested on nuScenes and SUP-AD datasets and depl...
-
ShareGPT4V: Improving Large Multi-Modal Models with Better Captions
A new 1.2M-caption dataset generated via GPT-4V improves LMMs on MME and MMBench by 222.8/22.0/22.3 and 2.7/1.3/1.5 points respectively when used for supervised fine-tuning.
-
MMBench: Is Your Multi-modal Model an All-around Player?
MMBench is a new bilingual benchmark that uses curated questions, CircularEval, and LLM-assisted answer conversion to provide objective, fine-grained evaluation of vision-language models.
-
ML-CLIPSim: Multi-Layer CLIP Similarity for Machine-Oriented Image Quality
ML-CLIPSim aggregates multi-layer patch and global similarities from frozen CLIP to approximate machine utility for images and outperforms standard IQA metrics on machine-preference tasks while staying competitive on ...
-
Structural Ranking of the Cognitive Plausibility of Computational Models of Analogy and Metaphors with the Minimal Cognitive Grid
A formalized Minimal Cognitive Grid ranks computational models of analogy and metaphor by alignment with cognitive theories using Functional/Structural Ratio, Generality, and Performance Match dimensions.
-
HalluClear: Diagnosing, Evaluating and Mitigating Hallucinations in GUI Agents
HalluClear supplies a taxonomy, calibrated evaluation, and lightweight post-training mitigation that reduces hallucinations in GUI agents using only 9K samples.
-
Spotlight and Shadow: Attention-Guided Dual-Anchor Introspective Decoding for MLLM Hallucination Mitigation
DaID mitigates MLLM hallucinations by attention-guided selection of dual layers that calibrate token generation using internal perceptual discrepancies.
-
BLIP3-o: A Family of Fully Open Unified Multimodal Models-Architecture, Training and Dataset
BLIP3-o uses a diffusion transformer to generate CLIP image features and a sequential pretraining strategy to build open models that perform strongly on both image understanding and generation benchmarks.
-
ZAYA1-VL-8B Technical Report
ZAYA1-VL-8B is a new MoE vision-language model with vision-specific LoRA adapters and bidirectional image attention that reports competitive performance against several 3B-4B models on image, reasoning, and counting b...
-
Show-o2: Improved Native Unified Multimodal Models
Show-o2 unifies text, image, and video understanding and generation in a single autoregressive-plus-flow-matching model built on 3D causal VAE representations.
-
How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites
InternVL 1.5 narrows the performance gap to proprietary multimodal models via a stronger transferable vision encoder, dynamic high-resolution tiling, and curated English-Chinese training data.
-
DeepSeek-VL: Towards Real-World Vision-Language Understanding
DeepSeek-VL develops open-source 1.3B and 7B vision-language models that achieve competitive or state-of-the-art results on real-world visual-language benchmarks through diverse data curation, a hybrid vision encoder,...
-
Improved Baselines with Visual Instruction Tuning
Simple changes to LLaVA using CLIP-ViT-L-336px, an MLP connector, and academic VQA data yield state-of-the-art results on 11 benchmarks with only 1.2M public examples and one-day training on 8 A100 GPUs.
-
Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling
Scaling data, model size, and training optimization on the Janus architecture yields better multimodal understanding and more stable, instruction-following text-to-image generation.
Reference graph
Works this paper leans on
-
[1]
Scaling Instruction-Finetuned Language Models
Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Eric Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, et al. Scaling instruction-finetuned language models. arXiv preprint arXiv:2210.11416, 2022
- [2]
-
[3]
OpenAI. Introducing chatgpt. https://openai.com/blog/chatgpt, 2022
-
[4]
FastChat. Vicuna. https://github.com/lm-sys/FastChat, 2023
-
[5]
LLaMA: Open and Efficient Foundation Language Models
Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023
-
[6]
Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. ICML, 2023
-
[7]
MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models
Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny. Minigpt-4: Enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592, 2023
-
[8]
Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. arXiv preprint arXiv:2304.08485, 2023
-
[9]
mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality
Qinghao Ye, Haiyang Xu, Guohai Xu, Jiabo Ye, Ming Yan, Yiyang Zhou, Junyang Wang, Anwen Hu, Pengcheng Shi, Yaya Shi, et al. mplug-owl: Modularization empowers large language models with multimodality. arXiv preprint arXiv:2304.14178, 2023
-
[10]
InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning
Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, and Steven Hoi. Instructblip: Towards general-purpose vision-language models with instruction tuning. arXiv preprint arXiv:2305.06500, 2023
-
[11]
Otter: A Multi-Modal Model with In-Context Instruction Tuning
Bo Li, Yuanhan Zhang, Liangyu Chen, Jinghao Wang, Jingkang Yang, and Ziwei Liu. Otter: A multi-modal model with in-context instruction tuning. arXiv preprint arXiv:2305.03726, 2023
-
[12]
Multimodal-gpt: A vision and language model for dialogue with humans, 2023
Tao Gong, Chengqi Lyu, Shilong Zhang, Yudong Wang, Miao Zheng, Qian Zhao, Kuikun Liu, Wenwei Zhang, Ping Luo, and Kai Chen. Multimodal-gpt: A vision and language model for dialogue with humans, 2023
-
[13]
PandaGPT: One Model to Instruction-Follow Them All
Yixuan Su, Tian Lan, Huayang Li, Jialu Xu, Yan Wang, and Deng Cai. Pandagpt: One model to instruction-follow them all. arXiv preprint arXiv:2305.16355, 2023
-
[14]
Kosmos-2: Grounding Multimodal Large Language Models to the World
Zhiliang Peng, Wenhui Wang, Li Dong, Yaru Hao, Shaohan Huang, Shuming Ma, and Furu Wei. Kosmos-2: Grounding multimodal large language models to the world. arXiv preprint arXiv:2306.14824, 2023
-
[15]
VideoChat: Chat-Centric Video Understanding
KunChang Li, Yinan He, Yi Wang, Yizhuo Li, Wenhai Wang, Ping Luo, Yali Wang, Limin Wang, and Yu Qiao. Videochat: Chat-centric video understanding. arXiv preprint arXiv:2305.06355, 2023
-
[16]
Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models
Muhammad Maaz, Hanoona Rasheed, Salman Khan, and Fahad Shahbaz Khan. Video-chatgpt: Towards detailed video understanding via large vision and language models. arXiv preprint arXiv:2306.05424, 2023
-
[17]
Valley: Video Assistant with Large Language Model Enhanced Ability
Ruipu Luo, Ziwang Zhao, Min Yang, Junwei Dong, Minghui Qiu, Pengcheng Lu, Tao Wang, and Zhongyu Wei. Valley: Video assistant with large language model enhanced ability. arXiv preprint arXiv:2306.07207, 2023
-
[18]
Planting a seed of vision in large language model
Yuying Ge, Yixiao Ge, Ziyun Zeng, Xintao Wang, and Ying Shan. Planting a seed of vision in large language model. arXiv preprint arXiv:2307.08041, 2023
-
[19]
Generative pretraining in multimodality
Quan Sun, Qiying Yu, Yufeng Cui, Fan Zhang, Xiaosong Zhang, Yueze Wang, Hongcheng Gao, Jingjing Liu, Tiejun Huang, and Xinlong Wang. Generative pretraining in multimodality. arXiv preprint arXiv:2307.05222, 2023
-
[20]
Scaling autoregressive multi-modal models: Pretraining and instruction tuning
Yu Lili, Shi Bowen, Pasunuru Ram, Miller Benjamin, Golovneva Olga, Wang Tianlu, Babu Arun, Tang Binh, Karrer Brian, Sheynin Shelly, Ross Candace, Polyak Adam, Howes Russ, Sharma Vasu, Xu Jacob, Singer Uriel, Li (AI) Daniel, Ghosh Gargi, Taigman Yaniv, Fazel-Zarandi Maryam, Celikyilmaz Asli, Zettlemoyer Luke, and Aghajanyan Armen. Scaling autoregressive mu...
-
[21]
Generating images with multimodal language models
Jing Yu Koh, Daniel Fried, and Ruslan Salakhutdinov. Generating images with multimodal language models. arXiv preprint arXiv:2305.17216, 2023
-
[22]
Making the v in vqa matter: Elevating the role of image understanding in visual question answering
Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. Making the v in vqa matter: Elevating the role of image understanding in visual question answering. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 6904–6913, 2017
-
[23]
MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models
Chaoyou Fu, Peixian Chen, Yunhang Shen, Yulei Qin, Mengdan Zhang, Xu Lin, Zhenyu Qiu, Wei Lin, Jinrui Yang, Xiawu Zheng, Ke Li, Xing Sun, and Rongrong Ji. Mme: A comprehensive evaluation benchmark for multimodal large language models. arXiv preprint arXiv:2306.13394, 2023
-
[24]
Lamm: Language-assisted multi-modal instruction-tuning dataset, framework, and benchmark
Zhenfei Yin, Jiong Wang, Jianjian Cao, Zhelun Shi, Dingning Liu, Mukai Li, Lu Sheng, Lei Bai, Xiaoshui Huang, Zhiyong Wang, et al. Lamm: Language-assisted multi-modal instruction-tuning dataset, framework, and benchmark. arXiv preprint arXiv:2306.06687, 2023
-
[25]
Lvlm-ehub: A comprehensive evaluation benchmark for large vision-language models
Peng Xu, Wenqi Shao, Kaipeng Zhang, Peng Gao, Shuo Liu, Meng Lei, Fanqing Meng, Siyuan Huang, Yu Qiao, and Ping Luo. Lvlm-ehub: A comprehensive evaluation benchmark for large vision-language models. arXiv preprint arXiv:2306.09265, 2023
-
[26]
MMBench: Is Your Multi-modal Model an All-around Player?
Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, et al. Mmbench: Is your multi-modal model an all-around player? arXiv preprint arXiv:2307.06281, 2023
-
[27]
Tag2text: Guiding vision-language model via image tagging
Xinyu Huang, Youcai Zhang, Jinyu Ma, Weiwei Tian, Rui Feng, Yuejie Zhang, Yaqian Li, Yandong Guo, and Lei Zhang. Tag2text: Guiding vision-language model via image tagging. arXiv preprint arXiv:2303.05657, 2023
-
[28]
Grit: A generative region-to-text transformer for object understanding
Jialian Wu, Jianfeng Wang, Zhengyuan Yang, Zhe Gan, Zicheng Liu, Junsong Yuan, and Lijuan Wang. Grit: A generative region-to-text transformer for object understanding. arXiv preprint arXiv:2212.00280, 2022
-
[29]
Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C. Berg, Wan-Yen Lo, Piotr Dollár, and Ross Girshick. Segment anything. arXiv:2304.02643, 2023
-
[30]
Vinvl: Revisiting visual representations in vision-language models
Pengchuan Zhang, Xiujun Li, Xiaowei Hu, Jianwei Yang, Lei Zhang, Lijuan Wang, Yejin Choi, and Jianfeng Gao. Vinvl: Revisiting visual representations in vision-language models. In CVPR, 2021
- [31]
-
[32]
Language models are few-shot learners
Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020
-
[33]
Guangzhi Wang, Yixiao Ge, Xiaohan Ding, Mohan Kankanhalli, and Ying Shan. What makes for good visual tokenizers for large language models? arXiv preprint arXiv:2305.12223, 2023
-
[34]
Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning
Piyush Sharma, Nan Ding, Sebastian Goodman, and Radu Soricut. Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In ACL, 2018
-
[35]
Raghav Goyal, Samira Ebrahimi Kahou, Vincent Michalski, Joanna Materzynska, Susanne Westphal, Heuna Kim, Valentin Haenel, Ingo Fruend, Peter Yianilos, Moritz Mueller-Freitag, et al. The "something something" video database for learning and evaluating visual common sense. In ICCV, 2017
-
[36]
Dima Damen, Hazel Doughty, Giovanni Maria Farinella, Antonino Furnari, Evangelos Kazakos, Jian Ma, Davide Moltisanti, Jonathan Munro, Toby Perrett, Will Price, et al. Rescaling egocentric vision. arXiv preprint arXiv:2006.13256, 2020
-
[37]
The language of actions: Recovering the syntax and semantics of goal-directed human activities
Hilde Kuehne, Ali Arslan, and Thomas Serre. The language of actions: Recovering the syntax and semantics of goal-directed human activities. In CVPR, 2014
-
[38]
Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In ICML, 2022
-
[39]
TruthfulQA: Measuring How Models Mimic Human Falsehoods
Stephanie Lin, Jacob Hilton, and Owain Evans. Truthfulqa: Measuring how models mimic human falsehoods. arXiv preprint arXiv:2109.07958, 2021
-
[40]
Transfer visual prompt generator across llms
Ao Zhang, Hao Fei, Yuan Yao, Wei Ji, Li Li, Zhiyuan Liu, and Tat-Seng Chua. Transfer visual prompt generator across llms. arXiv preprint arXiv:2305.01278, 2023
-
[41]
ml_foundations. Openflamingo. https://github.com/mlfoundations/open_flamingo, 2023
-
[42]
LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model
Peng Gao, Jiaming Han, Renrui Zhang, Ziyi Lin, Shijie Geng, Aojun Zhou, Wei Zhang, Pan Lu, Conghui He, Xiangyu Yue, Hongsheng Li, and Yu Qiao. Llama-adapter v2: Parameter-efficient visual instruction model. arXiv preprint arXiv:2304.15010, 2023