SEED-Bench: Benchmarking Multimodal LLMs with Generative Comprehension
Pith reviewed 2026-05-12 16:55 UTC · model grok-4.3
The pith
SEED-Bench supplies 19K human-verified multiple-choice questions to measure multimodal LLMs on image and video comprehension across 12 dimensions.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
SEED-Bench consists of 19K multiple-choice questions with accurate human annotations, spanning 12 evaluation dimensions that cover comprehension of both the image and video modalities, and enabling an objective and efficient assessment of model performance without human or GPT intervention during evaluation.
What carries the argument
A pipeline that generates multiple-choice questions targeting specific evaluation dimensions, combining automatic filtering with manual verification.
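A minimal sketch of that shape of pipeline, under assumed interfaces (a question generator, a text-only filter model, and a manual-verification callback); the paper's actual Section 3 pipeline differs in its specifics:

```python
# Hypothetical sketch of a generate -> auto-filter -> manually-verify pipeline.
# All interfaces (generator, text_only_model, verify_fn) are assumptions for
# illustration, not the paper's actual implementation.
from dataclasses import dataclass

@dataclass
class MCQ:
    question: str
    options: list[str]      # candidate answer options
    answer_idx: int         # index of the human-annotated ground-truth option
    dimension: str          # one of the 12 evaluation dimensions

def auto_filter(candidates, text_only_model, max_blind_conf=0.5):
    """Drop questions that a text-only model answers too confidently,
    since those are likely solvable from language priors alone."""
    kept = []
    for q in candidates:
        blind_conf = text_only_model.option_confidence(q.question, q.options)
        if max(blind_conf) < max_blind_conf:
            kept.append(q)
    return kept

def build_benchmark(visual_annotations, generator, text_only_model, verify_fn):
    candidates = [generator.make_mcq(ann) for ann in visual_annotations]
    filtered = auto_filter(candidates, text_only_model)
    # Manual verification: annotators confirm the question and ground-truth option.
    return [q for q in filtered if verify_fn(q)]
```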
If this is right
- Evaluating 18 models across all 12 dimensions reveals concrete limitations in current MLLMs for both spatial and temporal understanding.
- The benchmark supports consistent leaderboard tracking that lets the community compare progress without repeated human judgment.
- Insights from the results can directly motivate targeted improvements in models that handle image and video modalities together.
Where Pith is reading between the lines
- Widespread use of this benchmark could make cross-model comparisons more reliable by fixing the question set and scoring method.
- The scale and verification process may encourage development of models that maintain performance when questions shift from multiple choice to free-form generation.
- Extending similar pipelines to new modalities could help identify whether comprehension gaps are modality-specific or general.
Load-bearing premise
The questions produced by automatic generation plus manual verification actually test genuine generative comprehension instead of artifacts from the creation process.
What would settle it
An experiment showing that models scoring highest on SEED-Bench still fail to produce accurate open-ended descriptions or answers on the same image and video content.
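A hedged sketch of how that comparison could be scored, assuming a hypothetical model interface with both an MCQ and an open-ended answering mode, plus an external judge for the free-form answers:

```python
# Sketch of the proposed experiment: does high MCQ accuracy on SEED-Bench imply
# accurate open-ended answers on the same images/videos? `model` and
# `judge_open_ended` are hypothetical interfaces, not part of the benchmark.
def mcq_vs_open_ended(model, items, judge_open_ended):
    mcq_right, open_right, both_right = 0, 0, 0
    for item in items:
        pred_idx = model.answer_mcq(item["visual"], item["question"], item["options"])
        mcq_ok = pred_idx == item["answer_idx"]
        free_text = model.answer_open_ended(item["visual"], item["question"])
        open_ok = judge_open_ended(free_text, item)  # e.g., human rating
        mcq_right += mcq_ok
        open_right += open_ok
        both_right += mcq_ok and open_ok
    n = len(items)
    return {
        "mcq_acc": mcq_right / n,
        "open_ended_acc": open_right / n,
        # low conditional agreement would mean MCQ scores overstate comprehension
        "open_ended_acc_given_mcq_correct": both_right / max(mcq_right, 1),
    }
```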
Original abstract
Based on powerful Large Language Models (LLMs), recent generative Multimodal Large Language Models (MLLMs) have gained prominence as a pivotal research area, exhibiting remarkable capability for both comprehension and generation. In this work, we address the evaluation of generative comprehension in MLLMs as a preliminary step towards a comprehensive assessment of generative models, by introducing a benchmark named SEED-Bench. SEED-Bench consists of 19K multiple choice questions with accurate human annotations (x 6 larger than existing benchmarks), which spans 12 evaluation dimensions including the comprehension of both the image and video modality. We develop an advanced pipeline for generating multiple-choice questions that target specific evaluation dimensions, integrating both automatic filtering and manual verification processes. Multiple-choice questions with groundtruth options derived from human annotation enables an objective and efficient assessment of model performance, eliminating the need for human or GPT intervention during evaluation. We further evaluate the performance of 18 models across all 12 dimensions, covering both the spatial and temporal understanding. By revealing the limitations of existing MLLMs through evaluation results, we aim for SEED-Bench to provide insights for motivating future research. We will launch and consistently maintain a leaderboard to provide a platform for the community to assess and investigate model capability.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces SEED-Bench, a benchmark of 19K multiple-choice questions with human annotations for evaluating generative comprehension in Multimodal LLMs (MLLMs). It spans 12 dimensions covering spatial and temporal understanding of both image and video modalities, constructed via an automatic question-generation pipeline with filtering and manual verification. The authors evaluate 18 existing MLLMs on the benchmark, reveal their limitations, and announce a public leaderboard.
Significance. If validated to require genuine multimodal input, SEED-Bench would be a meaningful contribution due to its scale (six times larger than prior benchmarks) and broad coverage of 12 dimensions. A well-controlled benchmark of this size could standardize evaluation of MLLM comprehension and guide improvements in visual-language integration.
major comments (3)
- [Section 3] Benchmark construction (Section 3): The pipeline description provides no quantitative evidence that questions cannot be solved from question text and options alone (e.g., no text-only baseline accuracy reported, no ablation removing images/videos). This directly undermines the central claim that performance measures multimodal comprehension rather than language priors.
- [Section 3.2] Annotation process (Section 3.2): No inter-annotator agreement statistics or details on how the 12 evaluation dimensions were selected and operationalized are reported, weakening confidence that the 19K questions reliably target the intended spatial/temporal capabilities.
- [Section 4] Evaluation results (Section 4): The reported model scores lack analysis of whether errors correlate with visual content (e.g., via attention maps or controlled perturbations); without this, it is unclear whether the benchmark isolates the claimed generative comprehension limitations.
minor comments (2)
- [Abstract] The abstract and introduction repeat the 'x6 larger' claim without citing the exact sizes of the compared benchmarks.
- [Figure 1] Figure 1 caption could more explicitly label the 12 dimensions and their image/video split for quick reference.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed comments, which help strengthen the manuscript. We address each major comment point by point below, agreeing where revisions are warranted and providing clarifications where the existing work already supports our claims. We will update the paper accordingly in the revised version.
Point-by-point responses
-
Referee: [Section 3] Benchmark construction (Section 3): The pipeline description provides no quantitative evidence that questions cannot be solved from question text and options alone (e.g., no text-only baseline accuracy reported, no ablation removing images/videos). This directly undermines the central claim that performance measures multimodal comprehension rather than language priors.
Authors: We agree that explicit quantitative validation is important to confirm the benchmark requires multimodal input. Although the questions are generated from visual content with human-annotated ground truth and filtered to target specific visual dimensions, we did not report a text-only baseline in the original submission. In the revised manuscript, we will add evaluations of multiple models on the text-only version of SEED-Bench, demonstrating substantially lower accuracy without images or videos. This will directly support that the benchmark measures generative multimodal comprehension rather than language priors alone. revision: yes
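A minimal sketch of such a text-only ablation, assuming generic item fields and a log-likelihood scoring interface rather than the authors' evaluation code:

```python
# Sketch of the text-only ablation: score each option from the question text
# alone, with no image or video. The item fields and `loglik` interface are
# assumptions for illustration.
import random

def text_only_accuracy(language_model, items):
    correct = 0
    for it in items:
        # rank options by the LM's log-likelihood of the option given the question
        scores = [language_model.loglik(it["question"], opt) for opt in it["options"]]
        correct += scores.index(max(scores)) == it["answer_idx"]
    return correct / len(items)

def chance_level(items, seed=0):
    rng = random.Random(seed)
    hits = sum(rng.randrange(len(it["options"])) == it["answer_idx"] for it in items)
    return hits / len(items)  # ~0.25 for four-option questions

# Text-only accuracy near chance would support the claim that the benchmark
# requires visual input; accuracy well above chance would point to language priors.
```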
-
Referee: [Section 3.2] Annotation process (Section 3.2): No inter-annotator agreement statistics or details on how the 12 evaluation dimensions were selected and operationalized are reported, weakening confidence that the 19K questions reliably target the intended spatial/temporal capabilities.
Authors: We acknowledge the value of reporting inter-annotator agreement to increase confidence in the annotations. We will add these statistics (e.g., agreement rates across the manual verification step) to the revised Section 3.2. The 12 dimensions were selected to comprehensively cover spatial and temporal understanding for both images and videos, drawing from established categories in visual reasoning and video comprehension literature. We will expand the description of how each dimension is operationalized through targeted question templates and examples in the updated manuscript. revision: yes
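As an illustration of the kind of statistic that could be reported, the sketch below computes Cohen's kappa over two annotators' keep/reject decisions; the label format is an assumption, not the paper's annotation protocol.

```python
# Cohen's kappa over two annotators' keep/reject labels from manual verification.
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    return 1.0 if expected == 1 else (observed - expected) / (1 - expected)

# Example: two annotators labelling six candidate questions as keep (1) / reject (0).
print(cohens_kappa([1, 1, 0, 1, 0, 1], [1, 1, 0, 0, 0, 1]))  # ≈ 0.67
```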
-
Referee: [Section 4] Evaluation results (Section 4): The reported model scores lack analysis of whether errors correlate with visual content (e.g., via attention maps or controlled perturbations); without this, it is unclear whether the benchmark isolates the claimed generative comprehension limitations.
Authors: This is a fair point for deeper validation of error sources. The current results already show systematic weaknesses across models on specific dimensions (e.g., temporal reasoning), which we attribute to multimodal integration challenges based on the question design. However, attention map analysis or systematic perturbations would require additional experiments not included in this benchmark-focused work. In the revision, we will incorporate a qualitative error analysis with example cases linking failures to visual elements, along with a discussion of how such analyses could be pursued in future work. revision: partial
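A sketch of one such controlled perturbation, assuming a hypothetical model interface: re-pair each question with a mismatched image or clip and measure the accuracy drop.

```python
# Re-pair each question with a randomly shuffled (mismatched) visual input and
# compare accuracy against the original pairing. The `answer_mcq` interface and
# item fields are hypothetical.
import random

def perturbation_gap(model, items, seed=0):
    rng = random.Random(seed)
    visuals = [it["visual"] for it in items]

    def accuracy(paired_visuals):
        hits = sum(
            model.answer_mcq(v, it["question"], it["options"]) == it["answer_idx"]
            for v, it in zip(paired_visuals, items)
        )
        return hits / len(items)

    shuffled = visuals[:]
    rng.shuffle(shuffled)
    original, mismatched = accuracy(visuals), accuracy(shuffled)
    # A large drop suggests answers depend on the visual content; a small drop
    # suggests they are driven by language priors or option artifacts.
    return {"original_acc": original, "mismatched_acc": mismatched,
            "gap": original - mismatched}
```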
Circularity Check
No circularity: benchmark construction is descriptive and externally verifiable
Full rationale
The paper introduces SEED-Bench via an explicit pipeline of automatic question generation, filtering, and human annotation/verification to produce 19K MCQs across 12 dimensions. No equations, fitted parameters, predictions, or derivations are claimed. The central claim (that the resulting questions enable objective evaluation of MLLM comprehension) rests on the described human-verified ground truth rather than reducing to self-definition or self-citation. Evaluation of 18 external models occurs after benchmark creation, providing an independent test. This matches the default expectation of a self-contained benchmark paper with no load-bearing circular steps.
Forward citations
Cited by 38 Pith papers
-
MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI
MMMU provides 11.5K heterogeneous college-level multimodal questions that current models solve at 56-59% accuracy, establishing a new standard for expert multimodal evaluation.
-
CXR-ContraBench: Benchmarking Negated-Option Attraction in Medical VLMs
Medical VLMs frequently select negated options that contradict visible chest X-ray findings, achieving only ~30% accuracy on direct presence probes, but a post-hoc consistency verifier raises accuracy above 95%.
-
Rethinking Model Selection in VLM Through the Lens of Gromov-Wasserstein Distance
Gromov-Wasserstein distance between modalities provides a stronger, inference-only predictor of final VLM performance than conventional encoder metrics, backed by theory linking it to cross-modal learnability and veri...
-
COHERENCE: Benchmarking Fine-Grained Image-Text Alignment in Interleaved Multimodal Contexts
COHERENCE is a new benchmark for measuring MLLMs' ability to recover fine-grained image-text correspondences in interleaved multimodal contexts.
-
COHERENCE: Benchmarking Fine-Grained Image-Text Alignment in Interleaved Multimodal Contexts
COHERENCE is a benchmark for MLLMs' fine-grained image-text alignment in interleaved multimodal contexts across four domains, with 6161 questions and six-type error analysis.
-
Beyond Accuracy: Benchmarking Cross-Task Consistency in Unified Multimodal Models
XTC-Bench reveals that strong performance on generation or understanding tasks in unified multimodal models does not guarantee cross-task semantic consistency, which instead depends on how tightly coupled the learning...
-
ShredBench: Evaluating the Semantic Reasoning Capabilities of Multimodal LLMs in Document Reconstruction
ShredBench shows state-of-the-art MLLMs perform well on intact documents but suffer sharp drops in restoration accuracy as fragmentation increases to 8-16 pieces, indicating insufficient cross-modal semantic reasoning...
-
Culture-Aware Humorous Captioning: Multimodal Humor Generation across Cultural Contexts
Introduces culture-aware humorous captioning task and staged alignment framework that improves contextual fit and balances image relevance with humor in multimodal LLMs.
-
Mind's Eye: A Benchmark of Visual Abstraction, Transformation and Composition for Multimodal LLMs
Mind's Eye benchmark shows top multimodal LLMs score below 50% on visual abstraction, relation, and transformation tasks while humans reach 80%.
-
VisPCO: Visual Token Pruning Configuration Optimization via Budget-Aware Pareto-Frontier Learning for Vision-Language Models
VisPCO uses continuous relaxation, straight-through estimators, and budget-aware Pareto-frontier learning to automatically discover optimal visual token pruning configurations that approximate grid-search results acro...
-
GeoMMBench and GeoMMAgent: Toward Expert-Level Multimodal Intelligence in Geoscience and Remote Sensing
GeoMMBench reveals deficiencies in current multimodal LLMs for geoscience tasks while GeoMMAgent demonstrates that tool-integrated agents achieve significantly higher performance.
-
Seeing the Scene Matters: Revealing Forgetting in Video Understanding Models with a Scene-Aware Long-Video Benchmark
SceneBench shows VLMs lose accuracy on scene-level questions in long videos due to forgetting, and Scene-RAG retrieval improves performance by 2.5%.
-
LLMind: Bio-inspired Training-free Adaptive Visual Representations for Vision-Language Models
LLMind uses bio-inspired non-uniform sampling via a Mobius module and closed-loop semantic feedback to retain 82-97% of full-resolution VLM performance with only 1-5% of pixels on VQA benchmarks.
-
MLVU: Benchmarking Multi-task Long Video Understanding
MLVU is a new benchmark for long video understanding that uses extended videos across diverse genres and multi-task evaluations, revealing that current MLLMs struggle significantly and degrade sharply with longer durations.
-
20/20 Vision Language Models: A Prescription for Better VLMs through Data Curation Alone
Data curation alone raises VLM accuracy by more than 11 points on average across many benchmarks while cutting required training compute by up to 87 times.
-
20/20 Vision Language Models: A Prescription for Better VLMs through Data Curation Alone
Data curation alone raises VLM accuracy by 11+ points on average, improves reliability and OOD generalization, and achieves near-frontier results at far lower training and inference cost.
-
dFlowGRPO: Rate-Aware Policy Optimization for Discrete Flow Models
dFlowGRPO is a new rate-aware RL method for discrete flow models that outperforms prior GRPO approaches on image generation and matches continuous flow models while supporting broad probability paths.
-
Where Reliability Lives in Vision-Language Models: A Mechanistic Study of Attention, Hidden States, and Causal Circuits
Attention sharpness barely predicts VLM correctness while hidden-state probes and self-consistency strongly do, with late-fusion models showing fragile reliability bottlenecks unlike early-fusion ones.
-
Latent Denoising Improves Visual Alignment in Large Multimodal Models
A latent denoising objective with saliency-aware corruption and contrastive distillation improves visual alignment and corruption robustness in large multimodal models.
-
Towards Joint Quantization and Token Pruning of Vision-Language Models
QUOTA jointly optimizes low-bit quantization and visual token pruning for VLMs by deriving pruning decisions from quantized operators, achieving 95.65% average performance retention with only 30% of visual tokens vers...
-
PivotMerge: Bridging Heterogeneous Multimodal Pre-training via Post-Alignment Model Merging
PivotMerge merges heterogeneous multimodal pre-trained models via shared-space decomposition to filter conflicts and layer-wise weights based on alignment contributions, outperforming baselines on multimodal benchmarks.
-
From Perception to Planning: Evolving Ego-Centric Task-Oriented Spatiotemporal Reasoning via Curriculum Learning
EgoTSR applies a three-stage curriculum on a 46-million-sample dataset to build egocentric spatiotemporal reasoning, reaching 92.4% accuracy on long-horizon tasks and reducing chronological biases.
-
Emu3: Next-Token Prediction is All You Need
Emu3 shows that next-token prediction on a unified discrete token space for text, images, and video lets a single transformer outperform task-specific models such as SDXL and LLaVA-1.6 in multimodal generation and perception.
-
Are We on the Right Way for Evaluating Large Vision-Language Models?
Current LVLM benchmarks overestimate capabilities because many questions can be answered without images due to design flaws or data leakage; MMStar is a human-curated set of 1,500 vision-indispensable samples across 6...
-
DriveVLM: The Convergence of Autonomous Driving and Large Vision-Language Models
DriveVLM adds vision-language models with scene description, analysis, and hierarchical planning modules to autonomous driving, paired with a hybrid DriveVLM-Dual system tested on nuScenes and SUP-AD datasets and depl...
-
ShareGPT4V: Improving Large Multi-Modal Models with Better Captions
A new 1.2M-caption dataset generated via GPT-4V improves LMMs on MME and MMBench by 222.8/22.0/22.3 and 2.7/1.3/1.5 points respectively when used for supervised fine-tuning.
-
MMBench: Is Your Multi-modal Model an All-around Player?
MMBench is a new bilingual benchmark that uses curated questions, CircularEval, and LLM-assisted answer conversion to provide objective, fine-grained evaluation of vision-language models.
-
ML-CLIPSim: Multi-Layer CLIP Similarity for Machine-Oriented Image Quality
ML-CLIPSim aggregates multi-layer patch and global similarities from frozen CLIP to approximate machine utility for images and outperforms standard IQA metrics on machine-preference tasks while staying competitive on ...
-
Structural Ranking of the Cognitive Plausibility of Computational Models of Analogy and Metaphors with the Minimal Cognitive Grid
A formalized Minimal Cognitive Grid ranks computational models of analogy and metaphor by alignment with cognitive theories using Functional/Structural Ratio, Generality, and Performance Match dimensions.
-
HalluClear: Diagnosing, Evaluating and Mitigating Hallucinations in GUI Agents
HalluClear supplies a taxonomy, calibrated evaluation, and lightweight post-training mitigation that reduces hallucinations in GUI agents using only 9K samples.
-
Spotlight and Shadow: Attention-Guided Dual-Anchor Introspective Decoding for MLLM Hallucination Mitigation
DaID mitigates MLLM hallucinations by attention-guided selection of dual layers that calibrate token generation using internal perceptual discrepancies.
-
BLIP3-o: A Family of Fully Open Unified Multimodal Models-Architecture, Training and Dataset
BLIP3-o uses a diffusion transformer to generate CLIP image features and a sequential pretraining strategy to build open models that perform strongly on both image understanding and generation benchmarks.
-
ZAYA1-VL-8B Technical Report
ZAYA1-VL-8B is a new MoE vision-language model with vision-specific LoRA adapters and bidirectional image attention that reports competitive performance against several 3B-4B models on image, reasoning, and counting b...
-
Show-o2: Improved Native Unified Multimodal Models
Show-o2 unifies text, image, and video understanding and generation in a single autoregressive-plus-flow-matching model built on 3D causal VAE representations.
-
How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites
InternVL 1.5 narrows the performance gap to proprietary multimodal models via a stronger transferable vision encoder, dynamic high-resolution tiling, and curated English-Chinese training data.
-
DeepSeek-VL: Towards Real-World Vision-Language Understanding
DeepSeek-VL develops open-source 1.3B and 7B vision-language models that achieve competitive or state-of-the-art results on real-world visual-language benchmarks through diverse data curation, a hybrid vision encoder,...
-
Improved Baselines with Visual Instruction Tuning
Simple changes to LLaVA using CLIP-ViT-L-336px, an MLP connector, and academic VQA data yield state-of-the-art results on 11 benchmarks with only 1.2M public examples and one-day training on 8 A100 GPUs.
-
Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling
Scaling data, model size, and training optimization on the Janus architecture yields better multimodal understanding and more stable, instruction-following text-to-image generation.
Reference graph
Works this paper leans on
-
[1]
Scaling Instruction-Finetuned Language Models
Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Eric Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, et al. Scaling instruction-finetuned language models. arXiv preprint arXiv:2210.11416, 2022
- [2]
-
[3]
OpenAI. Introducing chatgpt. https://openai.com/blog/chatgpt, 2022
-
[4]
FastChat. Vicuna. https://github.com/lm-sys/FastChat, 2023
-
[5]
LLaMA: Open and Efficient Foundation Language Models
Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023
-
[6]
Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. ICML, 2023
-
[7]
MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models
Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny. Minigpt-4: Enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592, 2023
-
[8]
Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. arXiv preprint arXiv:2304.08485, 2023
-
[9]
mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality
Qinghao Ye, Haiyang Xu, Guohai Xu, Jiabo Ye, Ming Yan, Yiyang Zhou, Junyang Wang, Anwen Hu, Pengcheng Shi, Yaya Shi, et al. mplug-owl: Modularization empowers large language models with multimodality. arXiv preprint arXiv:2304.14178, 2023
-
[10]
InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning
Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, and Steven Hoi. Instructblip: Towards general-purpose vision-language models with instruction tuning. arXiv preprint arXiv:2305.06500, 2023
-
[11]
Otter: A Multi-Modal Model with In-Context Instruction Tuning
Bo Li, Yuanhan Zhang, Liangyu Chen, Jinghao Wang, Jingkang Yang, and Ziwei Liu. Otter: A multi-modal model with in-context instruction tuning. arXiv preprint arXiv:2305.03726, 2023
-
[12]
Multimodal-gpt: A vision and language model for dialogue with humans, 2023
Tao Gong, Chengqi Lyu, Shilong Zhang, Yudong Wang, Miao Zheng, Qian Zhao, Kuikun Liu, Wenwei Zhang, Ping Luo, and Kai Chen. Multimodal-gpt: A vision and language model for dialogue with humans, 2023
-
[13]
PandaGPT: One Model to Instruction-Follow Them All
Yixuan Su, Tian Lan, Huayang Li, Jialu Xu, Yan Wang, and Deng Cai. Pandagpt: One model to instruction-follow them all. arXiv preprint arXiv:2305.16355, 2023
-
[14]
Kosmos-2: Grounding Multimodal Large Language Models to the World
Zhiliang Peng, Wenhui Wang, Li Dong, Yaru Hao, Shaohan Huang, Shuming Ma, and Furu Wei. Kosmos-2: Grounding multimodal large language models to the world. arXiv preprint arXiv:2306.14824, 2023
-
[15]
VideoChat: Chat-Centric Video Understanding
KunChang Li, Yinan He, Yi Wang, Yizhuo Li, Wenhai Wang, Ping Luo, Yali Wang, Limin Wang, and Yu Qiao. Videochat: Chat-centric video understanding. arXiv preprint arXiv:2305.06355, 2023
-
[16]
Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models
Muhammad Maaz, Hanoona Rasheed, Salman Khan, and Fahad Shahbaz Khan. Video-chatgpt: Towards detailed video understanding via large vision and language models. arXiv preprint arXiv:2306.05424, 2023
-
[17]
Valley: Video Assistant with Large Language Model Enhanced Ability
Ruipu Luo, Ziwang Zhao, Min Yang, Junwei Dong, Minghui Qiu, Pengcheng Lu, Tao Wang, and Zhongyu Wei. Valley: Video assistant with large language model enhanced ability. arXiv preprint arXiv:2306.07207, 2023
-
[18]
Planting a seed of vision in large language model
Yuying Ge, Yixiao Ge, Ziyun Zeng, Xintao Wang, and Ying Shan. Planting a seed of vision in large language model. arXiv preprint arXiv:2307.08041, 2023
-
[19]
Generative pretraining in multimodality
Quan Sun, Qiying Yu, Yufeng Cui, Fan Zhang, Xiaosong Zhang, Yueze Wang, Hongcheng Gao, Jingjing Liu, Tiejun Huang, and Xinlong Wang. Generative pretraining in multimodality. arXiv preprint arXiv:2307.05222, 2023
-
[20]
Scaling autoregressive multi-modal models: Pretraining and instruction tuning
Yu Lili, Shi Bowen, Pasunuru Ram, Miller Benjamin, Golovneva Olga, Wang Tianlu, Babu Arun, Tang Binh, Karrer Brian, Sheynin Shelly, Ross Candace, Polyak Adam, Howes Russ, Sharma Vasu, Xu Jacob, Singer Uriel, Li (AI) Daniel, Ghosh Gargi, Taigman Yaniv, Fazel-Zarandi Maryam, Celikyilmaz Asli, Zettlemoyer Luke, and Aghajanyan Armen. Scaling autoregressive mu...
-
[21]
Generating images with multimodal language models
Jing Yu Koh, Daniel Fried, and Ruslan Salakhutdinov. Generating images with multimodal language models. arXiv preprint arXiv:2305.17216, 2023
-
[22]
Making the v in vqa matter: Elevating the role of image understanding in visual question answering
Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. Making the v in vqa matter: Elevating the role of image understanding in visual question answering. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 6904–6913, 2017
-
[23]
MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models
Chaoyou Fu, Peixian Chen, Yunhang Shen, Yulei Qin, Mengdan Zhang, Xu Lin, Zhenyu Qiu, Wei Lin, Jinrui Yang, Xiawu Zheng, Ke Li, Xing Sun, and Rongrong Ji. Mme: A comprehensive evaluation benchmark for multimodal large language models. arXiv preprint arXiv:2306.13394, 2023
-
[24]
Lamm: Language-assisted multi-modal instruction-tuning dataset, framework, and benchmark
Zhenfei Yin, Jiong Wang, Jianjian Cao, Zhelun Shi, Dingning Liu, Mukai Li, Lu Sheng, Lei Bai, Xiaoshui Huang, Zhiyong Wang, et al. Lamm: Language-assisted multi-modal instruction-tuning dataset, framework, and benchmark. arXiv preprint arXiv:2306.06687, 2023
-
[25]
Lvlm-ehub: A comprehensive evaluation benchmark for large vision-language models
Peng Xu, Wenqi Shao, Kaipeng Zhang, Peng Gao, Shuo Liu, Meng Lei, Fanqing Meng, Siyuan Huang, Yu Qiao, and Ping Luo. Lvlm-ehub: A comprehensive evaluation benchmark for large vision-language models. arXiv preprint arXiv:2306.09265, 2023
-
[26]
MMBench: Is Your Multi-modal Model an All-around Player?
Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, et al. Mmbench: Is your multi-modal model an all-around player? arXiv preprint arXiv:2307.06281, 2023
-
[27]
Tag2text: Guiding vision-language model via image tagging
Xinyu Huang, Youcai Zhang, Jinyu Ma, Weiwei Tian, Rui Feng, Yuejie Zhang, Yaqian Li, Yandong Guo, and Lei Zhang. Tag2text: Guiding vision-language model via image tagging. arXiv preprint arXiv:2303.05657, 2023
-
[28]
Grit: A generative region-to-text transformer for object understanding
Jialian Wu, Jianfeng Wang, Zhengyuan Yang, Zhe Gan, Zicheng Liu, Junsong Yuan, and Lijuan Wang. Grit: A generative region-to-text transformer for object understanding. arXiv preprint arXiv:2212.00280, 2022
-
[29]
Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C. Berg, Wan-Yen Lo, Piotr Dollár, and Ross Girshick. Segment anything. arXiv:2304.02643, 2023
-
[30]
Vinvl: Revisiting visual representations in vision-language models
Pengchuan Zhang, Xiujun Li, Xiaowei Hu, Jianwei Yang, Lei Zhang, Lijuan Wang, Yejin Choi, and Jianfeng Gao. Vinvl: Revisiting visual representations in vision-language models. In CVPR, 2021
- [31]
-
[32]
Language models are few-shot learners
Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020
-
[33]
Guangzhi Wang, Yixiao Ge, Xiaohan Ding, Mohan Kankanhalli, and Ying Shan. What makes for good visual tokenizers for large language models? arXiv preprint arXiv:2305.12223, 2023
-
[34]
Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning
Piyush Sharma, Nan Ding, Sebastian Goodman, and Radu Soricut. Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In ACL, 2018
-
[35]
Raghav Goyal, Samira Ebrahimi Kahou, Vincent Michalski, Joanna Materzynska, Susanne Westphal, Heuna Kim, Valentin Haenel, Ingo Fruend, Peter Yianilos, Moritz Mueller-Freitag, et al. The "something something" video database for learning and evaluating visual common sense. In ICCV, 2017
-
[36]
Dima Damen, Hazel Doughty, Giovanni Maria Farinella, Antonino Furnari, Evangelos Kazakos, Jian Ma, Davide Moltisanti, Jonathan Munro, Toby Perrett, Will Price, et al. Rescaling egocentric vision. arXiv preprint arXiv:2006.13256, 2020
-
[37]
The language of actions: Recovering the syntax and semantics of goal-directed human activities
Hilde Kuehne, Ali Arslan, and Thomas Serre. The language of actions: Recovering the syntax and semantics of goal-directed human activities. In CVPR, 2014
-
[38]
Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In ICML, 2022
-
[39]
TruthfulQA: Measuring How Models Mimic Human Falsehoods
Stephanie Lin, Jacob Hilton, and Owain Evans. Truthfulqa: Measuring how models mimic human falsehoods. arXiv preprint arXiv:2109.07958, 2021
-
[40]
Transfer visual prompt generator across llms
Ao Zhang, Hao Fei, Yuan Yao, Wei Ji, Li Li, Zhiyuan Liu, and Tat-Seng Chua. Transfer visual prompt generator across llms. arXiv preprint arXiv:2305.01278, 2023
-
[41]
ml_foundations. Openflamingo. https://github.com/mlfoundations/open_flamingo, 2023
-
[42]
LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model
Peng Gao, Jiaming Han, Renrui Zhang, Ziyi Lin, Shijie Geng, Aojun Zhou, Wei Zhang, Pan Lu, Conghui He, Xiangyu Yue, Hongsheng Li, and Yu Qiao. Llama-adapter v2: Parameter-efficient visual instruction model. arXiv preprint arXiv:2304.15010, 2023