COHERENCE: Benchmarking Fine-Grained Image-Text Alignment in Interleaved Multimodal Contexts
Pith reviewed 2026-05-14 21:13 UTC · model grok-4.3
The pith
The COHERENCE benchmark tests MLLMs on recovering fine-grained image-text correspondences in interleaved contexts.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
COHERENCE is a benchmark of interleaved image-text content from four representative domains that contains 6,161 high-quality questions designed to evaluate the ability of MLLMs to recover fine-grained image-text correspondences. The benchmark further supplies a six-type error analysis that enables fine-grained attribution of failures in interleaved image-text understanding to the specific capabilities missing in current MLLMs.
What carries the argument
The COHERENCE benchmark itself, which supplies questions requiring evidence identification, fine-grained image-text alignment, and contextual reasoning across interleaved material, plus a six-type error taxonomy for attributing model failures.
Load-bearing premise
The selected questions and error categories accurately and without bias capture the fine-grained alignment and reasoning abilities that interleaved contexts demand.
What would settle it
If leading MLLMs score highly on COHERENCE yet continue to fail at locating and aligning evidence when given real interleaved documents, the benchmark does not measure the intended capabilities.
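One hedged way to operationalize this check is to compare per-model COHERENCE scores against scores on an independent real-document alignment task and measure rank correlation. The sketch below uses Kendall's tau; every model name and number in it is illustrative, not a reported result.

```python
from scipy.stats import kendalltau

# Hypothetical per-model scores: COHERENCE vs. an independent real-document
# evidence-alignment task. All names and numbers are illustrative only.
coherence_scores = {"model_a": 71.2, "model_b": 64.5, "model_c": 58.9, "model_d": 52.3}
real_doc_scores  = {"model_a": 45.0, "model_b": 47.1, "model_c": 31.8, "model_d": 29.5}

models = sorted(coherence_scores)
tau, p_value = kendalltau(
    [coherence_scores[m] for m in models],
    [real_doc_scores[m] for m in models],
)
print(f"Kendall tau = {tau:.2f} (p = {p_value:.3f})")
# A tau near zero or negative would mean the benchmark ranking does not
# transfer to real interleaved documents, i.e. the failure condition above.
```

A strong positive correlation would not by itself prove the benchmark measures the intended capabilities, but a weak or negative one would be direct evidence against it.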
Original abstract
In recent years, Multimodal Large Language Models (MLLMs) have achieved remarkable progress on a wide range of multimodal benchmarks. Despite these advances, most existing benchmarks mainly focus on single-image or multi-image comprehension. In real-world scenarios such as document reading, information is often presented as interleaved multimodal contexts. This requires MLLMs not only to recognize the content of individual images, but also to identify relevant textual and visual evidence, establish fine-grained alignments between them, and reason over these aligned signals in interleaved contexts based on contextual evidence. However, there is still a lack of systematic benchmarks for quantifying the fine-grained understanding ability of MLLMs in interleaved image-text contexts. To fill this gap, we propose COHERENCE, a benchmark designed to evaluate the ability of MLLMs to recover fine-grained image-text correspondences in interleaved multimodal contexts. COHERENCE covers interleaved image-text content from four representative domains and contains 6,161 high-quality questions. Moreover, we perform a six-type error analysis, enabling fine-grained attribution of failures in interleaved image-text understanding to the specific capabilities missing in current MLLMs.
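Concretely, the task format suggested by the paper's evaluation prompts is to assign each [IMAGE_PLACEHOLDER] in an interleaved passage to one of the shuffled candidate images, with each image used at most once. Below is a minimal scoring sketch under that assumption; the data structure and metric names are chosen here for illustration and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class InterleavedQuestion:
    """Assumed COHERENCE-style item: interleaved text with image slots (illustrative)."""
    text_with_placeholders: str  # prose containing [IMAGE_PLACEHOLDER] markers
    candidate_images: list[str]  # shuffled candidate image paths or URLs
    gold_assignment: list[int]   # gold image index for each placeholder, in order


def score_assignment(pred: list[int], gold: list[int]) -> dict[str, float]:
    """Per-placeholder accuracy and strict whole-question accuracy."""
    if len(pred) != len(gold) or len(set(pred)) != len(pred):
        # Malformed prediction (wrong length, or an image index reused across
        # placeholders, which breaks the one-image-per-placeholder constraint):
        # count every slot as wrong.
        return {"placeholder_acc": 0.0, "question_acc": 0.0}
    hits = sum(p == g for p, g in zip(pred, gold))
    return {
        "placeholder_acc": hits / len(gold),
        "question_acc": float(hits == len(gold)),
    }


# Example: a four-placeholder question where the model swaps the last two images.
print(score_assignment(pred=[1, 0, 3, 2], gold=[1, 0, 2, 3]))
# -> {'placeholder_acc': 0.5, 'question_acc': 0.0}
```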
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces COHERENCE, a benchmark for assessing MLLMs' fine-grained image-text alignment and reasoning capabilities in interleaved multimodal contexts. It covers content from four representative domains, contains 6,161 questions, and includes a six-type error analysis to attribute model failures to specific missing capabilities.
Significance. If the questions prove high-quality and unbiased, COHERENCE would address a clear gap in existing benchmarks that focus mainly on single- or multi-image tasks, offering a tool to measure interleaved alignment and contextual reasoning relevant to real-world applications such as document understanding. The error typology could help pinpoint targeted improvements in MLLMs.
major comments (2)
- [Benchmark construction] Benchmark construction section: the claim that the 6,161 questions are 'high-quality' and free of systematic bias in domain sampling or annotation requires explicit details on the question-generation pipeline, human annotation guidelines, filtering criteria, and any inter-annotator agreement metrics; without these, the central validity of the benchmark cannot be assessed (see the agreement sketch after this list).
- [Error analysis] Error analysis section: the six error types need quantitative distributions across the dataset plus concrete examples tied to specific questions to demonstrate they comprehensively and non-overlappingly capture failure modes; current attribution of MLLM failures rests on this taxonomy being exhaustive.
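For the annotation-quality concern in the first comment, chance-corrected agreement is the standard check. A minimal sketch, assuming a binary keep/discard filtering decision per candidate question and two annotators; the labels below are illustrative, not the paper's data.

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical labels from two annotators over the same 12 candidate questions
# (e.g., "keep" vs. "discard" during filtering); values are illustrative only.
annotator_1 = ["keep", "keep", "discard", "keep", "keep", "discard",
               "keep", "discard", "keep", "keep", "keep", "discard"]
annotator_2 = ["keep", "keep", "discard", "keep", "discard", "discard",
               "keep", "discard", "keep", "keep", "discard", "discard"]

kappa = cohen_kappa_score(annotator_1, annotator_2)
print(f"Cohen's kappa = {kappa:.2f}")  # chance-corrected agreement in [-1, 1]
```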
minor comments (1)
- [Abstract / Introduction] The abstract states coverage of 'four representative domains' but does not name them or justify representativeness; this should be clarified early in the introduction for reader orientation.
Simulated Author's Rebuttal
We thank the referee for their constructive feedback on our manuscript. We address each major comment below and will revise the manuscript to strengthen the presentation of the benchmark construction and error analysis.
Point-by-point responses
- Referee: [Benchmark construction] Benchmark construction section: the claim that the 6,161 questions are 'high-quality' and free of systematic bias in domain sampling or annotation requires explicit details on the question-generation pipeline, human annotation guidelines, filtering criteria, and any inter-annotator agreement metrics; without these, the central validity of the benchmark cannot be assessed.
Authors: We agree that explicit details on the question-generation pipeline, human annotation guidelines, filtering criteria, and inter-annotator agreement metrics are required to fully substantiate the claims of high quality and absence of systematic bias. While the manuscript outlines the overall construction process across the four domains, we will expand the Benchmark Construction section in the revision to provide these specifics, including the step-by-step pipeline, guidelines provided to annotators, filtering rules applied, and any agreement statistics computed. revision: yes
- Referee: [Error analysis] Error analysis section: the six error types need quantitative distributions across the dataset plus concrete examples tied to specific questions to demonstrate they comprehensively and non-overlappingly capture failure modes; current attribution of MLLM failures rests on this taxonomy being exhaustive.
Authors: We acknowledge that quantitative distributions of the six error types and concrete examples linked to specific questions are necessary to demonstrate that the taxonomy is comprehensive and non-overlapping. In the revised manuscript, we will add a table reporting the distribution of each error type across the full set of 6,161 questions and include representative examples from the dataset for each type, with direct ties to the questions and model responses to illustrate the distinctions and coverage of failure modes. revision: yes
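The promised distribution table is straightforward to produce once each judged failure carries a primary label from the paper's six-type taxonomy (Global Assignment Drift, Step-State Confusion, Fine-Detail Miss, Semantic Over-Interpretation, Visual Hallucination, Instruction Violation). A minimal tally sketch, with illustrative counts only:

```python
from collections import Counter

# Hypothetical primary error labels assigned to judged failures; the label
# names follow the paper's taxonomy, the counts are illustrative only.
primary_errors = [
    "Global Assignment Drift", "Fine-Detail Miss", "Step-State Confusion",
    "Fine-Detail Miss", "Semantic Over-Interpretation", "Visual Hallucination",
    "Fine-Detail Miss", "Instruction Violation", "Global Assignment Drift",
]

counts = Counter(primary_errors)
total = sum(counts.values())
for error_type, n in counts.most_common():
    print(f"{error_type:<30} {n:>3}  ({n / total:.1%})")
```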
Circularity Check
No circularity in benchmark construction
full rationale
The paper proposes COHERENCE as a new benchmark consisting of 6,161 questions drawn from four domains, accompanied by a six-type error taxonomy. No equations, parameter fits, predictions, or derivations are present that could reduce to the inputs by construction. The central claim rests on the independent creation and curation of the dataset itself rather than on any self-referential chain or renamed prior result. Minor self-citation risk is possible in related work sections but is not load-bearing for the benchmark's validity or reported findings.
Axiom & Free-Parameter Ledger
axioms (1)
- [Domain assumption] Interleaved image-text contexts require specific fine-grained alignment and contextual reasoning beyond single-image or multi-image comprehension.
[54]
The surrounding text describes what should be shown in that image
**Read the text carefully**: Each [IMAGE_PLACEHOLDER] appears within a specific context. The surrounding text describes what should be shown in that image
-
[55]
**Analyze each placeholder**: For each placeholder (in order from first to last), identify what the nearby text is describing - this tells you what the image should show
-
[56]
**Match images to placeholders**: Look at the {len(image_sequence)} candidate images provided and determine which image best matches the context around each placeholder
-
[57]
, ".join([
**Important**: The same image index can only be used once. Each placeholder needs a different image. ## Output Format: First reason step by step, then output your final answer on the LAST line as a Python list: - Format: [{", ".join(["index" + str(i) for i in range(num_placeholders)])}] - The list position corresponds to the placeholder order (first place...
-
[58]
one paragraph explaining why the prediction is wrong,
-
[59]
one primary error type,
-
[60]
This is a long-context interleaved image-text assignment task
two secondary error types. This is a long-context interleaved image-text assignment task. You must jointly evaluate text evidence, image evidence, and structural constraints. Do not judge from local fragments only. Use exactly one label for primary_error_type and exactly two labels for secondary_error_types. Allowed labels (exact string match) and interpr...
-
[61]
Global Assignment Drift: Local image-text pairings may look reasonable, but the final mapping is globally inconsistent across the full article (e.g., systematic shift, wrong overall alignment, boundary drift)
-
[62]
Step-State Confusion: The mismatch mainly comes from mixing up nearby steps/states that are semantically close or visually similar
-
[63]
Fine-Detail Miss: The mismatch is caused by missing decisive fine-grained cues (small objects, subtle state changes, local attributes, tool/action details)
-
[64]
Semantic Over-Interpretation: The mismatch is driven by reading more meaning into an image than the visible evidence supports, then forcing that interpretation into alignment
-
[65]
Visual Hallucination: The reasoning relies on visual elements that are not actually present in the image
-
[66]
error_reason
Instruction Violation: The output breaks task constraints or format requirements (e.g., invalid list format, duplicate image index use, missing/extra assignments, illegal indices). [Original Prompt] {prompt_text (including text_with_placeholders and image_list)} [Model Output] {raw_output} [Prediction] {prediction} [Gold Answer] {answer} Return exactly on...
-
[67]
Time to watch Birdflix
Show starts→TV screen with logo (Image 0) 2. Excitement→Bird cheering (Image 3) 3. Shock→ Bird surprised (Image 2) 4. Despair→Bird upset (Image 1) This seems logical. Let’s verify: - Image 0: TV screen with logo — matches "Time to watch Birdflix" (the show is starting). - Image 3: Bird smiling — matches "Go team go!" (happy, cheering). - Image 2: Bird sho...
-
[71]
error_reason
**Fourth placeholder context**: The section explains that trauma, PTSD, and related mental health conditions increase false memories and False Memory Syndrome. Image 3 matches this: it shows a distressed, sad person, visually representing the distress associated with trauma and related mental health struggles. [0, 1, 2, 3] Error Analysis "error_reason": "...
-
[72]
Image 0 matches this best: it shows a person with a thought bubble containing a blurry, fuzzy letter, visually representing fuzzy gist memory
**First placeholder context**: The section explains that memories are constructive, introduces verbatim/gist memory types, and fuzzy trace theory where fuzzy, meaning memories are often blurry and not perfectly precise. Image 0 matches this best: it shows a person with a thought bubble containing a blurry, fuzzy letter, visually representing fuzzy gist memory
-
[73]
Image 1 matches this: it shows one person whispering to another, directly representing another person sharing information/suggestions to someone else
**Second placeholder context**: The section discusses how suggestions from other people (like leading questions from police/prosecutors) lead to false memories. Image 1 matches this: it shows one person whispering to another, directly representing another person sharing information/suggestions to someone else
-
[74]
**Third placeholder context**: The section covers how active imagination and creativity lead to more false memories, as creative people add details to imagined scenarios. Image 2 matches this: it shows a person with a thought bubble containing lightbulbs (representing ideas/imagination) around a head silhouette, visually representing creative thought and ...
-
[75]
error_reason
**Fourth placeholder context**: The section explains that trauma, PTSD, and related mental health conditions increase false memories and False Memory Syndrome. Image 3 matches this: it shows a distressed, sad person, visually representing the distress associated with trauma and related mental health struggles. [1, 0, 3, 2] Error Analysis "error_reason": "...
-
[76]
**First placeholder**: The surrounding text is about envisioning your desired future goals (success, happiness, financial security etc.). Image 1 perfectly matches this, as it shows a person thinking about a trophy (success), money bag (financial stability), and smiley face (happiness), all typical desired outcomes for life changes
-
[77]
Image 0 shows a person thinking of a book, directly matching the author/book example in this section
**Second placeholder**: The text uses becoming a famous author as the core example for planning lifestyle change steps, centered on writing books. Image 0 shows a person thinking of a book, directly matching the author/book example in this section
-
[78]
Image 4 depicts putting a coin into a piggy bank, which is a direct visual representation of saving money, so it matches this section
**Third placeholder**: The section focuses on saving money as a safety net for life changes. Image 4 depicts putting a coin into a piggy bank, which is a direct visual representation of saving money, so it matches this section
-
[79]
looking up information online, studying your new path)
**Fourth placeholder**: The text advises researching and getting educated about your planned change (e.g. looking up information online, studying your new path). Image 3 shows a person working on a computer, which aligns with researching/studying to get informed about the change
-
[80]
Image 2 shows a person waving goodbye to another person, which fits the description of polite, amicable separation from old connections
**Fifth placeholder**: The section covers respectfully leaving your old life and saying goodbye to old ties without burning bridges. Image 2 shows a person waving goodbye to another person, which fits the description of polite, amicable separation from old connections
-
[81]
error_reason
**Sixth placeholder**: The text is about committing to daily action to make your new life a reality, resulting in a full transformation. Image 5 shows a before-and-after of the same person (old self vs new transformed self), which directly represents the successful life change outcome described here. [1, 0, 4, 3, 2, 5] Error Analysis "error_reason": "The ...
-
[82]
This is the last, and simplest of three cocktail recipes in my Edible Cucumber Cup Cocktail series
**First placeholder: "This is the last, and simplest of three cocktail recipes in my Edible Cucumber Cup Cocktail series."** - This is the introductory sentence. The context is about the "Edible Cucumber Cup Cocktail series" and the "last" recipe. - The image should show the finished product of this recipe. The text mentions "sake sized cups" and "fill ’e...
-
[83]
** - This text describes the action of filling the cucumber cups with sake. The key detail is
**Second placeholder: "Fill ’em Up! Once you’ve hollowed your sake sized cups (for instructions on how to make the cucumber cups, click here ), using a clean baster, fill cups with your favorite kind of cold sake! *sake is available at most places that wine and liquor are sold"** - This text describes the action of filling the cucumber cups with sake. The...