Responses Fall Short of Understanding: Revealing the Gap between Internal Representations and Responses in Visual Document Understanding
Pith reviewed 2026-05-10 19:57 UTC · model grok-4.3
The pith
Large vision-language models often encode task information better in intermediate layers than in the final layer for visual document understanding.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central finding is a clear gap between internal representations and generated responses in LVLMs on VDU tasks: the information required to solve a task is often more linearly decodable from intermediate layers than from the final layer. Fine-tuning that targets intermediate layers improves both linear-probing accuracy and response accuracy while narrowing the gap.
What carries the argument
Linear probing on the layers of the LLM within LVLMs, which measures how well task information can be extracted linearly from each layer's representations.
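To make the machinery concrete, here is a minimal probing sketch: extract each layer's hidden states from a HuggingFace-style LVLM, pool at the last token, and fit one logistic-regression probe per layer. The pooling choice, probe class, and data handling are illustrative assumptions, not the paper's exact protocol.

```python
# Sketch: layer-wise linear probing of the LLM inside an LVLM.
# Assumptions (not the paper's exact protocol): a HuggingFace-style model that
# returns hidden_states, last-token pooling, and logistic-regression probes.
import numpy as np
import torch
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

@torch.no_grad()
def collect_layer_features(model, batches):
    """Pool each layer's hidden state at the last token for every example."""
    per_layer, labels = None, []
    for inputs, y in batches:                      # y: task labels for the batch
        out = model(**inputs, output_hidden_states=True)
        # out.hidden_states: tuple of (num_layers + 1) tensors of shape [B, T, D]
        # (last-token pooling assumes left padding or unpadded batches)
        feats = [h[:, -1, :].float().cpu().numpy() for h in out.hidden_states]
        if per_layer is None:
            per_layer = [[] for _ in feats]
        for store, f in zip(per_layer, feats):
            store.append(f)
        labels.extend(y)
    return [np.concatenate(s) for s in per_layer], np.array(labels)

def probe_accuracy_per_layer(X_layers, y, seed=0):
    """Fit one logistic-regression probe per layer; return held-out accuracies."""
    accs = []
    for X in X_layers:
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=0.2, random_state=seed, stratify=y)
        probe = LogisticRegression(max_iter=2000).fit(X_tr, y_tr)
        accs.append(probe.score(X_te, y_te))
    return accs  # a peak before the final layer matches the paper's finding
```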
If this is right
- Fine-tuning intermediate layers leads to better response accuracy on VDU tasks.
- The gap between internal knowledge and output can be reduced by focusing on middle layers.
- Task-relevant information is not always best represented at the model's final layer.
- Models can achieve improved performance without altering the final layer directly.
Where Pith is reading between the lines
- Response-only evaluations may underestimate how much these models actually understand.
- Architectures could be designed to better access and use intermediate layer information during generation.
- Similar probing methods could diagnose representation gaps in other vision-language tasks.
- Training protocols might incorporate layer-specific objectives to optimize information flow.
Load-bearing premise
Linear probing on the LLM layers within LVLMs accurately measures whether the model has internally captured the information required to solve VDU tasks.
What would settle it
Observing that fine-tuning only the final layer improves response accuracy more than fine-tuning intermediate layers, or finding no correlation between probing accuracy and actual task performance.
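The second falsifier is easy to operationalize: collect the best layer-wise probe accuracy and the response accuracy per task and test whether they covary. A minimal sketch with hypothetical placeholder numbers:

```python
# Sketch: does probe accuracy track response accuracy across tasks?
# All numbers below are hypothetical placeholders, not results from the paper.
from scipy.stats import pearsonr

best_probe_acc = [0.81, 0.74, 0.69, 0.88]  # best layer's probe accuracy, per task
response_acc = [0.62, 0.58, 0.55, 0.79]    # generated-response accuracy, per task

gaps = [bp - ra for bp, ra in zip(best_probe_acc, response_acc)]
r, p = pearsonr(best_probe_acc, response_acc)
print(f"probe-response gaps per task: {gaps}")
print(f"Pearson r = {r:.2f} (p = {p:.3f})")
# A near-zero r across many tasks would undercut probing as a proxy.
```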
Original abstract
Visual document understanding (VDU) is a challenging task for large vision language models (LVLMs), requiring the integration of visual perception, text recognition, and reasoning over structured layouts. Although recent LVLMs have shown progress on VDU benchmarks, their performance is typically evaluated based on generated responses, which may not necessarily reflect whether the model has actually captured the required information internally. In this paper, we investigate how information required to solve VDU tasks is represented across different layers of LLMs within LVLMs using linear probing. Our study reveals that (1) there is a clear gap between internal representations and generated responses, and (2) information required to solve the task is often encoded more linearly from intermediate layers than from the final layer. Motivated by these findings, we explore fine-tuning strategies that target intermediate layers. Experiments show that fine-tuning intermediate layers improves both linear probing accuracy and response accuracy while narrowing the gap.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript investigates visual document understanding (VDU) in large vision-language models (LVLMs) by applying linear probing to the internal layers of the embedded LLM. It claims to demonstrate a gap between the information encoded in these representations and the content of the model's generated responses, reports that task-relevant information is often more linearly separable in intermediate layers than in the final layer, and shows that fine-tuning strategies targeting intermediate layers improve both linear-probing accuracy and response accuracy while narrowing the observed gap.
Significance. If the central empirical findings are robust, the work provides a useful diagnostic for why output-based evaluation may underestimate internal capabilities in LVLMs and offers a practical, layer-targeted fine-tuning approach that could improve performance on structured VDU tasks with lower computational cost than full-model updates. The emphasis on probing across layers adds a mechanistic lens to VDU research that is currently underrepresented.
major comments (2)
- [Abstract and Methods] The central claim that a 'clear gap' exists between internal representations and generated responses, and that intermediate layers encode task information 'more linearly,' rests on linear probing accuracy as a proxy for whether the model has internally captured the information needed for VDU. Linear probes detect linear separability but do not test whether the probed features are causally routed or transformed by the model's non-linear attention and feed-forward layers during autoregressive generation; without additional causal interventions (e.g., activation patching or layer-specific ablations), the gap and the benefits of intermediate-layer tuning could be measurement artifacts rather than evidence of a true representational mismatch.
- [Experiments] The reported improvements in both probing accuracy and response accuracy after intermediate-layer fine-tuning are presented without detailed baselines (e.g., final-layer-only tuning, random-layer tuning, or full-model LoRA), without error bars or statistical tests across multiple seeds, and without explicit controls for the number of trainable parameters. These omissions make it difficult to determine whether the narrowing of the gap is specifically attributable to targeting intermediate layers or to other confounding factors in the fine-tuning protocol.
minor comments (2)
- [Methods] The manuscript would benefit from a clearer description of the exact VDU datasets and task formulations used for probing and fine-tuning, including how ground-truth labels are constructed for the linear probes.
- [Figures] Figure captions and axis labels should explicitly state the probing classifier (e.g., logistic regression) and the exact metric (accuracy, F1) being plotted to avoid ambiguity when comparing layers.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive report. The comments highlight important methodological considerations regarding the interpretation of linear probing results and the rigor of our experimental comparisons. We address each point below and have revised the manuscript accordingly to improve clarity and robustness.
Point-by-point responses
- Referee: [Abstract and Methods] The central claim that a 'clear gap' exists between internal representations and generated responses, and that intermediate layers encode task information 'more linearly,' rests on linear probing accuracy as a proxy... without additional causal interventions (e.g., activation patching or layer-specific ablations), the gap and the benefits of intermediate-layer tuning could be measurement artifacts rather than evidence of a true representational mismatch.
  Authors: We agree that linear probing assesses linear separability and does not directly demonstrate causal routing through the model's non-linear components during generation. Our use of probing follows standard practice in mechanistic interpretability to quantify what information is linearly decodable at each layer, which is sufficient to reveal the observed discrepancy with final outputs. We have added an explicit limitations paragraph to the revised manuscript acknowledging this distinction and noting that causal interventions such as activation patching would be a valuable extension. Where feasible within compute constraints, we include layer-wise ablation results showing that masking intermediate-layer representations degrades performance more than masking the final layer, providing supplementary evidence beyond pure correlation.
  Revision: partial
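The layer-wise ablation the authors describe can be sketched as a forward hook that skips one decoder layer's contribution at inference time. The module path (model.model.layers) and the tuple-shaped layer output follow common HuggingFace decoder layouts; both are assumptions here, not the paper's code.

```python
# Sketch: skip one decoder layer's contribution at inference time with a
# forward hook; compare response accuracy with and without the ablation.
import torch

def ablate_layer(model, layer_idx):
    """Make layer `layer_idx` act as the identity: its input passes through."""
    layer = model.model.layers[layer_idx]   # assumed LLaMA-style module path

    def hook(module, args, output):
        hidden_in = args[0]               # the layer's input hidden states
        if isinstance(output, tuple):     # decoder layers often return tuples
            return (hidden_in,) + output[1:]
        return hidden_in

    return layer.register_forward_hook(hook)

# Usage: evaluate with and without the ablation, then remove the hook.
# handle = ablate_layer(model, layer_idx=18)
# ...run generation and scoring...
# handle.remove()
```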
- Referee: [Experiments] The reported improvements in both probing accuracy and response accuracy after intermediate-layer fine-tuning are presented without detailed baselines (e.g., final-layer-only tuning, random-layer tuning, or full-model LoRA), without error bars or statistical tests across multiple seeds, and without explicit controls for the number of trainable parameters.
  Authors: We appreciate this observation and have revised the Experiments section to include the requested controls. The updated manuscript now reports: (1) direct comparisons against final-layer-only LoRA, random-layer selection, and full-model LoRA; (2) means and standard deviations across three random seeds with paired t-tests for significance; and (3) matched parameter budgets, with LoRA rank tuned per condition so that the number of trainable parameters remains comparable. These additions confirm that the performance gains and the gap reduction are specifically associated with targeting intermediate layers rather than with differences in optimization or capacity.
  Revision: yes
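A sketch of how intermediate-layer LoRA with a matched parameter budget might be configured, assuming the PEFT library; the layer band, rank, and target modules are illustrative assumptions, not the paper's settings.

```python
# Sketch: restrict LoRA adapters to a band of intermediate decoder layers via
# PEFT's layers_to_transform option. Band, rank, and targets are hypothetical.
from peft import LoraConfig, get_peft_model

intermediate_band = list(range(12, 20))  # hypothetical middle-layer indices

config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    layers_to_transform=intermediate_band,  # adapters only on these layers
    task_type="CAUSAL_LM",
)
peft_model = get_peft_model(model, config)   # `model` is the LVLM's LLM
peft_model.print_trainable_parameters()      # verify the matched budget
```

Matching parameter budgets across conditions then reduces to scaling the rank inversely with the number of transformed layers, e.g. doubling r when halving the band.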
Circularity Check
No circularity in empirical probing and fine-tuning study
full rationale
The paper conducts an empirical study that applies linear probing to hidden states from different LLM layers inside LVLMs on VDU tasks, reports observed gaps between probe accuracy and generated-response accuracy, and then fine-tunes intermediate layers to measure the resulting changes in both metrics. No equations, parameter-fitting steps, or self-citations are used to derive the central claims; the results follow directly from the experimental measurements and interventions described. By construction, the work therefore contains no load-bearing reduction of its predictions to fitted inputs or to prior self-work.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: linear probing accuracy reflects the presence of task-relevant information in model representations.
Forward citations
Cited by 1 Pith paper
- Senses Wide Shut: A Representation-Action Gap in Omnimodal LLMs
Omnimodal LLMs encode premise-perception mismatches in hidden states yet almost never reject false textual claims, exposing a representation-action gap that is modality-asymmetric and prompt-resistant.