Responses Fall Short of Understanding: Revealing the Gap between Internal Representations and Responses in Visual Document Understanding
Pith reviewed 2026-05-10 19:57 UTC · model grok-4.3
The pith
Large vision-language models often encode task information better in intermediate layers than in the final layer for visual document understanding.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central finding is a clear gap between internal representations and generated responses in LVLMs on VDU tasks: the information required to solve a task is often more linearly decodable from intermediate layers than from the final layer. Fine-tuning that targets intermediate layers improves both linear-probing accuracy and response accuracy while narrowing the gap.
What carries the argument
Linear probing on the layers of the LLM within LVLMs, which measures how well task information can be extracted linearly from each layer's representations.
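To make the machinery concrete, here is a minimal probing sketch: extract each layer's hidden states from a HuggingFace-style LVLM, pool at the last token, and fit one logistic-regression probe per layer. The pooling choice, probe class, and data handling are illustrative assumptions, not the paper's exact protocol.

```python
# Sketch: layer-wise linear probing of the LLM inside an LVLM.
# Assumptions (not the paper's exact protocol): a HuggingFace-style model that
# returns hidden_states, last-token pooling, and logistic-regression probes.
import numpy as np
import torch
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

@torch.no_grad()
def collect_layer_features(model, batches):
    """Pool each layer's hidden state at the last token for every example."""
    per_layer, labels = None, []
    for inputs, y in batches:                      # y: task labels for the batch
        out = model(**inputs, output_hidden_states=True)
        # out.hidden_states: tuple of (num_layers + 1) tensors of shape [B, T, D]
        # (last-token pooling assumes left padding or unpadded batches)
        feats = [h[:, -1, :].float().cpu().numpy() for h in out.hidden_states]
        if per_layer is None:
            per_layer = [[] for _ in feats]
        for store, f in zip(per_layer, feats):
            store.append(f)
        labels.extend(y)
    return [np.concatenate(s) for s in per_layer], np.array(labels)

def probe_accuracy_per_layer(X_layers, y, seed=0):
    """Fit one logistic-regression probe per layer; return held-out accuracies."""
    accs = []
    for X in X_layers:
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=0.2, random_state=seed, stratify=y)
        probe = LogisticRegression(max_iter=2000).fit(X_tr, y_tr)
        accs.append(probe.score(X_te, y_te))
    return accs  # a peak before the final layer matches the paper's finding
```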
If this is right
- Fine-tuning intermediate layers leads to better response accuracy on VDU tasks.
- The gap between internal knowledge and output can be reduced by focusing on middle layers.
- Task-relevant information is not always best represented at the model's final layer.
- Models can achieve improved performance without altering the final layer directly.
Where Pith is reading between the lines
- Response-only evaluations may underestimate how much these models actually understand.
- Architectures could be designed to better access and use intermediate layer information during generation.
- Similar probing methods could diagnose representation gaps in other vision-language tasks.
- Training protocols might incorporate layer-specific objectives to optimize information flow.
Load-bearing premise
Linear probing on the LLM layers within LVLMs accurately measures whether the model has internally captured the information required to solve VDU tasks.
What would settle it
Observing that fine-tuning only the final layer improves response accuracy more than fine-tuning intermediate layers, or finding no correlation between probing accuracy and actual task performance.
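The second falsifier is easy to operationalize: collect the best layer-wise probe accuracy and the response accuracy per task and test whether they covary. A minimal sketch with hypothetical placeholder numbers:

```python
# Sketch: does probe accuracy track response accuracy across tasks?
# All numbers below are hypothetical placeholders, not results from the paper.
from scipy.stats import pearsonr

best_probe_acc = [0.81, 0.74, 0.69, 0.88]  # best layer's probe accuracy, per task
response_acc = [0.62, 0.58, 0.55, 0.79]    # generated-response accuracy, per task

gaps = [bp - ra for bp, ra in zip(best_probe_acc, response_acc)]
r, p = pearsonr(best_probe_acc, response_acc)
print(f"probe-response gaps per task: {gaps}")
print(f"Pearson r = {r:.2f} (p = {p:.3f})")
# A near-zero r across many tasks would undercut probing as a proxy.
```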
Original abstract
Visual document understanding (VDU) is a challenging task for large vision language models (LVLMs), requiring the integration of visual perception, text recognition, and reasoning over structured layouts. Although recent LVLMs have shown progress on VDU benchmarks, their performance is typically evaluated based on generated responses, which may not necessarily reflect whether the model has actually captured the required information internally. In this paper, we investigate how information required to solve VDU tasks is represented across different layers of LLMs within LVLMs using linear probing. Our study reveals that (1) there is a clear gap between internal representations and generated responses, and (2) information required to solve the task is often encoded more linearly from intermediate layers than from the final layer. Motivated by these findings, we explore fine-tuning strategies that target intermediate layers. Experiments show that fine-tuning intermediate layers improves both linear probing accuracy and response accuracy while narrowing the gap.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript investigates visual document understanding (VDU) in large vision-language models (LVLMs) by applying linear probing to the internal layers of the embedded LLM. It claims to demonstrate a gap between the information encoded in these representations and the content of the model's generated responses, reports that task-relevant information is often more linearly separable in intermediate layers than in the final layer, and shows that fine-tuning strategies targeting intermediate layers improve both linear-probing accuracy and response accuracy while narrowing the observed gap.
Significance. If the central empirical findings are robust, the work provides a useful diagnostic for why output-based evaluation may underestimate internal capabilities in LVLMs and offers a practical, layer-targeted fine-tuning approach that could improve performance on structured VDU tasks with lower computational cost than full-model updates. The emphasis on probing across layers adds a mechanistic lens to VDU research that is currently underrepresented.
major comments (2)
- [Abstract and Methods] The central claim that a 'clear gap' exists between internal representations and generated responses, and that intermediate layers encode task information 'more linearly,' rests on linear probing accuracy as a proxy for whether the model has internally captured the information needed for VDU. Linear probes detect linear separability but do not test whether the probed features are causally routed or transformed by the model's non-linear attention and feed-forward layers during autoregressive generation; without additional causal interventions (e.g., activation patching or layer-specific ablations), the gap and the benefits of intermediate-layer tuning could be measurement artifacts rather than evidence of a true representational mismatch.
- [Experiments] The reported improvements in both probing accuracy and response accuracy after intermediate-layer fine-tuning are presented without detailed baselines (e.g., final-layer-only tuning, random-layer tuning, or full-model LoRA), without error bars or statistical tests across multiple seeds, and without explicit controls for the number of trainable parameters. These omissions make it difficult to determine whether the narrowing of the gap is specifically attributable to targeting intermediate layers or to other confounding factors in the fine-tuning protocol.
minor comments (2)
- [Methods] The manuscript would benefit from a clearer description of the exact VDU datasets and task formulations used for probing and fine-tuning, including how ground-truth labels are constructed for the linear probes.
- [Figures] Figure captions and axis labels should explicitly state the probing classifier (e.g., logistic regression) and the exact metric (accuracy, F1) being plotted to avoid ambiguity when comparing layers.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive report. The comments highlight important methodological considerations regarding the interpretation of linear probing results and the rigor of our experimental comparisons. We address each point below and have revised the manuscript accordingly to improve clarity and robustness.
Point-by-point responses
- Referee: [Abstract and Methods] The central claim that a 'clear gap' exists between internal representations and generated responses, and that intermediate layers encode task information 'more linearly,' rests on linear probing accuracy as a proxy... without additional causal interventions (e.g., activation patching or layer-specific ablations), the gap and the benefits of intermediate-layer tuning could be measurement artifacts rather than evidence of a true representational mismatch.
  Authors: We agree that linear probing assesses linear separability and does not directly demonstrate causal routing through the model's non-linear components during generation. Our use of probing follows standard practice in mechanistic interpretability to quantify what information is linearly decodable at each layer, which is sufficient to reveal the observed discrepancy with final outputs. We have added an explicit limitations paragraph to the revised manuscript acknowledging this distinction and noting that causal interventions such as activation patching would be a valuable extension. Where feasible within compute constraints, we include layer-wise ablation results showing that masking intermediate-layer representations degrades performance more than masking the final layer, providing supplementary evidence beyond pure correlation.
  Revision: partial
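The layer-wise ablation the authors describe can be sketched as a forward hook that skips one decoder layer's contribution at inference time. The module path (model.model.layers) and the tuple-shaped layer output follow common HuggingFace decoder layouts; both are assumptions here, not the paper's code.

```python
# Sketch: skip one decoder layer's contribution at inference time with a
# forward hook; compare response accuracy with and without the ablation.
import torch

def ablate_layer(model, layer_idx):
    """Make layer `layer_idx` act as the identity: its input passes through."""
    layer = model.model.layers[layer_idx]   # assumed LLaMA-style module path

    def hook(module, args, output):
        hidden_in = args[0]               # the layer's input hidden states
        if isinstance(output, tuple):     # decoder layers often return tuples
            return (hidden_in,) + output[1:]
        return hidden_in

    return layer.register_forward_hook(hook)

# Usage: evaluate with and without the ablation, then remove the hook.
# handle = ablate_layer(model, layer_idx=18)
# ...run generation and scoring...
# handle.remove()
```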
- Referee: [Experiments] The reported improvements in both probing accuracy and response accuracy after intermediate-layer fine-tuning are presented without detailed baselines (e.g., final-layer-only tuning, random-layer tuning, or full-model LoRA), without error bars or statistical tests across multiple seeds, and without explicit controls for the number of trainable parameters.
  Authors: We appreciate this observation and have revised the Experiments section to include the requested controls. The updated manuscript now reports: (1) direct comparisons against final-layer-only LoRA, random-layer selection, and full-model LoRA; (2) means and standard deviations across three random seeds with paired t-tests for significance; and (3) matched parameter budgets, with LoRA rank tuned per condition so that the number of trainable parameters remains comparable. These additions confirm that the performance gains and the gap reduction are specifically associated with targeting intermediate layers rather than with differences in optimization or capacity.
  Revision: yes
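A sketch of how intermediate-layer LoRA with a matched parameter budget might be configured, assuming the PEFT library; the layer band, rank, and target modules are illustrative assumptions, not the paper's settings.

```python
# Sketch: restrict LoRA adapters to a band of intermediate decoder layers via
# PEFT's layers_to_transform option. Band, rank, and targets are hypothetical.
from peft import LoraConfig, get_peft_model

intermediate_band = list(range(12, 20))  # hypothetical middle-layer indices

config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    layers_to_transform=intermediate_band,  # adapters only on these layers
    task_type="CAUSAL_LM",
)
peft_model = get_peft_model(model, config)   # `model` is the LVLM's LLM
peft_model.print_trainable_parameters()      # verify the matched budget
```

Matching parameter budgets across conditions then reduces to scaling the rank inversely with the number of transformed layers, e.g. doubling r when halving the band.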
Circularity Check
No circularity in empirical probing and fine-tuning study
full rationale
The paper conducts an empirical study that applies linear probing to hidden states from different LLM layers inside LVLMs on VDU tasks, reports observed gaps between probe accuracy and generated-response accuracy, and then fine-tunes intermediate layers to measure the resulting changes in both metrics. No equations, parameter-fitting steps, or self-citations are used to derive the central claims; the results follow directly from the experimental measurements and interventions described. By construction, the work therefore contains no load-bearing reduction of its predictions to fitted inputs or to prior self-work.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: linear probing accuracy reflects the presence of task-relevant information in model representations.
Forward citations
Cited by 1 Pith paper
- Senses Wide Shut: A Representation-Action Gap in Omnimodal LLMs
Omnimodal LLMs encode premise-perception mismatches in hidden states yet almost never reject false textual claims, exposing a representation-action gap that is modality-asymmetric and prompt-resistant.