pith. sign in

arxiv: 2603.09095 · v2 · pith:ZHZ4H5KKnew · submitted 2026-03-10 · 💻 cs.CL · cs.CV

Reading, Not Thinking: Understanding and Bridging the Modality Gap When Text Becomes Pixels in Multimodal LLMs

classification 💻 cs.CL cs.CV
keywords modelsreasoningtextimagesmodalitymultimodalperformanceunderstanding
0
0 comments X
read the original abstract

Multimodal large language models (MLLMs) can process text presented as images, yet they often perform worse than when the same content is provided as textual tokens. We systematically diagnose this "modality gap" by evaluating seven MLLMs across seven benchmarks in five input modes, spanning both synthetically rendered text and realistic document images from arXiv PDFs to Wikipedia pages. We find that the gap is highly sensitive to rendering choices such as font and resolution, and that natural document images often exhibit much smaller gaps, suggesting the performance difference partly reflects evaluation artifacts rather than fundamental limitations. Through a grounded-theory error analysis of over 4,000 examples, we identify the primary cause: image input alone suppresses reasoning effort, with models producing 5--19x shorter outputs that skip step-by-step computation or reasoning. The reluctance to reason, not a failure of perception or knowledge retrieval, drives the performance gap, particularly on tasks requiring multi-step reasoning. We show that a simple, lightweight on-policy self-distillation method by fine-tuning models on their own text-mode reasoning traces paired with image inputs closes this gap, raising image-mode accuracy to match or exceed text-mode performance with over 50\% improvement, and the gains transfer to unseen benchmarks without catastrophic forgetting. Overall, our results and analyses provide a systematic understanding of the modality gap and suggest a practical path toward improving visual text understanding in multimodal language models.

This paper has not been read by Pith yet.

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Safety Geometry Collapse in Multimodal LLMs and Adaptive Drift Correction

    cs.AI 2026-05 unverdicted novelty 5.0

    Multimodal LLMs suffer Safety Geometry Collapse from modality-induced drift that reduces refusal separability; ReGap corrects drift at inference time using self-rectification signals to restore safety without retraining.

  2. LensVLM: Selective Context Expansion for Compressed Visual Representation of Text

    cs.CV 2026-05 unverdicted novelty 5.0

    LensVLM trains VLMs to scan compressed rendered text images and selectively expand task-relevant regions, achieving 4.3x compression with near full-text accuracy and outperforming baselines up to 10.1x on text QA benchmarks.