For page-like images, focus on the main text or core information rather than layout or decorative elements.

Detail questions should prioritize information that is valuable for understanding the image content and avoid unnecessary details or details with low training value.

1 Pith paper cites this work. Polarity classification is still indexing.

cs.CL · 2026-03-17 · unverdicted · novelty 5.0

Knowledge density in image captions, not task format diversity, is the primary driver of multimodal LLM scaling performance.

Caption First, VQA Second: Knowledge Density, Not Task Format, Drives Multimodal Scaling cs.CL · 2026-03-17 · unverdicted · none · ref 12