Towards VQA Models That Can Read

Amanpreet Singh; Devi Parikh; Dhruv Batra; Marcus Rohrbach; Meet Shah; Vivek Natarajan; Xinlei Chen; Yu Jiang

arxiv: 1904.08920 · v2 · pith:DD6HZH6Anew · submitted 2019-04-18 · 💻 cs.CL · cs.CV · cs.LG

Amanpreet Singh, Vivek Natarajan, Meet Shah, Yu Jiang, Xinlei Chen, Dhruv Batra, Devi Parikh, Marcus Rohrbach

classification 💻 cs.CL cs.CV cs.LG

keywords image text textvqa dataset answer models questions read

Studies have shown that a dominant class of questions asked by visually impaired users on images of their surroundings involves reading text in the image. But today's VQA models cannot read! Our paper takes a first step towards addressing this problem. First, we introduce a new "TextVQA" dataset to facilitate progress on this important problem. Existing datasets either have a small proportion of questions about text (e.g., the VQA dataset) or are too small (e.g., the VizWiz dataset). TextVQA contains 45,336 questions on 28,408 images that require reasoning about text to answer. Second, we introduce a novel model architecture that reads text in the image, reasons about it in the context of the image and the question, and predicts an answer which might be a deduction based on the text and the image or composed of the strings found in the image. Consequently, we call our approach Look, Read, Reason & Answer (LoRRA). We show that LoRRA outperforms existing state-of-the-art VQA models on our TextVQA dataset. We find that the gap between human performance and machine performance is significantly larger on TextVQA than on VQA 2.0, suggesting that TextVQA is well-suited to benchmark progress along directions complementary to VQA 2.0.
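
The abstract says a predicted answer may be either a deduction or composed of strings found in the image. One way to picture that choice, as a minimal sketch and not the authors' implementation, is a single score vector over a fixed answer vocabulary followed by the OCR tokens read from the image. All names here (`select_answer`, `vocab`, `ocr_tokens`, `scores`) are hypothetical, and the scores are hand-written rather than produced by a model.

```python
from typing import List, Sequence


def select_answer(
    scores: Sequence[float],
    vocab: List[str],
    ocr_tokens: List[str],
) -> str:
    """Pick the top-scoring answer from vocab followed by OCR tokens.

    Assumption (illustrative only): scores[i] for i < len(vocab) rates
    vocab[i]; the remaining entries rate the OCR tokens in order.
    """
    if len(scores) != len(vocab) + len(ocr_tokens):
        raise ValueError("need one score per vocabulary entry and OCR token")
    if not scores:
        raise ValueError("no candidate answers to choose from")

    # Index of the highest score; ties resolve to the earliest index.
    best = max(range(len(scores)), key=lambda i: scores[i])

    if best < len(vocab):
        # Answer comes from the fixed vocabulary (a "deduced" answer).
        return vocab[best]
    # Answer is copied from a string read in the image.
    return ocr_tokens[best - len(vocab)]


vocab = ["yes", "no", "red"]
ocr_tokens = ["stop", "main", "st"]

# Highest score falls on the first OCR slot, so the token is copied.
copied = select_answer([0.1, 0.2, 0.05, 0.9, 0.3, 0.1], vocab, ocr_tokens)

# Highest score falls inside the fixed vocabulary.
deduced = select_answer([0.1, 0.8, 0.05, 0.2, 0.3, 0.1], vocab, ocr_tokens)
```

With these hand-written scores, `copied` is taken from the OCR tokens and `deduced` from the vocabulary. Sharing one score vector keeps both kinds of answer in a single argmax, which is the only design point this sketch is meant to illustrate.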

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Why and When Visual Token Pruning Fails? A Study on Relevant Visual Information Shift in MLLMs Decoding
cs.CV 2026-04 unverdicted novelty 7.0

Visual token pruning in MLLMs fails on complex reasoning due to Relevant Visual Information Shift during decoding, but the DSTP framework fixes it training-free across models.

Fourier Compressor: Frequency-Domain Visual Token Compression for Vision-Language Models
cs.CV 2025-08 unverdicted novelty 6.0

Fourier Compressor uses FFT to remove frequency-domain redundancy from visual tokens in VLMs, retaining over 96% accuracy with up to 83.8% FLOP reduction.

Preserving Knowledge in Large Language Model with Model-Agnostic Self-Decompression
cs.CL 2024-06 unverdicted novelty 5.0

Introduces Tree Generation (TG-SFT) to generate synthetic instruction-tuning data from LLMs, reducing catastrophic forgetting when fine-tuning MLLMs on domain-specific or multimodal data.