Learning to Count Objects in Natural Images for Visual Question Answering

Adam Prügel-Bennett; Jonathon Hare; Yan Zhang

arxiv 1802.05766 v1 pith:76MCJFKS submitted 2018-02-15 cs.CV cs.CL

Yan Zhang, Jonathon Hare, Adam Prügel-Bennett

classification cs.CV cs.CL

keywords component, counting, models, answering, images, natural, objects, problem

Abstract

Visual Question Answering (VQA) models have struggled with counting objects in natural images so far. We identify a fundamental problem due to soft attention in these models as a cause. To circumvent this problem, we propose a neural network component that allows robust counting from object proposals. Experiments on a toy task show the effectiveness of this component and we obtain state-of-the-art accuracy on the number category of the VQA v2 dataset without negatively affecting other categories, even outperforming ensemble models with our single model. On a difficult balanced pair metric, the component gives a substantial improvement in counting over a strong baseline by 6.6%.
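The soft-attention problem mentioned in the abstract can be illustrated with a small sketch. It assumes the usual formulation in which attention weights are softmax-normalised and the attended feature is a weighted average of per-object features. The two-dimensional feature vectors, the scores, and the helper names `softmax` and `attend` are hypothetical and chosen only for illustration; they are not taken from the paper.

```python
import math


def softmax(scores):
    """Normalise raw attention scores so the weights sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(features, scores):
    """Soft attention: softmax-weighted average of the feature vectors."""
    weights = softmax(scores)
    dim = len(features[0])
    return [sum(w * f[d] for w, f in zip(weights, features)) for d in range(dim)]


# Hypothetical toy features: one vector per object proposal.
CAT = [1.0, 0.0]
BACKGROUND = [0.0, 1.0]

# High score = "this proposal is relevant to the question".
one_cat = attend([CAT, BACKGROUND], [10.0, -10.0])
two_cats = attend([CAT, CAT, BACKGROUND], [10.0, 10.0, -10.0])

# Both attended features are (almost exactly) the CAT vector, so the
# number of cats can no longer be read off the attended feature.
print(one_cat)
print(two_cats)
```

Because the weights are normalised to sum to one, attending to two identical objects averages their features back to the feature of a single object. In this toy setting, that is why a count cannot be recovered from the attended feature alone.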

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

The MixCount Dataset: Bridging the Data Gap for Open-Vocabulary Object Counting
cs.CV 2026-05 conditional novelty 7.0

MixCount provides a scalable synthetic dataset for mixed-object counting that improves state-of-the-art models on real benchmarks, cutting MAE by 20.14% on FSC-147 and 18.3% on PairTally.

HoloCount: A Holistic Visual Counting Benchmark for MLLMs
cs.CV 2026-07 conditional novelty 6.0

HoloCount is a three-tier visual counting benchmark showing that MLLMs fail systematically on analytical reasoning, high-density scenes, and linguistic prior conflicts, with even the best models dropping below 50% acc...

Blind-Spots-Bench: Evaluating Blind Spots in Multimodal Models
cs.AI 2026-07 conditional novelty 5.0

A 235-item multimodal stress-test shows frontier closed models outpace open-weight peers by ~10% and leaves shared failures on counting, spatial, and character-level tasks.