Learning to Count Objects in Natural Images for Visual Question Answering
Visual Question Answering (VQA) models have struggled with counting objects in natural images so far. We identify a fundamental problem due to soft attention in these models as a cause. To circumvent this problem, we propose a neural network component that allows robust counting from object proposals. Experiments on a toy task show the effectiveness of this component and we obtain state-of-the-art accuracy on the number category of the VQA v2 dataset without negatively affecting other categories, even outperforming ensemble models with our single model. On a difficult balanced pair metric, the component gives a substantial improvement in counting over a strong baseline by 6.6%.
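The soft-attention problem named in the abstract can be illustrated with a toy sketch. It assumes the common formulation in which attention weights are softmax-normalized and the attended feature is a weighted sum over object proposals. The feature vectors, scores, and function names below are hypothetical and are not taken from the paper's code.

```python
import math

def softmax(scores):
    """Normalize raw attention scores so the weights sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(features, scores):
    """Soft attention: weighted sum of proposal features."""
    weights = softmax(scores)
    dim = len(features[0])
    return [sum(w * f[i] for w, f in zip(weights, features)) for i in range(dim)]

# Hypothetical 2-d proposal features (illustrative values only).
CAT = [1.0, 0.0]
BACKGROUND = [0.0, 1.0]

# One cat vs. two cats, each with a high attention score.
one_cat = attend([CAT, BACKGROUND], [10.0, -10.0])
two_cats = attend([CAT, CAT, BACKGROUND], [10.0, 10.0, -10.0])

# Because the weights are normalized, both attended vectors are
# (almost exactly) the single cat feature: the count is averaged away.
max_diff = max(abs(a - b) for a, b in zip(one_cat, two_cats))
print(one_cat, two_cats, max_diff)
```

In this sketch the two attended vectors are nearly indistinguishable, so any layer that sees only the attended feature cannot tell one cat from two. That is the kind of information loss a component counting directly from object proposals is meant to avoid.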
Forward citations
Cited by 1 Pith paper
- The MixCount Dataset: Bridging the Data Gap for Open-Vocabulary Object Counting
MixCount provides a scalable synthetic dataset for mixed-object counting that improves state-of-the-art models on real benchmarks, cutting MAE by 20.14% on FSC-147 and 18.3% on PairTally.