Simple Baseline for Visual Question Answering
classification
cs.CV
cs.CL
keywords
baseline, question, answering, features, simple, visual, answer, approaches
We describe a very simple bag-of-words baseline for visual question answering. This baseline concatenates the word features from the question and the CNN features from the image to predict the answer. When evaluated on the challenging VQA dataset [2], it shows performance comparable to many recent approaches that use recurrent neural networks. To explore the strengths and weaknesses of the trained model, we also provide an interactive web demo and open-source code.
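The pipeline the abstract describes (bag-of-words question features, concatenated with image features, fed to a classifier over candidate answers) can be illustrated with a minimal sketch. This is not the authors' implementation: the vocabulary, the answer list, the three-dimensional stand-in for CNN features, the random weights, and all function names are hypothetical, and the linear softmax classifier is an assumption about what "predict the answer" might look like in its simplest form.

```python
import math
import random


def bag_of_words(question, vocab):
    """Count how often each vocabulary word occurs in the question."""
    vec = [0.0] * len(vocab)
    for token in question.lower().rstrip("?").split():
        idx = vocab.get(token)
        if idx is not None:  # out-of-vocabulary words are ignored
            vec[idx] += 1.0
    return vec


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict_answer(question, image_features, vocab, weights, bias, answers):
    """Concatenate question and image features, then score each answer."""
    x = bag_of_words(question, vocab) + list(image_features)  # concatenation
    scores = [
        sum(w * xi for w, xi in zip(row, x)) + b
        for row, b in zip(weights, bias)
    ]
    probs = softmax(scores)
    best = max(range(len(answers)), key=probs.__getitem__)
    return answers[best], probs


# Hypothetical toy setup; a real system would learn the weights from data
# and take image_features from a pretrained CNN.
vocab = {"what": 0, "color": 1, "is": 2, "the": 3, "cat": 4}
answers = ["black", "two", "yes"]
image_features = [0.2, 0.7, 0.1]  # stand-in for CNN features

rng = random.Random(0)
dim = len(vocab) + len(image_features)
weights = [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in answers]
bias = [0.0] * len(answers)

answer, probs = predict_answer(
    "What color is the cat?", image_features, vocab, weights, bias, answers
)
print(answer, probs)
```

Because the weights here are random, the predicted answer is meaningless; the sketch only shows how the two feature vectors are joined into a single input before classification.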
Forward citations
Cited by 1 Pith paper
- Deep Modular Co-Attention Networks for Visual Question Answering
MCAN stacks modular co-attention layers to reach 70.63% accuracy on VQA-v2 test-dev, outperforming prior state-of-the-art models.