ABC-CNN: An Attention Based Convolutional Neural Network for Visual Question Answering

Haoyuan Gao; Jiang Wang; Kan Chen; Liang-Chieh Chen; Ram Nevatia; Wei Xu

arxiv: 1511.05960 · v2 · pith:JJRSOJ6Knew · submitted 2015-11-18 · 💻 cs.CV

Kan Chen, Jiang Wang, Liang-Chieh Chen, Haoyuan Gao, Wei Xu, Ram Nevatia

classification 💻 cs.CV

keywords attention, abc-cnn, question, image, convolutional, regions, answering, architecture

We propose a novel attention-based deep learning architecture for the visual question answering (VQA) task. Given an image and a natural language question about that image, a VQA system generates a natural language answer to the question. Generating the correct answer requires the model's attention to focus on the image regions relevant to the question, because different questions inquire about the attributes of different image regions. We introduce an attention-based configurable convolutional neural network (ABC-CNN) to learn such question-guided attention. ABC-CNN determines an attention map for an image-question pair by convolving the image feature map with configurable convolutional kernels derived from the question's semantics. We evaluate the ABC-CNN architecture on three benchmark VQA datasets: Toronto COCO-QA, DAQUAR, and the VQA dataset. The ABC-CNN model achieves significant improvements over state-of-the-art methods on these datasets. The question-guided attention generated by ABC-CNN is also shown to reflect the regions that are highly relevant to the questions.
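
The core idea in the abstract, an attention map obtained by convolving the image feature map with a kernel derived from the question, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function name `question_guided_attention`, the projection matrix `proj`, the sigmoid squashing of the kernel, the 3x3 kernel size, and the spatial softmax normalization are all assumptions made for illustration.

```python
import numpy as np


def question_guided_attention(feature_map, question_vec, proj, kernel_size=3):
    """Sketch of question-configured attention (hypothetical helper).

    feature_map:  (H, W, C) image feature map, e.g. from a CNN.
    question_vec: (D,) dense embedding of the question.
    proj:         (kernel_size * kernel_size * C, D) projection matrix
                  mapping the question embedding to kernel weights
                  (an assumed parameterization).
    Returns an (H, W) attention map that sums to 1.
    """
    h, w, c = feature_map.shape

    # Derive the configurable kernel from the question embedding.
    # The sigmoid squashing is an assumption of this sketch.
    kernel = 1.0 / (1.0 + np.exp(-(proj @ question_vec)))
    kernel = kernel.reshape(kernel_size, kernel_size, c)

    # Zero-pad so the attention map keeps the spatial size of the input.
    pad = kernel_size // 2
    padded = np.pad(feature_map, ((pad, pad), (pad, pad), (0, 0)))

    # Slide the question-derived kernel over the feature map.
    scores = np.empty((h, w))
    for i in range(h):
        for j in range(w):
            patch = padded[i:i + kernel_size, j:j + kernel_size, :]
            scores[i, j] = np.sum(patch * kernel)

    # Normalize over all spatial positions (softmax is an assumption).
    scores = scores - scores.max()
    attn = np.exp(scores)
    return attn / attn.sum()


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    fm = rng.normal(size=(4, 5, 8))          # toy 4x5 map with 8 channels
    q = rng.normal(size=(16,))               # toy question embedding
    proj = rng.normal(size=(3 * 3 * 8, 16))  # toy projection matrix
    attn = question_guided_attention(fm, q, proj)
    attended = fm * attn[..., None]          # reweight features by attention
    print(attn.shape, attended.shape)
```

Because the kernel is recomputed from each question, the same image yields a different attention map for each question, which is the behavior the abstract describes as question-guided attention.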

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Deep Modular Co-Attention Networks for Visual Question Answering
cs.CV 2019-06 conditional novelty 7.0

MCAN stacks modular co-attention layers to reach 70.63% accuracy on VQA-v2 test-dev, outperforming prior state-of-the-art models.

DragNUWA: Fine-grained Control in Video Generation by Integrating Text, Image, and Trajectory
cs.CV 2023-08 unverdicted novelty 6.0

DragNUWA integrates text, image, and trajectory controls into a diffusion video model using a Trajectory Sampler, Multiscale Fusion, and Adaptive Training to enable fine-grained open-domain video generation.