Recognition: unknown
Show, Attend and Tell: Neural Image Caption Generation with Visual Attention
read the original abstract
Inspired by recent work in machine translation and object detection, we introduce an attention based model that automatically learns to describe the content of images. We describe how we can train this model in a deterministic manner using standard backpropagation techniques and stochastically by maximizing a variational lower bound. We also show through visualization how the model is able to automatically learn to fix its gaze on salient objects while generating the corresponding words in the output sequence. We validate the use of attention with state-of-the-art performance on three benchmark datasets: Flickr8k, Flickr30k and MS COCO.
This paper has not been read by Pith yet.
Forward citations
Cited by 3 Pith papers
-
TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension
TriviaQA is a new large-scale dataset for reading comprehension that features complex compositional questions, high lexical variability, and cross-sentence reasoning requirements, where current baselines reach only 40...
-
Categorical Reparameterization with Gumbel-Softmax
Gumbel-Softmax provides a continuous relaxation of categorical sampling that anneals to discrete samples for gradient-based optimization.
-
Heuristic Style Transfer for Real-Time, Efficient Weather Attribute Detection
Lightweight multi-task models using Gram matrices and PatchGAN-style architectures detect 53 weather classes from RGB images with F1 scores above 96% internally and 78% zero-shot externally, supported by a new 503k-im...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.