Translating Videos to Natural Language Using Deep Recurrent Neural Networks

Huijuan Xu; Jeff Donahue; Kate Saenko; Marcus Rohrbach; Raymond Mooney; Subhashini Venugopalan

arxiv: 1412.4729 · v3 · pith:5XAXNVYHnew · submitted 2014-12-15 · 💻 cs.CV · cs.CL

keywords: deep, images, language, videos, been, goal, grounding, natural

Solving the visual symbol grounding problem has long been a goal of artificial intelligence. The field appears to be advancing closer to this goal with recent breakthroughs in deep learning for natural language grounding in static images. In this paper, we propose to translate videos directly to sentences using a unified deep neural network with both convolutional and recurrent structure. Described video datasets are scarce, and most existing methods have been applied to toy domains with a small vocabulary of possible words. By transferring knowledge from 1.2M+ images with category labels and 100,000+ images with captions, our method is able to create sentence descriptions of open-domain videos with large vocabularies. We compare our approach with recent work using language generation metrics, subject, verb, and object prediction accuracy, and a human evaluation.
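The pipeline the abstract describes (convolutional features from video frames feeding a recurrent sentence generator) can be illustrated with a toy sketch. Everything specific here is an assumption for illustration: the abstract does not say how frames are aggregated, so mean pooling is a stand-in; a plain tanh recurrent cell stands in for whatever recurrent unit the authors use; the weights are random rather than trained; and the names `mean_pool`, `init_params`, and `describe_video` are hypothetical.

```python
import math
import random


def mean_pool(frame_features):
    """Average per-frame feature vectors into one video-level vector (assumed aggregation)."""
    n = len(frame_features)
    return [sum(col) / n for col in zip(*frame_features)]


def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in matrix]


def init_params(feat_dim, embed_dim, hidden_dim, vocab_size, seed=0):
    """Random, untrained weights; a real model would learn these from captioned data."""
    rng = random.Random(seed)

    def mat(rows, cols):
        return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

    return {
        "E": mat(vocab_size, embed_dim),      # word embeddings
        "W_xh": mat(hidden_dim, embed_dim),   # previous word -> hidden
        "W_vh": mat(hidden_dim, feat_dim),    # video vector -> hidden
        "W_hh": mat(hidden_dim, hidden_dim),  # hidden -> hidden
        "W_hy": mat(vocab_size, hidden_dim),  # hidden -> word scores
    }


def describe_video(frame_features, params, bos_id, eos_id, max_len=10):
    """Greedily decode word ids from pooled frame features with a tanh recurrent cell."""
    video_vec = mean_pool(frame_features)
    visual = matvec(params["W_vh"], video_vec)  # same visual input at every step
    hidden = [0.0] * len(params["W_hh"])
    token, words = bos_id, []
    for _ in range(max_len):
        word_in = matvec(params["W_xh"], params["E"][token])
        recur = matvec(params["W_hh"], hidden)
        hidden = [math.tanh(a + b + c) for a, b, c in zip(word_in, visual, recur)]
        scores = matvec(params["W_hy"], hidden)
        token = max(range(len(scores)), key=scores.__getitem__)  # greedy choice
        if token == eos_id:
            break
        words.append(token)
    return words
```

Calling `describe_video(frames, init_params(8, 4, 6, 12), bos_id=0, eos_id=1)` on a list of 8-dimensional frame vectors returns a list of word ids; with untrained weights the ids are meaningless, so the sketch only shows the data flow from frames to a word sequence.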

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Demystifying CLIP Data
cs.CV · 2023-09 · accept · novelty 6.0

MetaCLIP curates balanced 400M-pair subsets from CommonCrawl that outperform CLIP data, reaching 70.8% zero-shot ImageNet accuracy on ViT-B versus CLIP's 68.3%.

A Deep Decoder Structure Based on WordEmbedding Regression for An Encoder-Decoder Based Model for Image Captioning
cs.CV · 2019-06 · unverdicted · novelty 6.0

The authors replace next-word log-likelihood training with word-embedding regression in an encoder-decoder captioning model and report CIDEr 125.0 and BLEU-4 50.5 on MS-COCO, exceeding prior bests of 117.1 and 48.0.

GIT: A Generative Image-to-text Transformer for Vision and Language
cs.CV · 2022-05 · unverdicted · novelty 5.0

GIT achieves new state-of-the-art results on 12 vision-language benchmarks, including surpassing human performance on TextCaps, via a simplified single-encoder single-decoder transformer scaled on large pre-training data.