Neural Language Modeling with Visual Features
Multimodal language models attempt to incorporate non-linguistic features for the language modeling task. In this work, we extend a standard recurrent neural network (RNN) language model with features derived from videos. We train our models on data that is two orders of magnitude larger than the datasets used in prior work. We perform a thorough exploration of model architectures for combining visual and text features. Our experiments on two corpora (YouCookII and 20bn-something-something-v2) show that the best performing architecture consists of middle fusion of visual and text features, yielding over 25% relative improvement in perplexity. We report analysis that provides insights into why our multimodal language model improves upon a standard RNN language model.
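To make the "relative improvement in perplexity" figure concrete, here is a minimal sketch of how such a number is usually computed. Perplexity is taken as the exponential of the average negative log-probability per token, and the relative improvement is the drop in perplexity divided by the baseline perplexity. The function names and the per-token probabilities below are hypothetical illustrations, not values or code from the paper.

```python
import math


def perplexity(token_log_probs):
    """Perplexity = exp(average negative natural-log probability per token)."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


def relative_improvement(baseline_ppl, new_ppl):
    """Fraction by which new_ppl is lower than baseline_ppl."""
    return (baseline_ppl - new_ppl) / baseline_ppl


# Hypothetical per-token probabilities (NOT results from the paper):
# a text-only model assigning 0.10 to each reference token, and a
# multimodal model assigning 0.14 to each reference token.
text_only_log_probs = [math.log(0.10)] * 4
multimodal_log_probs = [math.log(0.14)] * 4

baseline = perplexity(text_only_log_probs)
fused = perplexity(multimodal_log_probs)
gain = relative_improvement(baseline, fused)

print(f"baseline perplexity:   {baseline:.2f}")
print(f"multimodal perplexity: {fused:.2f}")
print(f"relative improvement:  {gain:.1%}")
```

Because lower perplexity is better, a positive relative improvement means the multimodal model assigns higher probability to the reference text than the baseline does.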
Forward citations
Cited by 1 Pith paper
- Vision-Assisted Foundation Model for Solving Multi-Task Vehicle Routing Problems
VaFM encodes constraint-specific VRP images via CNN into patch embeddings fused with graph nodes, using an auxiliary task to handle pixel imbalance, and reports better performance than prior methods on 16 VRP variants.