Decomposing Motion and Content for Natural Video Sequence Prediction

Honglak Lee; Jimei Yang; Ruben Villegas; Seunghoon Hong; Xunyu Lin

arxiv: 1706.08033 · v2 · pith:RRNQ26LZnew · submitted 2017-06-25 · 💻 cs.CV

Ruben Villegas, Jimei Yang, Seunghoon Hong, Xunyu Lin, Honglak Lee

classification 💻 cs.CV

keywords content, motion, prediction, network, videos, dynamics, model, natural

We propose a deep neural network for the prediction of future frames in natural video sequences. To effectively handle complex evolution of pixels in videos, we propose to decompose the motion and content, two key components generating dynamics in videos. Our model is built upon the Encoder-Decoder Convolutional Neural Network and Convolutional LSTM for pixel-level prediction, which independently capture the spatial layout of an image and the corresponding temporal dynamics. By independently modeling motion and content, predicting the next frame reduces to converting the extracted content features into the next frame content by the identified motion features, which simplifies the task of prediction. Our model is end-to-end trainable over multiple time steps, and naturally learns to decompose motion and content without separate training. We evaluate the proposed network architecture on human activity videos using KTH, Weizmann action, and UCF-101 datasets. We show state-of-the-art performance in comparison to recent approaches. To the best of our knowledge, this is the first end-to-end trainable network architecture with motion and content separation to model the spatiotemporal dynamics for pixel-level future prediction in natural videos.
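
The two-pathway idea in the abstract can be illustrated with a toy sketch of the data flow. This is not the authors' network: the learned convolutional encoders, the Convolutional LSTM, and the decoder are replaced here by hand-written stand-ins, and every name below (`encode_content`, `encode_motion`, `predict_next`, `rollout`, the `decay` parameter) is hypothetical. The sketch also assumes that the motion pathway is fed frame-to-frame differences while the content pathway sees only the most recent frame, which the abstract does not state explicitly.

```python
from typing import List

Frame = List[float]  # a flattened grayscale frame (toy representation, an assumption)


def encode_content(last_frame: Frame) -> Frame:
    """Content pathway: looks only at the most recent frame (spatial layout)."""
    return list(last_frame)


def encode_motion(frames: List[Frame], decay: float = 0.5) -> Frame:
    """Motion pathway: a recurrent summary of frame-to-frame differences.

    A running average stands in for a learned recurrent state; it is not a ConvLSTM.
    """
    if len(frames) < 2:
        raise ValueError("need at least two frames to observe motion")
    # Initialise the state with the first observed difference.
    state = [b - a for a, b in zip(frames[0], frames[1])]
    # Update the state with each later difference, one time step at a time.
    for prev, curr in zip(frames[1:], frames[2:]):
        diff = [c - p for p, c in zip(prev, curr)]
        state = [decay * s + (1.0 - decay) * d for s, d in zip(state, diff)]
    return state


def predict_next(frames: List[Frame]) -> Frame:
    """Combine the two pathways: shift the content by the summarised motion."""
    content = encode_content(frames[-1])
    motion = encode_motion(frames)
    return [c + m for c, m in zip(content, motion)]


def rollout(frames: List[Frame], steps: int) -> List[Frame]:
    """Multi-step prediction: each predicted frame is fed back as input."""
    history = [list(f) for f in frames]
    predictions = []
    for _ in range(steps):
        nxt = predict_next(history)
        predictions.append(nxt)
        history.append(nxt)
    return predictions


if __name__ == "__main__":
    # Three pixels whose intensities change at a constant rate.
    observed = [[0.0, 1.0, 2.0], [0.5, 1.0, 1.5], [1.0, 1.0, 1.0]]
    print(predict_next(observed))  # [1.5, 1.0, 0.5]
```

In this toy setting the prediction is exact only because the pixel intensities change at a constant rate; the point is the separation of roles, where one function never sees temporal information and the other never sees absolute appearance.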

Forward citations

Cited by 7 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

DriveLaW: Unifying Planning and Video Generation in a Latent Driving World
cs.CV 2025-12 unverdicted novelty 6.0

DriveLaW unifies video world modeling and trajectory planning by injecting video-generator latents into a diffusion planner, achieving SOTA video prediction and a new record on the NAVSIM planning benchmark.

Order Matters: Shuffling Sequence Generation for Video Prediction
cs.CV 2019-07 unverdicted novelty 6.0

SEE-Net improves video prediction by using frame shuffling to enforce learning of natural temporal order, reporting state-of-the-art results on three synthetic and real-world datasets.

Multi-Modal World Model for Physical Robot Interactions: Simultaneous Visual and Tactile Predictions for Enhanced Accuracy
cs.RO 2023-04 unverdicted novelty 5.0

Visuo-tactile world models improve prediction accuracy in physically ambiguous robot-pushing scenarios, demonstrated on two new datasets with a magnetic tactile sensor.

World Action Models: The Next Frontier in Embodied AI
cs.RO 2026-05 unverdicted novelty 4.0

The paper introduces World Action Models as a new paradigm unifying predictive world modeling with action generation in embodied foundation models and provides a taxonomy of existing approaches.

Frame forecasting in cine MRI using the PCA respiratory motion model: comparing recurrent neural networks trained online and transformers
eess.IV 2024-10 unverdicted novelty 4.0

Online RNNs (RTRL, SnAp-1) beat linear filters and transformers at medium-to-long horizon forecasting of PCA respiratory motion weights in two cine-MRI datasets, yielding sub-1.4 mm and sub-2.8 mm geometric errors.

Planning Robot Motion using Deep Visual Prediction
cs.RO 2019-06 unverdicted novelty 3.0

PROM-Net performs unsupervised visual prediction of robot motion from raw frames and integrates the predictions into model predictive control for navigation in unknown dynamic settings.

Evolution of Video Generative Foundations
cs.CV 2026-04 unverdicted novelty 2.0

This survey traces video generation technology from GANs to diffusion models and then to autoregressive and multimodal approaches while analyzing principles, strengths, and future trends.