SSM Meets Video Diffusion Models: Efficient Long-Term Video Generation with Structured State Spaces

Masahiro Suzuki; Shohei Taniguchi; Yutaka Matsuo; Yuta Oshima

arxiv: 2403.07711 · v5 · pith:67J25CW3new · submitted 2024-03-12 · 💻 cs.CV · cs.AI

Yuta Oshima, Shohei Taniguchi, Masahiro Suzuki, Yutaka Matsuo

classification 💻 cs.CV cs.AI

keywords models, video, generation, diffusion, ssms, attention, features, memory

Given the remarkable achievements of diffusion models in image generation, the research community has shown increasing interest in extending these models to video generation. Recent diffusion models for video generation have predominantly used attention layers to extract temporal features. However, the computational cost of attention layers grows quadratically with sequence length, which poses significant challenges when generating longer video sequences with diffusion models. To overcome this challenge, we propose using state-space models (SSMs) as temporal feature extractors. SSMs (e.g., Mamba) have recently gained attention as promising alternatives because their memory consumption scales linearly with sequence length. In line with previous research suggesting that bidirectional SSMs are effective for understanding spatial features in image generation, we found that bidirectionality is also beneficial for capturing temporal features in video data, compared with relying on traditional unidirectional SSMs. We conducted comprehensive evaluations on multiple long-term video datasets, such as MineRL Navigate, across various model sizes. For sequences of up to 256 frames, SSM-based models require less memory to achieve the same FVD as attention-based models. Moreover, SSM-based models often deliver better performance with comparable GPU memory usage. Our code is available at https://github.com/shim0114/SSM-Meets-Video-Diffusion-Models.
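The contrast the abstract draws between unidirectional and bidirectional SSMs over the time axis can be illustrated with a toy example. The following is a minimal sketch, not the authors' implementation: it assumes a diagonal linear SSM with fixed random parameters, and the names `ssm_scan`, `bidirectional_ssm`, and `random_params` are hypothetical. The recurrent state has a fixed size regardless of the number of frames T, whereas a naive temporal attention layer would materialize a T x T score matrix.

```python
import numpy as np

def ssm_scan(x, a, b, c):
    """Toy diagonal linear SSM: h_t = a * h_{t-1} + b * x_t, y_t = sum_n c * h_t.

    x: (T, D) sequence of T frames with D features each.
    a, b, c: (D, N) per-feature state parameters (illustrative assumption).
    """
    T, D = x.shape
    h = np.zeros_like(a)              # state is (D, N); its size does not depend on T
    y = np.empty((T, D))
    for t in range(T):                # single pass over time: T sequential steps
        h = a * h + b * x[t][:, None]
        y[t] = (c * h).sum(axis=1)
    return y

def bidirectional_ssm(x, fwd, bwd):
    """Add a forward scan to a scan over the time-reversed sequence."""
    y_fwd = ssm_scan(x, *fwd)
    y_bwd = ssm_scan(x[::-1], *bwd)[::-1]   # flip back to the original frame order
    return y_fwd + y_bwd

def random_params(rng, D, N):
    """Hypothetical parameter initialisation; decay values kept below 1 for stability."""
    a = rng.uniform(0.5, 0.9, (D, N))
    b = rng.standard_normal((D, N))
    c = rng.standard_normal((D, N))
    return a, b, c

rng = np.random.default_rng(0)
T, D, N = 256, 8, 4
x = rng.standard_normal((T, D))
fwd, bwd = random_params(rng, D, N), random_params(rng, D, N)
y = bidirectional_ssm(x, fwd, bwd)    # shape (T, D): one output per frame
```

In this sketch the forward scan alone gives each frame access only to earlier frames, while the bidirectional variant lets every frame depend on both past and future frames, which is the property the abstract reports as beneficial for temporal features.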

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

DIM-WAM: World-Action Modeling with Diverse Historical Event Memory
cs.RO 2026-06 unverdicted novelty 6.0

DiM-WAM is a memory-augmented world-action model that integrates multi-scale historical events and global task progress to improve long-horizon robot manipulation performance.

Unified World Models: Coupling Video and Action Diffusion for Pretraining on Large Robotic Datasets
cs.RO 2025-04 unverdicted novelty 6.0

Unified World Models couple video and action diffusion inside one transformer with independent timesteps, enabling pretraining on heterogeneous robot datasets that include action-free video and producing more generali...

A Survey of Mamba
cs.LG 2024-08 unverdicted novelty 2.0

The paper consolidates existing research on Mamba models, their architecture variants, adaptations to different data modalities, and applications across domains.