Scalable diffusion models with transformers

William Peebles, Saining Xie · 2023

3 Pith papers cite this work. Polarity classification is still indexing.

representative citing papers

Direct Product Flow Matching: Decoupling Radial and Angular Dynamics for Few-Shot Adaptation

cs.CV · 2026-05-06 · unverdicted · novelty 7.0

DP-FM decouples radial and angular dynamics on a cylindrical manifold via a constant-warping metric and classifier-free guidance, achieving state-of-the-art multi-step few-shot adaptation of vision-language models on 11 benchmarks.

CT-1: Vision-Language-Camera Models Transfer Spatial Reasoning Knowledge to Camera-Controllable Video Generation

cs.CV · 2026-04-10 · unverdicted · novelty 7.0

CT-1 transfers spatial reasoning from vision-language models to estimate camera trajectories, which then condition a video diffusion model with wavelet regularization to produce controllable videos. The authors claim 25.7% better accuracy than prior methods.

Pyramid Forcing: Head-Aware Pyramid KV Cache Policy for High-Quality Long Video Generation

cs.CV · 2026-05-13 · unverdicted · novelty 6.0

Pyramid Forcing classifies attention heads into Anchor, Wave, and Veil types and applies type-specific KV cache policies to improve long-horizon autoregressive video generation quality.

citing papers explorer

Showing 3 of 3 citing papers.

Direct Product Flow Matching: Decoupling Radial and Angular Dynamics for Few-Shot Adaptation

cs.CV · 2026-05-06 · unverdicted · none · ref 27

CT-1: Vision-Language-Camera Models Transfer Spatial Reasoning Knowledge to Camera-Controllable Video Generation

cs.CV · 2026-04-10 · unverdicted · none · ref 22

Pyramid Forcing: Head-Aware Pyramid KV Cache Policy for High-Quality Long Video Generation

cs.CV · 2026-05-13 · unverdicted · none · ref 4
