Scalable diffusion models with transformers
3 Pith papers cite this work. Polarity classification is still indexing.
fields: cs.CV (3)
years: 2026 (3)
verdicts: UNVERDICTED (3)
representative citing papers
CT-1 transfers spatial reasoning from vision-language models to estimate camera trajectories, which are then used in a video diffusion model with wavelet regularization to produce controllable videos, claiming 25.7% better accuracy than prior methods.
Pyramid Forcing classifies attention heads into Anchor, Wave, and Veil types and applies type-specific KV cache policies to improve long-horizon autoregressive video generation quality.
citing papers explorer
- Direct Product Flow Matching: Decoupling Radial and Angular Dynamics for Few-Shot Adaptation
  DP-FM decouples radial and angular dynamics on a cylindrical manifold via a constant-warping metric and classifier-free guidance to achieve state-of-the-art multi-step few-shot adaptation of vision-language models on 11 benchmarks.
- CT-1: Vision-Language-Camera Models Transfer Spatial Reasoning Knowledge to Camera-Controllable Video Generation
  CT-1 transfers spatial reasoning from vision-language models to estimate camera trajectories, which are then used in a video diffusion model with wavelet regularization to produce controllable videos, claiming 25.7% better accuracy than prior methods.
- Pyramid Forcing: Head-Aware Pyramid KV Cache Policy for High-Quality Long Video Generation
  Pyramid Forcing classifies attention heads into Anchor, Wave, and Veil types and applies type-specific KV cache policies to improve long-horizon autoregressive video generation quality.