hub Canonical reference

One Step Diffusion via Shortcut Models

Kevin Frans, Danijar Hafner, Sergey Levine, Pieter Abbeel · 2024 · cs.LG · arXiv 2410.12557

Canonical reference. 89% of citing Pith papers cite this work as background.

48 Pith papers citing it

Background 89% of classified citations

open full Pith review browse 48 citing papers arXiv PDF

abstract

Diffusion models and flow-matching models have enabled generating diverse and realistic images by learning to transfer noise to data. However, sampling from these models involves iterative denoising over many neural network passes, making generation slow and expensive. Previous approaches for speeding up sampling require complex training regimes, such as multiple training phases, multiple networks, or fragile scheduling. We introduce shortcut models, a family of generative models that use a single network and training phase to produce high-quality samples in a single or multiple sampling steps. Shortcut models condition the network not only on the current noise level but also on the desired step size, allowing the model to skip ahead in the generation process. Across a wide range of sampling step budgets, shortcut models consistently produce higher quality samples than previous approaches, such as consistency models and reflow. Compared to distillation, shortcut models reduce complexity to a single network and training phase and additionally allow varying step budgets at inference time.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 8 baseline 1

citation-polarity summary

background 8 baseline 1

representative citing papers

How to Guide Your Flow: Few-Step Alignment via Flow Map Reward Guidance

cs.LG · 2026-04-29 · unverdicted · novelty 8.0 · 3 refs

FMRG reformulates guidance as deterministic optimal control, deriving a single-trajectory method using the flow map that matches or exceeds baselines on reward-guided generation and inverse problems with 3 NFEs at text-to-image scale.

FlexiSLM: A Dynamic and Controllable Frame Rate Spoken Language Model

cs.SD · 2026-06-30 · unverdicted · novelty 7.0

FlexiSLM is the first spoken language model supporting dynamic and controllable frame rates on speech input and output, outperforming fixed-rate 7B models at high quality and enabling faster inference at lower rates like 6.25 Hz.

MUSE: Unlocking Timestep as Native Task Steering for One-Step Dense Prediction

cs.CV · 2026-06-29 · unverdicted · novelty 7.0

MUSE shows that the native timestep embedding in diffusion models acts as a parameter-free steering signal for multi-task monocular depth and normal estimation via manifold decoupling in latent space.

Hacking Generative Perplexity: Why Unconditional Text Evaluation Needs Distributional Metrics

cs.CL · 2026-06-07 · accept · novelty 7.0

Naive samplers beat published diffusion and flow models on gen-PPL with incoherent output, proving the metric unsound and motivating distributional evaluation suites.

Multimarginal flow matching with optimal transport potentials

cs.LG · 2026-06-03 · unverdicted · novelty 7.0

OTP-FM extends conditional flow matching by incorporating dynamic optimal transport potentials to enable efficient multimarginal transport learning with intermediate observed marginals.

HASTE: Training-Free Video Diffusion Acceleration via Head-Wise Adaptive Sparse Attention

cs.CV · 2026-05-14 · unverdicted · novelty 7.0

HASTE delivers up to 1.93x speedup on Wan2.1 video DiTs via head-wise adaptive sparse attention using temporal mask reuse and error-guided per-head calibration while preserving video quality.

Realtime-VLA FLASH: Speculative Inference Framework for Diffusion-based VLAs

cs.RO · 2026-05-13 · unverdicted · novelty 7.0

A new speculative inference system speeds up diffusion VLAs to 19.1 ms average latency (3.04x faster) on LIBERO by replacing most full 58 ms inferences with 7.8 ms draft rounds while preserving task performance.

DriftXpress: Faster Drifting Models via Projected RKHS Fields

cs.LG · 2026-05-12 · unverdicted · novelty 7.0

DriftXpress approximates drifting kernels via projected RKHS fields to lower training cost of one-step generative models while matching original FID scores.

Isokinetic Flow Matching for Pathwise Straightening of Generative Flows

cs.LG · 2026-04-06 · unverdicted · novelty 7.0

Isokinetic Flow Matching adds a lightweight regularization term to flow matching that penalizes acceleration along paths via self-guided finite differences, yielding straighter trajectories and large gains in few-step sampling quality on CIFAR-10.

VOSR: A Vision-Only Generative Model for Image Super-Resolution

cs.CV · 2026-04-03 · conditional · novelty 7.0

VOSR shows that competitive generative image super-resolution with faithful structures can be achieved by training a diffusion-style model from scratch on visual data alone, using a vision encoder for guidance and a restoration-oriented sampling strategy.

DisCa: Accelerating Video Diffusion Transformers with Distillation-Compatible Learnable Feature Caching

cs.CV · 2026-02-05 · unverdicted · novelty 7.0

DisCa replaces heuristic feature caching with a lightweight learnable neural predictor compatible with distillation, achieving 11.8× acceleration on video diffusion transformers with preserved generation quality.

Training Agents Inside of Scalable World Models

cs.AI · 2025-09-29 · conditional · novelty 7.0

Dreamer 4 is the first agent to obtain diamonds in Minecraft from only offline data by reinforcement learning inside a scalable world model that accurately predicts game mechanics.

Lipschitz-Guided Design of Interpolation Schedules in Generative Models

stat.ML · 2025-09-01 · unverdicted · novelty 7.0

Minimizing averaged squared Lipschitzness of the drift produces interpolation schedules that improve numerical accuracy and mitigate mode collapse in generative models, with closed-form optima for Gaussians and validation on stochastic PDEs.

In-Context Edit: Enabling Instructional Image Editing with In-Context Generation in Large Scale Diffusion Transformer

cs.CV · 2025-04-29 · unverdicted · novelty 7.0

ICEdit achieves state-of-the-art instructional image editing in Diffusion Transformers via in-context generation, requiring only 0.1% of prior training data and 1% trainable parameters.

Generative Model Proposal based Particle Filtering for Data Assimilation

cs.LG · 2026-07-01 · unverdicted · novelty 6.0

FPPF uses a learned conditional generative proposal approximating the optimal proposal in particle filters, with tractable likelihoods for Bayesian updates and localization for high dimensions, outperforming baselines on nonlinear non-Gaussian systems.

Few-Step Boltzmann Generators via Scalable Likelihood Flow Maps

cs.LG · 2026-06-27 · unverdicted · novelty 6.0

SCALLOP replaces Hutchinson's trace estimator with a scalable, vectorized likelihood distillation objective for F2D2 flow maps, cutting training variance and time while improving performance on molecular Boltzmann generators and image data.

DiffusionBench: On Holistic Evaluation of Diffusion Transformers

cs.CV · 2026-06-23 · conditional · novelty 6.0

NanoGen unifies DiT training on ImageNet and T2I, reveals negative Pearson correlations (-0.377 to -0.580) in method rankings across metrics from 21 models, and motivates DiffusionBench for holistic evaluation.

Posterior Refinement: Fast Language Generation via Any-Order Flow Maps

cs.CL · 2026-06-23 · unverdicted · novelty 6.0

FMLM+ with Posterior Refinement bridges masked diffusion and flow map models to match discrete baseline quality in language generation using 32x fewer neural function evaluations via posterior scoring and refinement.

Flash-WAM: Modality-Aware Distillation for World Action Models

cs.LG · 2026-06-03 · unverdicted · novelty 6.0

Flash-WAM introduces modality-specific consistency parametrizations to distill joint video-action diffusion models to single-step inference, delivering 23x speedup with preserved benchmark performance.

MORPHOS: Autoregressive 4D Generation with Temporal Structured Latents

cs.CV · 2026-06-01 · unverdicted · novelty 6.0

MORPHOS introduces an autoregressive 4D generation method with Temporal Structured Latents (T-SLAT) that produces dynamic 3D assets from videos while handling topological changes and long sequences.

Implicit Drifting Policy: One-Step Action Generation via Conditional Expert Geometry

cs.RO · 2026-05-31 · unverdicted · novelty 6.0

IDP generates one-step robot actions by adaptively weighting a scalar potential objective using conditional expert geometry derived from local variations of observation-similar expert actions, combined with expert-proximal terminal evaluation.

Stochastic Lifting for Generating Trajectories of Stochastic Physical Systems

cs.LG · 2026-05-28 · unverdicted · novelty 6.0

Stochastic Lifting adds random labels to training transitions to train a regression model that generates diverse stochastic trajectories without collapsing to mean predictions.

Gamma-World: Generative Multi-Agent World Modeling Beyond Two Players

cs.CV · 2026-05-27 · unverdicted · novelty 6.0

A multi-agent video world model using simplex rotary agent encoding and sparse hub attention achieves better fidelity, controllability, and consistency than baselines while generalizing from 2 to 4 players.

MRT: Masked Region Transformer for Layered Image Generation and Editing at Scale

cs.CV · 2026-05-26 · unverdicted · novelty 6.0

Presents MRT, a 20B-parameter masked region diffusion model unifying text-to-layers, image-to-layers, and layers-to-layers tasks with an overflow-aware canvas layer for complete editable outputs.

citing papers explorer

Showing 21 of 21 citing papers after filters.

MUSE: Unlocking Timestep as Native Task Steering for One-Step Dense Prediction cs.CV · 2026-06-29 · unverdicted · none · ref 27 · internal anchor
MUSE shows that the native timestep embedding in diffusion models acts as a parameter-free steering signal for multi-task monocular depth and normal estimation via manifold decoupling in latent space.
HASTE: Training-Free Video Diffusion Acceleration via Head-Wise Adaptive Sparse Attention cs.CV · 2026-05-14 · unverdicted · none · ref 6 · internal anchor
HASTE delivers up to 1.93x speedup on Wan2.1 video DiTs via head-wise adaptive sparse attention using temporal mask reuse and error-guided per-head calibration while preserving video quality.
VOSR: A Vision-Only Generative Model for Image Super-Resolution cs.CV · 2026-04-03 · conditional · none · ref 14 · internal anchor
VOSR shows that competitive generative image super-resolution with faithful structures can be achieved by training a diffusion-style model from scratch on visual data alone, using a vision encoder for guidance and a restoration-oriented sampling strategy.
DisCa: Accelerating Video Diffusion Transformers with Distillation-Compatible Learnable Feature Caching cs.CV · 2026-02-05 · unverdicted · none · ref 12 · internal anchor
DisCa replaces heuristic feature caching with a lightweight learnable neural predictor compatible with distillation, achieving 11.8× acceleration on video diffusion transformers with preserved generation quality.
In-Context Edit: Enabling Instructional Image Editing with In-Context Generation in Large Scale Diffusion Transformer cs.CV · 2025-04-29 · unverdicted · none · ref 44 · internal anchor
ICEdit achieves state-of-the-art instructional image editing in Diffusion Transformers via in-context generation, requiring only 0.1% of prior training data and 1% trainable parameters.
DiffusionBench: On Holistic Evaluation of Diffusion Transformers cs.CV · 2026-06-23 · conditional · none · ref 138 · internal anchor
NanoGen unifies DiT training on ImageNet and T2I, reveals negative Pearson correlations (-0.377 to -0.580) in method rankings across metrics from 21 models, and motivates DiffusionBench for holistic evaluation.
MORPHOS: Autoregressive 4D Generation with Temporal Structured Latents cs.CV · 2026-06-01 · unverdicted · none · ref 10 · internal anchor
MORPHOS introduces an autoregressive 4D generation method with Temporal Structured Latents (T-SLAT) that produces dynamic 3D assets from videos while handling topological changes and long sequences.
Gamma-World: Generative Multi-Agent World Modeling Beyond Two Players cs.CV · 2026-05-27 · unverdicted · none · ref 12 · internal anchor
A multi-agent video world model using simplex rotary agent encoding and sparse hub attention achieves better fidelity, controllability, and consistency than baselines while generalizing from 2 to 4 players.
MRT: Masked Region Transformer for Layered Image Generation and Editing at Scale cs.CV · 2026-05-26 · unverdicted · none · ref 12 · internal anchor
Presents MRT, a 20B-parameter masked region diffusion model unifying text-to-layers, image-to-layers, and layers-to-layers tasks with an overflow-aware canvas layer for complete editable outputs.
Efficient Image Synthesis with Sphere Latent Encoder cs.CV · 2026-05-15 · unverdicted · none · ref 7 · internal anchor
Decouples Sphere Encoder into fixed pretrained encoder and spherical latent denoiser, yielding higher quality and faster inference than the joint original on Animal-Faces, Oxford-Flowers and ImageNet-1K.
Mutual Forcing: Dual-Mode Self-Evolution for Fast Autoregressive Audio-Video Character Generation cs.CV · 2026-04-28 · unverdicted · none · ref 11 · internal anchor
Mutual Forcing trains a single native autoregressive audio-video model with mutually reinforcing few-step and multi-step modes via self-distillation to match 50-step baselines at 4-8 steps.
Self-Adversarial One Step Generation via Condition Shifting cs.CV · 2026-04-14 · unverdicted · none · ref 6 · internal anchor
APEX derives self-adversarial gradients from condition-shifted velocity fields in flow models to achieve high-fidelity one-step generation, outperforming much larger models and multi-step teachers.
Live Avatar: Streaming Real-time Audio-Driven Avatar Generation with Infinite Length cs.CV · 2025-12-04 · conditional · none · ref 12 · internal anchor
Live Avatar enables 45 FPS real-time streaming infinite-length audio-driven avatar generation from a 14B diffusion model via distillation and timestep-forcing pipeline parallelism.
SAM 3D: 3Dfy Anything in Images cs.CV · 2025-11-20 · unverdicted · none · ref 9 · internal anchor
SAM 3D reconstructs 3D objects from single images with geometry, texture, and pose using human-model annotated data at scale and synthetic-to-real training, achieving 5:1 human preference wins.
PixelWizard: Towards Efficient High-Fidelity Video Generation at Ultra-Large Spatial Resolution cs.CV · 2026-05-25 · unverdicted · none · ref 39 · internal anchor
PixelWizard decouples global structure from fine details via a spatiotemporal anchor and introduces Noise-Span Aligned Shortcut Training with biased sampling to achieve over 10x faster sampling for high-fidelity 2K/4K video generation.
Focused Forcing: Content-Aware Per-Frame KV Selection for Efficient Autoregressive Video Diffusion cs.CV · 2026-05-18 · unverdicted · none · ref 14 · internal anchor
Focused Forcing is a training-free per-frame KV selection method that combines attention scores with diversity metrics and head-importance estimation to accelerate autoregressive video diffusion up to 1.48x while improving quality.
Visual Generation in the New Era: An Evolution from Atomic Mapping to Agentic World Modeling cs.CV · 2026-04-30 · unverdicted · none · ref 20 · internal anchor
Visual generation models are evolving from passive renderers to interactive agentic world modelers, but current systems lack spatial reasoning, temporal consistency, and causal understanding, with evaluations overemphasizing perceptual quality.
Stabilizing, Scaling & Enhancing MeanFlow for Large-scale Diffusion Distillation cs.CV · 2026-05-18 · unverdicted · none · ref 6 · internal anchor
Stabilizes MeanFlow for large-scale diffusion distillation via discrete warm-up and trajectory alignment, reporting better results on FLUX.1-dev and HunyuanImage 3.0.
Salt: Self-Consistent Distribution Matching with Cache-Aware Training for Fast Video Generation cs.CV · 2026-04-03 · unreviewed · ref 7 · internal anchor
Dual-End Consistency Model cs.CV · 2026-02-11 · unreviewed · ref 9 · internal anchor
WorldPlay: Towards Long-Term Geometric Consistency for Real-Time Interactive World Modeling cs.CV · 2025-12-16 · unreviewed · ref 12 · internal anchor

One Step Diffusion via Shortcut Models

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer