Step-Video-T2V Technical Report: The Practice, Challenges, and Future of Video Foundation Model

Guoqing Ma, Haoyang Huang, Kun Yan, Liangyu Chen, Nan Duan, Shengming Yin · 2025 · cs.CV · arXiv 2502.10248

Mixed citation behavior. Most common role is background (60%).

34 Pith papers citing it

Background 60% of classified citations

abstract

We present Step-Video-T2V, a state-of-the-art text-to-video pre-trained model with 30B parameters and the ability to generate videos up to 204 frames in length. A deep compression Variational Autoencoder, Video-VAE, is designed for video generation tasks, achieving 16x16 spatial and 8x temporal compression ratios, while maintaining exceptional video reconstruction quality. User prompts are encoded using two bilingual text encoders to handle both English and Chinese. A DiT with 3D full attention is trained using Flow Matching and is employed to denoise input noise into latent frames. A video-based DPO approach, Video-DPO, is applied to reduce artifacts and improve the visual quality of the generated videos. We also detail our training strategies and share key observations and insights. Step-Video-T2V's performance is evaluated on a novel video generation benchmark, Step-Video-T2V-Eval, demonstrating its state-of-the-art text-to-video quality when compared with both open-source and commercial engines. Additionally, we discuss the limitations of the current diffusion-based model paradigm and outline future directions for video foundation models. We make both Step-Video-T2V and Step-Video-T2V-Eval available at https://github.com/stepfun-ai/Step-Video-T2V. The online version can be accessed from https://yuewen.cn/videos as well. Our goal is to accelerate the innovation of video foundation models and empower video content creators.
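
To make the compression figures in the abstract concrete, the following is a minimal arithmetic sketch, not the authors' code. Only the 16x16 spatial ratio, the 8x temporal ratio, and the 204-frame length come from the abstract. The 544x992 resolution, the round-up behaviour for partial blocks, and the function names are illustrative assumptions.

```python
import math

SPATIAL = 16   # per-axis spatial compression ratio (from the abstract)
TEMPORAL = 8   # temporal compression ratio (from the abstract)


def latent_shape(frames, height, width):
    """Return (latent_frames, latent_height, latent_width).

    Rounding up on partial blocks is an assumption made for this sketch;
    the abstract does not say how boundary frames or pixels are handled.
    """
    return (
        math.ceil(frames / TEMPORAL),
        math.ceil(height / SPATIAL),
        math.ceil(width / SPATIAL),
    )


def positions_per_latent():
    """Number of pixel positions (across space and time) mapped to one latent position."""
    return SPATIAL * SPATIAL * TEMPORAL


# Hypothetical input: 204 frames at an assumed 544x992 resolution.
t, h, w = latent_shape(204, 544, 992)
print(f"latent grid: {t} x {h} x {w} = {t * h * w} positions")
print(f"pixel positions per latent position: {positions_per_latent()}")
```

Under these assumptions the latent grid has roughly 2048 times fewer positions than the pixel grid, which gives a sense of why full 3D attention over the latent frames is feasible at this video length.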

citation-role summary

background 7 · baseline 3

citation-polarity summary

background 6 · baseline 3 · unclear 1

representative citing papers

Diffusing in the Right Space: A Systematic Study of Latent Diffusability

cs.CV · 2026-06-02 · unverdicted · novelty 7.0

A large-scale empirical study across tokenizers and diffusion backbones identifies Velocity Irreducible Variance (VIV) as one of the most stable predictors of latent diffusion generation quality.

Future Forcing: Future-aware Training-free KV Cache Policy for Autoregressive Video Generation

cs.CV · 2026-05-28 · unverdicted · novelty 7.0

Future Forcing constructs a future query proxy from historical pre-RoPE statistics to score and merge KV tokens, improving subject consistency by up to 1.49 on VBench-Long for 60s AR video generation.

MechVerse: Evaluating Physical Motion Consistency in Video Generation Models

cs.CV · 2026-05-14 · unverdicted · novelty 7.0

MechVerse benchmark shows current video generation models preserve appearance but fail at mechanically admissible motion, with errors rising as coupling complexity increases.

HASTE: Training-Free Video Diffusion Acceleration via Head-Wise Adaptive Sparse Attention

cs.CV · 2026-05-14 · unverdicted · novelty 7.0

HASTE delivers up to 1.93x speedup on Wan2.1 video DiTs via head-wise adaptive sparse attention using temporal mask reuse and error-guided per-head calibration while preserving video quality.

Offline Preference Optimization for Rectified Flow with Noise-Tracked Pairs

cs.CV · 2026-05-10 · unverdicted · novelty 7.0

PNAPO augments preference data with prior noise pairs and uses straight-line interpolation to create a tighter surrogate objective for offline alignment of rectified flow models.

Efficient Video Diffusion Models: Advancements and Challenges

cs.CV · 2026-04-17 · unverdicted · novelty 7.0

A survey that groups efficient video diffusion methods into four paradigms—step distillation, efficient attention, model compression, and cache/trajectory optimization—and outlines open challenges for practical use.

VideoASMR-Bench: Can AI-Generated ASMR Videos Fool VLMs and Humans?

cs.CV · 2025-12-15 · unverdicted · novelty 7.0

VideoASMR-Bench shows state-of-the-art VLMs fail to reliably detect AI-generated ASMR videos from real ones, though humans can still identify the fakes relatively easily.

GenHSI: Controllable Generation of Human-Scene Interaction Videos

cs.CV · 2025-06-24 · unverdicted · novelty 7.0

GenHSI is a training-free three-stage pipeline that turns a scene image, character image, and complex HSI prompt into long videos with plausible chained interactions by generating atomic actions, 3D keyframes via 2D inpainting plus optimization, and then feeding them to pre-trained video diffusion.

Ink3D: Sculpting 3D Assets with Extremely Complex Textures via Video Generative Models

cs.CV · 2026-07-01 · unverdicted · novelty 6.0

Ink3D decouples geometry from texture by generating dense orbit videos with a conditional video model and baking them via a neural optimizer to produce complex 3D textures.

ARGUS: Stacked Multi-View Identity Mosaic Injection for Subject-Preserving Video Generation

cs.CV · 2026-06-10 · unverdicted · novelty 6.0

ARGUS converts MLLM-selected identity evidence into a synchronized 3x3 mosaic injected as negative-time memory in a diffusion model, plus supporting training techniques, to achieve SOTA subject preservation on human video benchmarks.

Veda: Scalable Video Diffusion via Distilled Sparse Attention

cs.CV · 2026-05-28 · unverdicted · novelty 6.0

Veda formulates tile selection in video diffusion attention as a reconstruction problem from full attention maps, using statistics-aware and head-aware scoring to enable high sparsity with maintained quality and hardware speedups up to 5.1x end-to-end.

Lance: Unified Multimodal Modeling by Multi-Task Synergy

cs.CV · 2026-05-18 · unverdicted · novelty 6.0 · 2 refs

Lance presents a dual-stream mixture-of-experts model with modality-aware positional encoding and staged multi-task training that outperforms prior open-source unified models on image and video generation while keeping strong understanding performance.

AtlasVid: Efficient Ultra-High-Resolution Long Video Generation via Decoupled Global-Local Modeling

cs.CV · 2026-05-15 · unverdicted · novelty 6.0

AtlasVid proposes a decoupled global-local diffusion framework that trains at low resolution with LoRA and generalizes to ultra-high-resolution long video synthesis via semantic proxy guidance and locality-preserving attention.

RAVEN: Real-time Autoregressive Video Extrapolation with Consistency-model GRPO

cs.CV · 2026-05-14 · unverdicted · novelty 6.0

RAVEN aligns training and inference for causal autoregressive video diffusion via interleaved rollout repacking and introduces CM-GRPO for direct RL on consistency-model kernels, claiming better quality than recent baselines.

Qwen-Image-VAE-2.0 Technical Report

cs.CV · 2026-05-13 · unverdicted · novelty 6.0

Qwen-Image-VAE-2.0 achieves state-of-the-art high-compression image reconstruction and superior diffusability for diffusion models, with a new text-rich document benchmark.

HorizonDrive: Self-Corrective Autoregressive World Model for Long-horizon Driving Simulation

cs.CV · 2026-05-12 · unverdicted · novelty 6.0 · 2 refs

HorizonDrive is a new anti-drifting autoregressive training and distillation method that enables minute-scale stable driving video rollouts by making the teacher model rollout-capable via scheduled rollout recovery and teacher rollout DMD.

Leveraging Verifier-Based Reinforcement Learning in Image Editing

cs.CV · 2026-04-30 · unverdicted · novelty 6.0 · 2 refs

Edit-R1 builds a CoT-based reasoning reward model (RRM) via SFT and GCPO, then applies it with GRPO to improve image editing models such as FLUX.1-kontext.

DreamShot: Personalized Storyboard Synthesis with Video Diffusion Prior

cs.CV · 2026-04-19 · unverdicted · novelty 6.0

DreamShot uses video diffusion priors and a role-attention consistency loss to produce coherent, personalized storyboards with better character and scene continuity than text-to-image methods.

SynthForensics: Benchmarking and Evaluating People-Centric Synthetic Video Deepfakes

cs.CV · 2026-02-04 · unverdicted · novelty 6.0

SynthForensics is a people-centric benchmark where face-based detectors lose 13-55 AUC points on modern synthetic videos compared to legacy manipulation sets.

HunyuanVideo 1.5 Technical Report

cs.CV · 2025-11-24 · unverdicted · novelty 6.0

HunyuanVideo 1.5 delivers state-of-the-art open-source text-to-video and image-to-video generation with an 8.3B parameter DiT model featuring SSTA attention, glyph-aware encoding, and progressive training.

Enhancing Physical Plausibility in Video Generation by Reasoning the Implausibility

cs.CV · 2025-09-29 · unverdicted · novelty 6.0

A training-free framework uses physics-violating counterfactual prompts and Synchronized Decoupled Guidance to suppress implausible motions in diffusion-based video generation while preserving photorealism.

Listener-Rewarded Thinking in VLMs for Image Preferences

cs.CV · 2025-06-28 · unverdicted · novelty 6.0

Listener-augmented GRPO uses an independent frozen VLM to provide dense confidence scores on reasoning traces, yielding 67.4% accuracy on ImageReward, up to +6% OOD gains on 1.2M-vote human data, and fewer reasoning contradictions.

MAGI-1: Autoregressive Video Generation at Scale

cs.CV · 2025-05-19 · unverdicted · novelty 6.0

MAGI-1 is a 24B-parameter autoregressive video world model that predicts denoised frame chunks sequentially with increasing noise to enable causal, scalable, streaming generation up to 4M token contexts.

VBench-2.0: Advancing Video Generation Benchmark Suite for Intrinsic Faithfulness

cs.CV · 2025-03-27 · accept · novelty 6.0

VBench-2.0 is a benchmark suite that automatically evaluates video generative models on five dimensions of intrinsic faithfulness: Human Fidelity, Controllability, Creativity, Physics, and Commonsense using VLMs, LLMs, and anomaly detection methods.

citing papers explorer

Showing 1 of 1 citing paper after filters.

VBench-2.0: Advancing Video Generation Benchmark Suite for Intrinsic Faithfulness

cs.CV · 2025-03-27 · accept · none · ref 62 · internal anchor

VBench-2.0 is a benchmark suite that automatically evaluates video generative models on five dimensions of intrinsic faithfulness: Human Fidelity, Controllability, Creativity, Physics, and Commonsense using VLMs, LLMs, and anomaly detection methods.
