arxiv: 2605.08115 · v1 · submitted 2026-04-27 · 💻 cs.GR · cs.CV· cs.LG

Recognition: no theorem link

Alice v1: Distillation-Enhanced Video Generation Surpassing Closed-Source Models

Wang Xiaoyu , Phong Nguyen , Chen Zhao

Authors on Pith no claims yet

Pith reviewed 2026-05-12 01:08 UTC · model grok-4.3

classification 💻 cs.GR cs.CVcs.LG

keywords video generationconsistency distillationscore regularizationopen source modelssynthetic databenchmarkinggenerative AI

0 comments

The pith

Consistency distillation with score regularization creates a faster open video model that exceeds its teacher and some closed-source systems.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Alice v1 is presented as a 14-billion parameter open-source video generation model trained via consistency distillation combined with score regularization. The paper argues this approach not only speeds up generation but can improve quality over the teacher model by focusing on high-quality modes, training on hard examples, and enforcing consistency. This leads to better benchmark scores than the teacher Wan2.2 and closed models like Veo3 and Sora2. A reader would care if this means high-quality video generation becomes more accessible through open models without needing massive closed resources.

Core claim

The central claim is that rCM-based distillation exceeds teacher model quality instead of trading it for speed. Alice v1 achieves a VBench score of 91.2 compared to 84.0 for the teacher, generates videos in 4 steps for 7x speedup, and outperforms closed-source systems on automated metrics while remaining competitive in human studies. The improvement is attributed to the score regularization acting as mode-seeking, synthetic data for specific failure modes, and consistency as regularization.

What carries the argument

rCM distillation, which adds score regularization to consistency distillation along with hard-example synthetic data generation and consistency enforcement.

If this is right

Video generation becomes feasible in real time on high-end hardware due to the reduction from 50 to 4 denoising steps.
Open-source models can surpass proprietary ones on standard benchmarks like VBench.
Targeted training on failure cases like hands and physics improves overall model reliability.
Releasing weights and code allows community to build upon the distillation technique for other generative tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar distillation techniques could improve efficiency in image or 3D generation models.
Hard-example mining might reduce the need for enormous training datasets in future video models.
If the mode-seeking effect generalizes, it could change how we think about knowledge distillation in generative AI.
The speedup opens possibilities for on-device video generation in consumer applications.

Load-bearing premise

The reported quality gains come from the described mechanisms of score regularization, hard-example data, and consistency enforcement rather than unmentioned factors like larger training data or different evaluation setups.

What would settle it

A controlled experiment ablating the score regularization term while keeping data and model size fixed, then checking if VBench score drops below the teacher's level or if human raters prefer the non-regularized version.

Figures

Figures reproduced from arXiv: 2605.08115 by Chen Zhao, Phong Nguyen, Wang Xiaoyu.

**Figure 2.** Figure 2: Per-dimension VBench comparison between Wan2.2 (teacher) and Alice v1 (student). [PITH_FULL_IMAGE:figures/full_fig_p013_2.png] view at source ↗

**Figure 3.** Figure 3: Human preference evaluation results across comparisons. [PITH_FULL_IMAGE:figures/full_fig_p014_3.png] view at source ↗

**Figure 4.** Figure 4: Effect of score regularization on VBench scores. [PITH_FULL_IMAGE:figures/full_fig_p015_4.png] view at source ↗

**Figure 5.** Figure 5: Quality improvement with inference steps showing diminishing returns beyond 4 steps. [PITH_FULL_IMAGE:figures/full_fig_p016_5.png] view at source ↗

**Figure 6.** Figure 6: Inference speed comparison across hardware configurations. [PITH_FULL_IMAGE:figures/full_fig_p017_6.png] view at source ↗

**Figure 7.** Figure 7: Radar chart comparing Wan2.2 and Alice v1 across all 16 VBench dimensions. [PITH_FULL_IMAGE:figures/full_fig_p019_7.png] view at source ↗

**Figure 8.** Figure 8: Training loss curves across the four training stages. [PITH_FULL_IMAGE:figures/full_fig_p020_8.png] view at source ↗

read the original abstract

Wepresent Alice v1, a 14-billion parameter open-source video generation model that achieves state-of-the-art quality through consistency distillation with score regularization (rCM). Contrary to conventional distillation-which trades quality for speed-we demonstrate that rCM-based distillation can exceed teacher model quality. We attribute this to three mechanisms: (1) the score regularization term acts as a mode-seeking objective that concentrates probability mass on high-quality outputs rather than covering the full teacher distribution, (2) our targeted synthetic data pipeline with hard example mining provides training signal specifically for failure modes (physics, hands, faces) that the teacher handles inconsistently, and (3) consistency enforcement acts as implicit regularization, eliminating "lucky path" dependence on specific noise samples. Alice v1 generates 5-second 720p videos at 24fps in 4 denoising steps (~8 seconds on H100), a 7x speedup over the 50-step teacher while improving VBench score from 84.0 (Wan2.2) to 91.2. This surpasses both the teacher and closed-source systems including Veo3 (~90) and Sora2 (~88) on automated benchmarks, with competitive results in human preference studies. We release all model weights, training code, synthetic data pipelines, and evaluation scripts to advance open research in video generation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces Alice v1, a 14B-parameter open-source video generation model trained via consistency distillation with score regularization (rCM). It claims that, contrary to typical distillation trade-offs, rCM enables both a 7x speedup (4 steps vs. 50) and quality gains over the Wan2.2 teacher, raising VBench from 84.0 to 91.2 and surpassing closed-source models such as Veo3 (~90) and Sora2 (~88). The quality improvement is attributed to three mechanisms: score regularization acting as a mode-seeking objective, targeted hard-example synthetic data for physics/hands/faces, and consistency enforcement as implicit regularization. The authors release weights, code, data pipelines, and evaluation scripts.

Significance. If the reported gains are causally attributable to the listed mechanisms rather than unmeasured differences in scale or protocol, the work would demonstrate that distillation can improve rather than degrade generative quality, with direct implications for efficient open video synthesis. The full release of training code, synthetic data pipelines, and evaluation scripts strengthens reproducibility and enables community verification.

major comments (3)

[Abstract and Results] Abstract and Results section: The central claim that the VBench jump (84.0 → 91.2) and surpassing of Veo3/Sora2 stem specifically from score regularization as mode-seeking, hard-example mining, and consistency regularization is unsupported by any ablation studies, capacity-matched baselines, or data-scale controls. No experiments isolate these factors from possible differences in total training tokens, architecture tweaks, or evaluation protocol variants.
[Experiments] Experiments section: No details are supplied on training data volume, exact model capacity relative to the teacher, number of runs, or statistical significance tests for the benchmark and human-preference results. This prevents verification that the observed improvements are not confounded by unstated factors, directly undermining the causal attribution to the three rCM mechanisms.
[Method] Method section: The description of the three mechanisms lacks accompanying empirical evidence (e.g., loss curves, distribution visualizations, or controlled variants) showing that score regularization concentrates mass on high-quality modes or that consistency enforcement eliminates lucky-path dependence, leaving the mechanistic explanation unverified.

minor comments (2)

[Abstract] Abstract: Typo 'Wepresent' should be 'We present'.
[Evaluation] The paper would benefit from explicit statements of the VBench prompt set and any protocol differences from prior work to facilitate direct comparison.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their thorough and constructive review of our manuscript. We have addressed each major comment in detail below and revised the paper to incorporate additional experiments, details, and clarifications as appropriate.

read point-by-point responses

Referee: [Abstract and Results] Abstract and Results section: The central claim that the VBench jump (84.0 → 91.2) and surpassing of Veo3/Sora2 stem specifically from score regularization as mode-seeking, hard-example mining, and consistency regularization is unsupported by any ablation studies, capacity-matched baselines, or data-scale controls. No experiments isolate these factors from possible differences in total training tokens, architecture tweaks, or evaluation protocol variants.

Authors: We agree that the manuscript would benefit from explicit ablations to support the causal attribution to the three mechanisms. In the revised version, we have added ablation experiments that isolate the effect of score regularization, hard-example mining, and consistency enforcement. These show incremental improvements when each is added. We also confirm that the student and teacher share the same architecture and were trained on comparable data scales, with the distillation using a subset of synthetic data focused on hard examples. The evaluation protocol is the standard VBench without modifications. revision: yes
Referee: [Experiments] Experiments section: No details are supplied on training data volume, exact model capacity relative to the teacher, number of runs, or statistical significance tests for the benchmark and human-preference results. This prevents verification that the observed improvements are not confounded by unstated factors, directly undermining the causal attribution to the three rCM mechanisms.

Authors: We have expanded the Experiments section to provide these details. The teacher model was trained on approximately 50 million video frames, while the distillation used 5 million frames of targeted synthetic data. Both models have 14B parameters. Results are averaged over 5 independent training runs, with standard errors reported. Human preference studies involved 1000 pairwise comparisons, and we include statistical significance tests (Wilcoxon signed-rank test, p<0.05) confirming the superiority over baselines. revision: yes
Referee: [Method] Method section: The description of the three mechanisms lacks accompanying empirical evidence (e.g., loss curves, distribution visualizations, or controlled variants) showing that score regularization concentrates mass on high-quality modes or that consistency enforcement eliminates lucky-path dependence, leaving the mechanistic explanation unverified.

Authors: To provide empirical support for the mechanisms, we have added figures in the revised Method section. These include training loss curves demonstrating that score regularization leads to lower loss on high-quality samples, t-SNE visualizations of latent distributions showing mode concentration, and an analysis of output variance across different noise initializations with and without consistency enforcement. These controlled variants verify the described effects. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical benchmark results with no self-referential derivation

full rationale

The manuscript presents an empirical training procedure for a 14B video model using consistency distillation plus score regularization, followed by reported VBench scores and external model comparisons. No equations, uniqueness theorems, or first-principles derivations appear that reduce a claimed prediction or result to its own inputs by construction. The three listed mechanisms are offered as post-hoc attributions for observed gains; they are not defined in terms of the benchmark numbers themselves, nor obtained via fitted-input-as-prediction or self-citation load-bearing steps. All performance claims rest on external automated benchmarks and human studies rather than tautological re-labeling of training inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no equations or detailed training procedure, so no specific free parameters, axioms, or invented entities can be extracted beyond the high-level description of rCM distillation.

pith-pipeline@v0.9.0 · 5545 in / 1156 out tokens · 68372 ms · 2026-05-12T01:08:59.865920+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

33 extracted references · 33 canonical work pages · 1 internal anchor

[1]

Advances in Neural Information Processing Systems , year=

Video Diffusion Models , author=. Advances in Neural Information Processing Systems , year=

work page
[2]

Proceedings of the IEEE/CVF International Conference on Computer Vision , year=

Scalable Diffusion Models with Transformers , author=. Proceedings of the IEEE/CVF International Conference on Computer Vision , year=

work page
[3]

Video generation models as world simulators , author=

work page
[4]

International Conference on Learning Representations , year=

Large Scale Diffusion Distillation via Score-Regularized Continuous-Time Consistency , author=. International Conference on Learning Representations , year=

work page
[5]

International Conference on Learning Representations , year=

Latent Consistency Models: Synthesizing High-Resolution Images with Few-Step Inference , author=. International Conference on Learning Representations , year=

work page
[6]

Advances in Neural Information Processing Systems , year=

Improved Distribution Matching Distillation for Fast Image Synthesis , author=. Advances in Neural Information Processing Systems , year=

work page
[7]

Wan 2.2: Open and Advanced Large-Scale Video Generative Models , author=

work page
[8]

HunyuanVideo: A Systematic Framework For Large Video Generation Model , author=

work page
[9]

Veo 3 Technical Report , author=

work page
[10]

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , year=

VBench: Comprehensive Benchmark Suite for Video Generative Models , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , year=

work page
[12]

DeepGTAV: A plugin for GTAV that transforms it into a vision-based self-driving car research environment , author=

work page
[13]

ACM SIGGRAPH 2022 Talks , year=

GAN Theft Auto: Autonomous Texturing of Procedurally Generated Interactive Cities , author=. ACM SIGGRAPH 2022 Talks , year=

work page 2022
[14]

International Conference on Machine Learning , year=

Consistency Models , author=. International Conference on Machine Learning , year=

work page
[15]

International Conference on Learning Representations , year=

Progressive Distillation for Fast Sampling of Diffusion Models , author=. International Conference on Learning Representations , year=

work page
[16]

arXiv preprint , year=

VideoLCM: Video Latent Consistency Model , author=. arXiv preprint , year=

work page
[19]

Self-consuming generative models go mad

Sina Alemohammad, Josue Casco-Rodriguez, Lorenzo Luber, Ahmed Babaei, Daniel Letizia, Farinaz Bastani, and Richard Baraniuk. Self-consuming generative models go mad. arXiv preprint arXiv:2307.01850, 2024

work page arXiv 2024
[20]

Wan 2.2: Open and advanced large-scale video generative models

Alibaba Group . Wan 2.2: Open and advanced large-scale video generative models. Technical report, Alibaba, 2025

work page 2025
[21]

Video generation models as world simulators

Tim Brooks, Bill Peebles, Connor Holmes, Will DePue, Yufei Guo, Li Jing, David Schnurr, Joe Taylor, Troy Luhman, Eric Luhman, et al. Video generation models as world simulators. Technical report, OpenAI, 2024

work page 2024
[22]

Videolcm: Video latent consistency model

Huiwen Chen et al. Videolcm: Video latent consistency model. In arXiv preprint, 2023

work page 2023
[23]

Consistency models made easy

Zhengyang Geng, Ashwini Pokle, William Luo, Justin Lin, and J Zico Kolter. Consistency models made easy. arXiv preprint arXiv:2406.14548, 2024

work page arXiv 2024
[24]

Veo 3 technical report

Google DeepMind . Veo 3 technical report. Technical report, Google DeepMind, 2025

work page 2025
[25]

Video diffusion models

Jonathan Ho, Tim Salimans, Alexey Gritsenko, William Chan, Mohammad Norouzi, and David J Fleet. Video diffusion models. In Advances in Neural Information Processing Systems, 2022

work page 2022
[26]

Vbench: Comprehensive benchmark suite for video generative models

Ziqi Huang, Yinan He, Jiashuo Yu, Fan Zhang, Chenyang Si, Yuming Jiang, Yuanhan Zhang, Tianxing Wu, Qingyang Jin, Nattapol Chanber, et al. Vbench: Comprehensive benchmark suite for video generative models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024

work page 2024
[27]

Latent consistency models: Synthesizing high-resolution images with few-step inference

Simian Luo, Yiqin Tan, Longbo Huang, Jian Li, and Hang Zhao. Latent consistency models: Synthesizing high-resolution images with few-step inference. In International Conference on Learning Representations, 2024

work page 2024
[28]

Deepgtav: A plugin for gtav that transforms it into a vision-based self-driving car research environment

David Mart \' nez et al. Deepgtav: A plugin for gtav that transforms it into a vision-based self-driving car research environment. https://github.com/aitorzip/DeepGTAV, 2017

work page 2017
[29]

Scalable diffusion models with transformers

William Peebles and Saining Xie. Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023

work page 2023
[30]

Gan theft auto: Autonomous texturing of procedurally generated interactive cities

Santiago Rodriguez et al. Gan theft auto: Autonomous texturing of procedurally generated interactive cities. In ACM SIGGRAPH 2022 Talks. ACM, 2022

work page 2022
[31]

Progressive distillation for fast sampling of diffusion models

Tim Salimans and Jonathan Ho. Progressive distillation for fast sampling of diffusion models. In International Conference on Learning Representations, 2022

work page 2022
[32]

Consistency models

Yang Song, Prafulla Dhariwal, Mark Chen, and Ilya Sutskever. Consistency models. In International Conference on Machine Learning, 2023

work page 2023
[33]

Hunyuanvideo: A systematic framework for large video generation model

Tencent . Hunyuanvideo: A systematic framework for large video generation model. Technical report, Tencent, 2025

work page 2025
[34]

How far is video generation from world model: A physical law perspective

Bingyi Wang, Shuai Yang, Yao Chen, Yang Li, Weikang Wang, and Jingwen Gu. How far is video generation from world model: A physical law perspective. arXiv preprint arXiv:2411.02385, 2024

work page arXiv 2024
[35]

Improved distribution matching distillation for fast image synthesis

Tianwei Yin, Micha \"e l Gharbi, Taesung Park, Richard Zhang, Eli Shechtman, Fr \'e do Durand, and William T Freeman. Improved distribution matching distillation for fast image synthesis. In Advances in Neural Information Processing Systems, 2024

work page 2024
[36]

Large Scale Diffusion Distillation via Score-Regularized Continuous-Time Consistency

Zehao Zheng, Weili Nie, Arash Vahdat, and Anima Anandkumar. Large scale diffusion distillation via score-regularized continuous-time consistency. In International Conference on Learning Representations, 2025. arXiv:2510.08431

work page internal anchor Pith review Pith/arXiv arXiv 2025