HunyuanVideo 1.5 Technical Report
Recognition: 2 theorem links · Lean Theorem
Pith reviewed 2026-05-15 02:24 UTC · model grok-4.3
The pith
An 8.3-billion-parameter model delivers state-of-the-art open-source video generation on consumer hardware.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
HunyuanVideo 1.5 is a lightweight video generation model with 8.3 billion parameters that, through meticulous data curation, an advanced diffusion transformer architecture with selective and sliding tile attention, glyph-aware text encoding for bilingual support, progressive pre- and post-training, and an efficient super-resolution network, establishes superior visual quality and motion coherence for both text-to-video and image-to-video tasks compared to prior open-source models.
What carries the argument
The diffusion transformer (DiT) architecture with selective and sliding tile attention (SSTA), which makes attention over long video token sequences efficient enough to preserve coherence and quality at a low parameter count.
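To make the attention pattern concrete, below is a minimal sketch of the tile-windowed masking idea behind sliding tile attention, applied to tokens on a (T, H, W) latent grid. The tile sizes and window radius are illustrative assumptions, and the selective component and the fused kernels the report relies on are omitted; this is not the authors' implementation.

```python
# Illustrative sketch only: a dense boolean mask for tile-windowed attention
# over a small (T, H, W) latent grid. Real SSTA uses block-sparse kernels and
# a selective component not shown here; tile/window sizes below are made up.
import numpy as np

def sliding_tile_mask(t, h, w, tile=(4, 8, 8), window=(1, 1, 1)):
    """Return an (N, N) boolean mask, N = t*h*w tokens in (T, H, W) raster
    order. A query may attend to a key iff their tile indices differ by at
    most `window` tiles along every axis."""
    ti = np.arange(t) // tile[0]          # temporal tile index per frame
    hi = np.arange(h) // tile[1]          # vertical tile index per row
    wi = np.arange(w) // tile[2]          # horizontal tile index per column
    tt, hh, ww = np.meshgrid(ti, hi, wi, indexing="ij")
    tiles = np.stack([tt.ravel(), hh.ravel(), ww.ravel()], axis=-1)   # (N, 3)
    diff = np.abs(tiles[:, None, :] - tiles[None, :, :])              # (N, N, 3)
    return (diff <= np.array(window)).all(axis=-1)                    # (N, N)

mask = sliding_tile_mask(t=8, h=16, w=16)
# Fraction of keys each query can attend to; drops well below 1.0 at realistic grid sizes.
print(mask.shape, mask.mean())
```

A real kernel never materializes this dense mask; it simply restricts attention to neighboring tiles, which is what moves the cost from quadratic toward linear in the number of video tokens.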
If this is right
- High-quality video generation becomes feasible on consumer-grade GPUs without specialized hardware.
- The model supports unified text-to-video and image-to-video generation at multiple durations and resolutions.
- Releasing the code and weights provides a foundation for community-driven improvements and applications in video creation.
- Progressive training stages enable better control over the quality of outputs across different scales.
Where Pith is reading between the lines
- Similar lightweight designs could be adapted for other generative tasks such as audio or 3D content creation to reduce computational demands.
- Broader access might accelerate experimentation in fields like education, marketing, and independent filmmaking.
- The emphasis on bilingual text understanding could improve video generation for non-English languages in global applications.
Load-bearing premise
The internal benchmarks provide unbiased and comprehensive comparisons to existing open-source video models without hidden advantages in training data or evaluation choices.
What would settle it
An independent benchmark evaluation where another open-source video model achieves equal or higher scores in visual quality and motion coherence metrics using the same test sets.
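As a sketch of what such a settling test could look like, the snippet below compares per-prompt scores for HunyuanVideo 1.5 and a challenger on an identical prompt set and bootstraps a confidence interval on the mean difference. The score arrays are random placeholders; a real test would use independently collected VBench-style dimension scores or human-preference ratings.

```python
# Paired comparison on identical prompts, with a bootstrap CI on the mean
# score difference. All scores below are placeholders, not real results.
import numpy as np

def paired_bootstrap_diff(scores_a, scores_b, n_boot=10_000, seed=0):
    """95% CI for mean(scores_a - scores_b) over the same prompts."""
    rng = np.random.default_rng(seed)
    diff = np.asarray(scores_a) - np.asarray(scores_b)
    idx = rng.integers(0, len(diff), size=(n_boot, len(diff)))
    boot_means = diff[idx].mean(axis=1)
    return diff.mean(), np.percentile(boot_means, [2.5, 97.5])

# Placeholder per-prompt scores on an identical prompt set.
hunyuan = np.random.default_rng(1).uniform(0.6, 0.9, size=200)
challenger = np.random.default_rng(2).uniform(0.6, 0.9, size=200)
mean_diff, (lo, hi) = paired_bootstrap_diff(hunyuan, challenger)
print(f"mean diff {mean_diff:+.3f}, 95% CI [{lo:+.3f}, {hi:+.3f}]")
# The SOTA claim would be unsettled (or overturned) if the interval
# includes zero or favors the challenger.
```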
Original abstract
We present HunyuanVideo 1.5, a lightweight yet powerful open-source video generation model that achieves state-of-the-art visual quality and motion coherence with only 8.3 billion parameters, enabling efficient inference on consumer-grade GPUs. This achievement is built upon several key components, including meticulous data curation, an advanced DiT architecture featuring selective and sliding tile attention (SSTA), enhanced bilingual understanding through glyph-aware text encoding, progressive pre-training and post-training, and an efficient video super-resolution network. Leveraging these designs, we developed a unified framework capable of high-quality text-to-video and image-to-video generation across multiple durations and resolutions. Extensive experiments demonstrate that this compact and proficient model establishes a new state-of-the-art among open-source video generation models. By releasing the code and model weights, we provide the community with a high-performance foundation that lowers the barrier to video creation and research, making advanced video generation accessible to a broader audience. All open-source assets are publicly available at https://github.com/Tencent-Hunyuan/HunyuanVideo-1.5.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents HunyuanVideo 1.5, a compact 8.3-billion-parameter open-source DiT-based model for text-to-video and image-to-video generation. It claims state-of-the-art visual quality and motion coherence among open-source models through meticulous data curation, selective and sliding tile attention (SSTA), glyph-aware bilingual text encoding, progressive pre- and post-training, and an efficient super-resolution network, with all code and weights released publicly.
Significance. If the empirical claims hold under fair evaluation, the work supplies a practical, consumer-GPU-friendly foundation model that lowers barriers for video generation research and applications, extending open-source capabilities in a rapidly evolving field.
major comments (2)
- [Abstract] The central SOTA claim among open-source video generators is asserted on the basis of 'extensive experiments', yet no quantitative metrics, specific baselines, VBench or human-preference scores, error bars, or comparison tables appear in the provided text. Without these, the headline superiority cannot be assessed.
- [Abstract / Experiments] The SOTA assertion depends on the fairness of internal comparisons; the manuscript must explicitly document that all baselines were evaluated under identical prompts, resolutions, frame counts, inference steps, and sampling settings, with no undisclosed test-case filtering. Video benchmarks are known to be sensitive to these controls, and the absence of such protocol details is load-bearing for the claim.
minor comments (1)
- [Abstract] The abstract and introduction use the term 'unified framework' without defining its scope or distinguishing it from prior multi-task video models; a brief clarifying sentence would improve precision.
Simulated Author's Rebuttal
We thank the referee for the constructive comments. We agree that the abstract should more explicitly support the SOTA claim with quantitative evidence and that evaluation protocols require fuller documentation. We will revise the manuscript to address both points.
Point-by-point responses
- Referee: [Abstract] The central SOTA claim among open-source video generators is asserted on the basis of 'extensive experiments', yet no quantitative metrics, specific baselines, VBench or human-preference scores, error bars, or comparison tables appear in the provided text. Without these, the headline superiority cannot be assessed.
  Authors: We agree that the abstract would be strengthened by including key quantitative results. In the revised manuscript we will add a concise summary of the main VBench scores, human-preference win rates, and direct comparisons against the leading open-source baselines (with error bars where available). The full tables and detailed results already appear in the Experiments section; the revision will simply surface the headline numbers in the abstract itself. Revision: yes.
- Referee: [Abstract / Experiments] The SOTA assertion depends on the fairness of internal comparisons; the manuscript must explicitly document that all baselines were evaluated under identical prompts, resolutions, frame counts, inference steps, and sampling settings, with no undisclosed test-case filtering. Video benchmarks are known to be sensitive to these controls, and the absence of such protocol details is load-bearing for the claim.
  Authors: We acknowledge the need for explicit protocol transparency. In the revised manuscript we will add a dedicated subsection in Experiments that lists the exact shared settings used for every baseline: identical prompt sets, resolution, frame count, inference steps, sampler, and guidance scale. We confirm that no undisclosed test-case filtering occurred; the revision will state this explicitly and provide the full prompt list and configuration files as supplementary material. Revision: yes.
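To illustrate the kind of protocol subsection promised above, here is a minimal sketch of a single shared configuration applied identically to every model under comparison. All field names, values, and the model.generate / score_video interfaces are hypothetical placeholders, not the report's actual settings or API.

```python
# Hypothetical shared evaluation protocol: every baseline runs under exactly
# these settings, with a fixed prompt list and fixed seeds (no per-model
# overrides, no test-case filtering). Values are illustrative only.
SHARED_EVAL_PROTOCOL = {
    "prompt_file": "shared_prompts.txt",   # identical prompt set for all models
    "resolution": (720, 1280),
    "num_frames": 121,
    "inference_steps": 50,
    "sampler": "flow_euler",
    "guidance_scale": 6.0,
    "seeds": [0, 1, 2, 3],
}

def evaluate_model(model, score_video, protocol=SHARED_EVAL_PROTOCOL):
    """Average metric over all prompts and seeds; `model.generate` and
    `score_video` are placeholder interfaces, not a real API."""
    with open(protocol["prompt_file"]) as f:
        prompts = [line.strip() for line in f if line.strip()]
    scores = []
    for prompt in prompts:
        for seed in protocol["seeds"]:
            video = model.generate(
                prompt=prompt,
                resolution=protocol["resolution"],
                num_frames=protocol["num_frames"],
                steps=protocol["inference_steps"],
                sampler=protocol["sampler"],
                guidance_scale=protocol["guidance_scale"],
                seed=seed,
            )
            scores.append(score_video(video, prompt))
    return sum(scores) / len(scores)
```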
Circularity Check
No significant circularity in derivation chain
Full rationale
The paper is an empirical technical report on model architecture, data curation, training stages, and benchmark results for HunyuanVideo 1.5. No mathematical derivations, parameter fits presented as predictions, or load-bearing self-citation chains exist; the SOTA claim rests on reported experimental metrics rather than any reduction of outputs to inputs by construction. The claim chain is therefore anchored to external benchmarks rather than to circular internal constructions.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
- Foundation.HierarchyEmergence.hierarchy_emergence_forces_phi (tagged: unclear; the relation between the paper passage and the cited Recognition theorem is uncertain)
  Linked passage: "Extensive experiments demonstrate that this compact and proficient model establishes a new state-of-the-art among open-source video generation models."
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 23 Pith papers
- DCR: Counterfactual Attractor Guidance for Rare Compositional Generation
  DCR uses a counterfactual attractor and projection-based repulsion to suppress default completion bias in diffusion models, improving fidelity for rare compositional prompts while preserving quality.
- HumanScore: Benchmarking Human Motions in Generated Videos
  HumanScore defines six metrics for kinematic plausibility, temporal stability, and biomechanical consistency to benchmark human motions in videos from thirteen state-of-the-art generation models, revealing gaps betwee...
- RoboWM-Bench: A Benchmark for Evaluating World Models in Robotic Manipulation
  RoboWM-Bench evaluates video world models by converting their outputs into executable robot actions and running them on manipulation tasks, showing that physical inconsistencies remain common.
- RoboWM-Bench: A Benchmark for Evaluating World Models in Robotic Manipulation
  RoboWM-Bench evaluates video world models by converting their manipulation video predictions into executable actions validated in simulation, showing that visual plausibility does not guarantee physical executability.
- GS-STVSR: Ultra-Efficient Continuous Spatio-Temporal Video Super-Resolution via 2D Gaussian Splatting
  GS-STVSR achieves state-of-the-art continuous spatio-temporal video super-resolution quality with nearly constant inference time at standard scales and over 3x speedup at extreme scales using 2D Gaussian Splatting.
- Grokking of Diffusion Models: Case Study on Modular Addition
  Diffusion models show grokking on modular addition by composing periodic operand representations in simple data regimes or by separating arithmetic computation from visual denoising across timesteps in varied regimes.
- Efficient Video Diffusion Models: Advancements and Challenges
  A survey that groups efficient video diffusion methods into four paradigms—step distillation, efficient attention, model compression, and cache/trajectory optimization—and outlines open challenges for practical use.
- AnimationBench: Are Video Models Good at Character-Centric Animation?
  AnimationBench is the first benchmark that operationalizes the twelve basic principles of animation and IP preservation into scalable, VLM-assisted metrics for animation-style I2V generation.
- Delta Forcing: Trust Region Steering for Interactive Autoregressive Video Generation
  Delta Forcing uses latent trajectory deltas to adaptively limit unreliable teacher guidance while enforcing monotonic continuity, improving temporal consistency in interactive autoregressive video generation.
- Qwen-Image-VAE-2.0 Technical Report
  Qwen-Image-VAE-2.0 achieves state-of-the-art high-compression image reconstruction and superior diffusability for diffusion models, with a new text-rich document benchmark.
- Beyond Text Prompts: Visual-to-Visual Generation as A Unified Paradigm
  V2V-Zero adapts frozen VLMs for visual conditioning via hidden states from specification pages, scoring 0.85 on GenEval and 32.7 on a new seven-task benchmark while revealing capability hierarchies in attribute bindin...
- SARA: Semantically Adaptive Relational Alignment for Video Diffusion Models
  SARA improves text alignment and motion quality in video diffusion models by routing token-relation distillation supervision to semantically salient pairs using a Stage-1 aligner trained with SAM masks and InfoNCE.
- SwiftI2V: Efficient High-Resolution Image-to-Video Generation via Conditional Segment-wise Generation
  SwiftI2V matches end-to-end 2K I2V quality on VBench while cutting GPU time by 202x via conditional segment-wise generation that bounds token cost and preserves input fidelity.
- SwiftI2V: Efficient High-Resolution Image-to-Video Generation via Conditional Segment-wise Generation
  SwiftI2V achieves comparable 2K I2V quality to end-to-end models on VBench-I2V while cutting GPU time by 202x through low-resolution motion planning followed by strongly image-conditioned segment-wise high-resolution ...
- MMControl: Unified Multi-Modal Control for Joint Audio-Video Generation
  MMControl adds multi-modal controls for identity, timbre, pose, and layout to unified audio-video diffusion models via dual-stream injection and adjustable guidance scaling.
- How Far Are Video Models from True Multimodal Reasoning?
  Current video models succeed on basic understanding but achieve under 25% success on logically grounded generation and near 0% on interactive generation, exposing gaps in multimodal reasoning.
- OmniShow: Unifying Multimodal Conditions for Human-Object Interaction Video Generation
  OmniShow unifies text, image, audio, and pose conditions into an end-to-end model for high-quality human-object interaction video generation and introduces the HOIVG-Bench benchmark, claiming state-of-the-art results.
- InsEdit: Towards Instruction-based Visual Editing via Data-Efficient Video Diffusion Models Adaptation
  InsEdit adapts a video diffusion backbone for text-instruction video editing via Mutual Context Attention, achieving SOTA open-source results with O(100K) data while also supporting image editing.
- MMPhysVideo: Scaling Physical Plausibility in Video Generation via Joint Multimodal Modeling
  MMPhysVideo improves physical plausibility in video diffusion models by jointly modeling RGB with perceptual cues in pseudo-RGB format via a bidirectional teacher-student architecture and a new data curation pipeline.
- Video Generation Models as World Models: Efficient Paradigms, Architectures and Algorithms
  Video generation models can function as world simulators if efficiency gaps in spatiotemporal modeling are bridged via organized paradigms, architectures, and algorithms.
- Controllable Egocentric Video Generation via Occlusion-Aware Sparse 3D Hand Joints
  A new occlusion-aware control module generates high-fidelity egocentric videos from sparse 3D hand joints, supported by a million-clip dataset and cross-embodiment benchmark.
- Motif-Video 2B: Technical Report
  Motif-Video 2B achieves 83.76% VBench score, beating a 14B-parameter baseline with 7x fewer parameters and substantially less training data through shared cross-attention and a three-part backbone.
- Mamoda2.5: Enhancing Unified Multimodal Model with DiT-MoE
  Mamoda2.5 is a 25B-parameter DiT-MoE unified AR-Diffusion model that reaches top video generation and editing benchmarks with 4-step inference up to 95.9x faster than baselines.
Reference graph
Works this paper leans on
- [1] Kuaishou Technology. Kling 2.5 Turbo. https://app.klingai.com/cn/release-notes/2025-09-19, 2025.
- [2] Google DeepMind. Veo 3.1. https://deepmind.google/technologies/veo/, 2025.
- [3]
- [4] Tencent Hunyuan Foundation Model Team. HunyuanVideo: A Systematic Framework for Large Video Generative Models, 2025. https://arxiv.org/abs/2412.03603.
- [5] Guoqing Ma, Haoyang Huang, Kun Yan, Liangyu Chen, Nan Duan, Shengming Yin, Changyi Wan, Ranchen Ming, Xiaoniu Song, Xing Chen, Yu Zhou, Deshan Sun, et al. Step-Video-T2V Technical Report: The Practice, Challenges, and Future of Video Foundation Model, 2025. https://arxiv.org/abs/2502.10248.
- [6] Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, Jianyuan Zeng, Jiayu Wang, Jingfeng Zhang, et al. Wan: Open and Advanced Large-Scale Video Generative Models. arXiv preprint arXiv:2503.20314, 2025.
- [7] Jay Shah, Ganesh Bikshandi, Ying Zhang, Vijay Thakkar, Pradeep Ramani, and Tri Dao. FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-Precision, 2024. https://arxiv.org/abs/2407.08608.
- [8] Kimi Team, Yifan Bai, Yiping Bao, Guanduo Chen, Jiahao Chen, Ningxin Chen, Ruijue Chen, Yanru Chen, Yuankun Chen, Yutian Chen, Zhuofu Chen, Jialei Cui, et al. Kimi K2: Open Agentic Intelligence, 2025. https://arxiv.org/abs/2507.20534.
- [9] Siyu Cao, Hangting Chen, Peng Chen, Yiji Cheng, Yutao Cui, Xinchi Deng, Ying Dong, Kipper Gong, Tianpeng Gu, Xiusen Gu, et al. HunyuanImage 3.0 Technical Report. arXiv preprint arXiv:2509.23951, 2025.
- [10] PySceneDetect Contributors. PySceneDetect: A Python Library for Video Scene Detection, 2020. https://github.com/Breakthrough/PySceneDetect. Accessed: 2023-11-20.
- [11] Haoning Wu, Erli Zhang, Liang Liao, Chaofeng Chen, Jingwen Hou, Annan Wang, Wenxiu Sun, Qiong Yan, and Weisi Lin. Exploring Video Quality Assessment on User Generated Contents from Aesthetic and Technical Perspectives. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 20144–20154, 2023.
- [12] Zhihe Yang, Xufang Luo, Dongqi Han, Yunjian Xu, and Dongsheng Li. Mitigating Hallucinations in Large Vision-Language Models via DPO: On-Policy Data Hold the Key. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 10610–10620, 2025.
- [13] Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, Humen Zhong, Yuanzhi Zhu, Mingkun Yang, Zhaohai Li, Jianqiang Wan, Pengfei Wang, Wei Ding, Zheren Fu, Yiheng Xu, Jiabo Ye, Xi Zhang, Tianbao Xie, Zesen Cheng, Hang Zhang, Zhibo Yang, Haiyang Xu, and Junyang Lin. Qwen2.5-VL Technical Report.
- [14] https://arxiv.org/abs/2502.13923.
- [15] Zeyu Liu, Weicong Liang, Zhanhao Liang, Chong Luo, Ji Li, Gao Huang, and Yuhui Yuan. Glyph-ByT5: A Customized Text Encoder for Accurate Visual Text Rendering, 2024. https://arxiv.org/abs/2403.09622.
- [16] Peiyuan Zhang, Yongqi Chen, Runlong Su, Hangliang Ding, Ion Stoica, Zhengzhong Liu, and Hao Zhang. Fast Video Generation with Sliding Tile Attention. arXiv preprint arXiv:2502.04507, 2025. https://arxiv.org/abs/2502.04507.
- [17] Yuanbo Peng, Penghao Zhao, Jiangfeng Xiong, Songtao Liu, Fang Yang, Jianbing Wu, Zhao Zhong, Key, Linus, Peng Chen, and Jie Jiang. flex-block-attn: An Efficient Block Sparse Attention Communication Library. https://github.com/Tencent-Hunyuan/flex-block-attn, 2025.
- [18] Benjamin F. Spector, Simran Arora, Aaryan Singhal, Daniel Y. Fu, and Christopher Ré. ThunderKittens: Simple, Fast, and Adorable AI Kernels, 2024. https://arxiv.org/abs/2410.20399.
- [19] Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling Rectified Flow Transformers for High-Resolution Image Synthesis. In Forty-first International Conference on Machine Learning, 2024.
- [20] Junzhe Li, Yutao Cui, Tao Huang, Yinping Ma, Chun Fan, Miles Yang, and Zhao Zhong. MixGRPO: Unlocking Flow-Based GRPO Efficiency with Mixed ODE-SDE, 2025. https://arxiv.org/abs/2507.21802.
- [21] Bram Wallace, Meihua Dang, Rafael Rafailov, Linqi Zhou, Aaron Lou, Senthil Purushwalkam, Stefano Ermon, Caiming Xiong, Shafiq Joty, and Nikhil Naik. Diffusion Model Alignment Using Direct Preference Optimization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8228–8238, 2024.
- [22] Jintao Zhang, Haofeng Huang, Pengle Zhang, Jia Wei, Jun Zhu, and Jianfei Chen. SageAttention2: Efficient Attention with Thorough Outlier Smoothing and Per-Thread INT4 Quantization. In International Conference on Machine Learning (ICML), 2025.