Through the PRISM: Preference Representation in Intermediate States of Video Diffusion Models

Haoxuan Wu; Hongzheng Yang; Kun Li; Lai Man Po; Mengyang Liu; Wei Liu

arxiv: 2606.20310 · v1 · pith:RILFPAWOnew · submitted 2026-06-18 · 💻 cs.CV

Through the PRISM: Preference Representation in Intermediate States of Video Diffusion Models

Haoxuan Wu , Lai Man Po , Mengyang Liu , Kun Li , Hongzheng Yang , Wei Liu This is my paper

Pith reviewed 2026-06-26 18:23 UTC · model grok-4.3

classification 💻 cs.CV

keywords video diffusionpreference modelingnoisy latentsBest-of-N samplingreward decodingquery-based aggregationgenerative evaluationself-improving backbones

0 comments

The pith

A frozen video diffusion backbone can decode user preferences directly from its noisy intermediate latents via a lightweight query head.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper asks whether preference evaluation must wait for clean pixels or can instead read signals already present inside the diffusion process itself. It attaches a small query-based aggregation head to an untouched pre-trained video diffusion model and shows that the head extracts preference rankings from noisy latents at state-of-the-art accuracy. Because the signals remain readable even at high noise levels, the method supports Best-of-N filtering at the very first denoising steps, discarding weak trajectories before most compute is spent. The same setup also reveals a correlation between a backbone's generative strength and its ability to judge preferences, suggesting the two capabilities grow together. If the approach holds, reward modeling and generation can share the same frozen weights rather than requiring separate clean-image evaluators.

Core claim

PRISM shows that preference signals are already linearly or query-decodable from the intermediate noisy latents of a frozen video diffusion backbone. A lightweight Query-based Aggregation head attached to this backbone extracts those signals with higher accuracy than prior clean-pixel reward models while remaining robust across noise levels, which in turn permits early-stage Best-of-N sampling that filters suboptimal candidates before most denoising steps occur. The paper further reports a positive correlation between a backbone's generative performance and its inherent preference discrimination power.

What carries the argument

The Query-based Aggregation head, a lightweight module that pools information across noisy latents to output preference scores without altering the frozen diffusion backbone.

If this is right

Best-of-N sampling can begin at the first denoising step instead of after full generation, cutting total compute while raising output quality.
Generative performance and evaluative power are positively correlated, so stronger backbones automatically become stronger judges.
Reward modeling can reuse the same frozen weights used for generation rather than requiring separate clean-frame networks.
Video alignment pipelines can operate entirely inside the diffusion process without VAE decoding at every evaluation step.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same query-head approach might be tested on image or audio diffusion models to check whether the noise-robust preference signal is modality-specific or general.
If the correlation between generation and evaluation holds, iterative self-improvement loops could alternate generation and preference filtering using only the backbone plus the head.
Early rejection at high noise could be combined with existing consistency or distillation methods to further reduce sampling cost.

Load-bearing premise

Preference signals already exist in a decodable form inside the noisy latents of a pre-trained diffusion backbone and do not require any fine-tuning of that backbone.

What would settle it

Training the query head on latents from a deliberately weak video diffusion backbone and measuring whether preference accuracy falls to chance level or whether early Best-of-N sampling fails to improve final video quality.

Figures

Figures reproduced from arXiv: 2606.20310 by Haoxuan Wu, Hongzheng Yang, Kun Li, Lai Man Po, Mengyang Liu, Wei Liu.

**Figure 2.** Figure 2: Preference alignment performance across various noise levels [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗

**Figure 3.** Figure 3: Best-of-N (BoN) sampling pipeline empowered by PRISM. Unlike conventional evaluation methods that require executing the full denoising process and VAE decoding for all candidates, PRISM performs early-stage evaluation directly in the latent space. At an intermediate timestep, PRISM scores the high-noise latents and identifies the optimal candidate. Consequently, the remaining N − 1 suboptimal trajectories … view at source ↗

**Figure 4.** Figure 4: Qualitative comparison of BoN results. Under identical prompts, PRISM consistently identifies samples with superior semantic fidelity and physical consistency compared to pixel-based baselines (e.g., VideoReward and VideoScore2). PRISM excels in capturing precise subject composition and articulated motion, which are often compromised in baseline-guided selections. retical luxury into a highly practical de… view at source ↗

**Figure 5.** Figure 5: Efficiency-quality trade-off of Best-of- [PITH_FULL_IMAGE:figures/full_fig_p013_5.png] view at source ↗

**Figure 6.** Figure 6: Comparative visualization of attention maps in the Query-based Ag [PITH_FULL_IMAGE:figures/full_fig_p014_6.png] view at source ↗

**Figure 1.** Figure 1: Detailed breakdown of inference time cost during Best-of-5 sam [PITH_FULL_IMAGE:figures/full_fig_p021_1.png] view at source ↗

**Figure 2.** Figure 2: Ablation study on timestep sampling distributions during training. [PITH_FULL_IMAGE:figures/full_fig_p022_2.png] view at source ↗

**Figure 3.** Figure 3: Extra qualitative comparison of BoN results. [PITH_FULL_IMAGE:figures/full_fig_p023_3.png] view at source ↗

**Figure 4.** Figure 4: BoN efficiency-quality trade-off (N ∈ {3, 5, 10}) [PITH_FULL_IMAGE:figures/full_fig_p025_4.png] view at source ↗

read the original abstract

Evaluating video generation with clean, pixel-based reward models disconnects evaluation from the noisy diffusion process and incurs massive VAE decoding costs. In this paper, we challenge this paradigm by asking a fundamental question: Can a powerful video generator inherently discriminate preferences directly from noisy latents? To answer this, we introduce \textbf{PRISM} (\textbf{P}reference \textbf{R}epresentation in \textbf{I}ntermediate \textbf{S}tates of Diffusion \textbf{M}odels). PRISM employs a lightweight Query-based Aggregation head with a frozen video diffusion backbone to decode preference signals from noisy latents. Surprisingly, PRISM not only achieves SOTA preference accuracy but also unlocks strong noise-robustness, which enables early-stage Best-of-$N$ sampling. This allows for filtering suboptimal candidates at the very beginning of denoising, drastically reducing computation while boosting video quality. We also reveal a strong positive correlation between a backbone's generative performance and its inherent evaluative power, enabling self-improving video backbones.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

PRISM shows you can read preference signals straight from noisy latents in a frozen video diffusion backbone with a small query head, which supports early filtering.

read the letter

PRISM's main point is that preference information already sits in the intermediate noisy states of video diffusion models and a lightweight head can pull it out without retraining the backbone.

The concrete contribution is the query-based aggregation head that decodes those signals across noise levels. The work does well by reporting strong accuracy and noise robustness that lets them run Best-of-N sampling at the start of denoising instead of after full generation. That setup directly targets the compute cost of VAE decoding and post-hoc rewards. The observed correlation between generative performance and internal evaluative power is a clean empirical note that follows from the same frozen-backbone tests.

The results rest on how the head is trained and what preference data is used, so the generalization claims need the full ablations and cross-model checks to hold. The correlation is presented as an observation rather than a controlled finding, which leaves room for other explanations.

People working on video diffusion, reward modeling, or efficient sampling will get the most from this. The method is specific enough and the efficiency angle is practical enough that it deserves a full referee pass to verify the numbers and controls.

I would send it to peer review.

Referee Report

0 major / 3 minor

Summary. The paper introduces PRISM, which attaches a lightweight Query-based Aggregation head to a frozen pre-trained video diffusion backbone to decode preference signals directly from noisy intermediate latents. It reports state-of-the-art preference accuracy, strong noise robustness that enables early-stage Best-of-N sampling (filtering before full denoising), and an empirical positive correlation between a backbone's generative quality and its inherent evaluative power.

Significance. If the reported results hold, the work is significant because it demonstrates that preference information is already linearly/query-decodable from the diffusion process itself, eliminating the need for separate clean-pixel reward models and costly VAE decoding. The noise-robust early filtering result offers a concrete route to lower compute in Best-of-N pipelines, while the generative-evaluative correlation is a falsifiable empirical observation that could support self-improving video models.

minor comments (3)

[§3] §3 (method): the Query-based Aggregation head is described at a high level; an explicit equation or pseudocode showing how the learned query aggregates over the noisy latent features would improve reproducibility.
[Table 2, Figure 4] Table 2 and Figure 4: the noise-robustness curves would benefit from error bars or multiple random seeds to confirm that the early-stage Best-of-N gains are statistically reliable across backbones.
The abstract and §1 both use 'SOTA preference accuracy' without immediately citing the exact metric (e.g., accuracy@K or pairwise preference) and the primary competing reward models; this should be stated explicitly on first use.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive summary, significance assessment, and recommendation for minor revision. No major comments were listed in the report.

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper's central construction is a new lightweight query-based head applied to a frozen pre-trained video diffusion backbone to extract preference signals from noisy latents. Claims of SOTA accuracy, noise robustness, early Best-of-N sampling, and generative-evaluative correlation are presented as empirical observations from experiments, not as mathematical derivations or predictions that reduce by construction to fitted parameters or self-citations. No load-bearing step equates a result to its own inputs via definition, renaming, or self-referential fitting. The method is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no explicit free parameters, axioms, or invented entities beyond the high-level method description; the central claim rests on the empirical existence of preference signals in noisy latents.

pith-pipeline@v0.9.1-grok · 5717 in / 1101 out tokens · 24581 ms · 2026-06-26T18:23:36.815280+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

43 extracted references · 22 canonical work pages · 16 internal anchors

[1]

arXiv preprint arXiv:2502.01051 , year=

Diffusion model as a noise-aware latent reward model for step-level preference optimization , author=. arXiv preprint arXiv:2502.01051 , year=

work page arXiv
[2]

2025 , eprint=

VideoScore2: Think before You Score in Generative Video Evaluation , author=. 2025 , eprint=

2025
[3]

Unified Reward Model for Multimodal Understanding and Generation

Unified reward model for multimodal understanding and generation , author=. arXiv preprint arXiv:2503.05236 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[4]

Improving Video Generation with Human Feedback

Improving video generation with human feedback , author=. arXiv preprint arXiv:2501.13918 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[5]

Wan: Open and Advanced Large-Scale Video Generative Models

Wan: Open and Advanced Large-Scale Video Generative Models , author=. arXiv preprint arXiv:2503.20314 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[6]

CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer

CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer , author=. arXiv preprint arXiv:2408.06072 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[7]

Open-Sora: Democratizing Efficient Video Production for All

Open-sora: Democratizing efficient video production for all , author=. arXiv preprint arXiv:2412.20404 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[8]

HunyuanVideo: A Systematic Framework For Large Video Generative Models

Hunyuanvideo: A systematic framework for large video generative models , author=. arXiv preprint arXiv:2412.03603 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[9]

2025 , eprint=

SkyReels-V2: Infinite-length Film Generative Model , author=. 2025 , eprint=

2025
[10]

P. V. Rao and L. L. Kupper , journal =. Ties in Paired-Comparison Experiments: A Generalization of the Bradley-Terry Model , urldate =
[11]

International conference on machine learning , pages=

Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models , author=. International conference on machine learning , pages=. 2023 , organization=

2023
[12]

International conference on machine learning , pages=

Learning transferable visual models from natural language supervision , author=. International conference on machine learning , pages=. 2021 , organization=

2021
[13]

The Eleventh International Conference on Learning Representations , year=

Flow Matching for Generative Modeling , author=. The Eleventh International Conference on Learning Representations , year=
[14]

International Conference on Medical image computing and computer-assisted intervention , pages=

U-net: Convolutional networks for biomedical image segmentation , author=. International Conference on Medical image computing and computer-assisted intervention , pages=. 2015 , organization=

2015
[15]

Advances in neural information processing systems , volume=

Attention is all you need , author=. Advances in neural information processing systems , volume=
[16]

2022 , journal=

Scalable Diffusion Models with Transformers , author=. 2022 , journal=

2022
[17]

ArXiv , year =

VideoScore: Building Automatic Metrics to Simulate Fine-grained Human Feedback for Video Generation , author =. ArXiv , year =
[18]

Huang, Ziqi and He, Yinan and Yu, Jiashuo and Zhang, Fan and Si, Chenyang and Jiang, Yuming and Zhang, Yuanhan and Wu, Tianxing and Jin, Qingyang and Chanpaisit, Nattapol and Wang, Yaohui and Chen, Xinyuan and Wang, Limin and Lin, Dahua and Qiao, Yu and Liu, Ziwei , booktitle=
[19]

NeurIPS , year=

UniPC: A Unified Predictor-Corrector Framework for Fast Sampling of Diffusion Models , author=. NeurIPS , year=
[20]

Decoupled Weight Decay Regularization

Decoupled weight decay regularization , author=. arXiv preprint arXiv:1711.05101 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[21]

Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution , author=. arXiv preprint arXiv:2409.12191 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[22]

Transactions on Machine Learning Research , year=

MANTIS: Interleaved Multi-Image Instruction Tuning , author=. Transactions on Machine Learning Research , year=
[23]

Qwen2.5-VL Technical Report

Qwen2.5-VL Technical Report , author=. arXiv preprint arXiv:2502.13923 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[24]

LLaVA-OneVision: Easy Visual Task Transfer

LLaVA-OneVision: Easy Visual Task Transfer , author=. arXiv preprint arXiv:2408.03326 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[25]

arXiv preprint arXiv:2501.19252 , year =

Inference-Time Text-to-Video Alignment with Diffusion Latent Beam Search , author =. arXiv preprint arXiv:2501.19252 , year =

work page arXiv
[26]

2025 , booktitle=

Representation Alignment for Generation: Training Diffusion Transformers Is Easier Than You Think , author=. 2025 , booktitle=

2025
[27]

Seedance 1.0: Exploring the Boundaries of Video Generation Models

Seedance 1.0: Exploring the boundaries of video generation models , author=. arXiv preprint arXiv:2506.09113 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[28]

Transactions on Machine Learning Research , year=

Latte: Latent Diffusion Transformer for Video Generation , author=. Transactions on Machine Learning Research , year=
[29]

Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets

Stable video diffusion: Scaling latent video diffusion models to large datasets , author=. arXiv preprint arXiv:2311.15127 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[30]

arXiv preprint arXiv:2503.18942 , year=

Video-T1: Test-Time Scaling for Video Generation , author=. arXiv preprint arXiv:2503.18942 , year=

work page arXiv
[31]

Proceedings of the 37th International Conference on Neural Information Processing Systems , pages=

ImageReward: learning and evaluating human preferences for text-to-image generation , author=. Proceedings of the 37th International Conference on Neural Information Processing Systems , pages=
[32]

Human Preference Score v2: A Solid Benchmark for Evaluating Human Preferences of Text-to-Image Synthesis

Human Preference Score v2: A Solid Benchmark for Evaluating Human Preferences of Text-to-Image Synthesis , author=. arXiv preprint arXiv:2306.09341 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[33]

Inference-Time Scaling for Diffusion Models beyond Scaling Denoising Steps

Inference-time scaling for diffusion models beyond scaling denoising steps , author=. arXiv preprint arXiv:2501.09732 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[34]

, author=

LiFT: Leveraging Human Feedback for Text-to-Video Model Alignment. , author=. arXiv preprint arXiv:2412.04814 , year=

work page arXiv
[35]

The Eleventh International Conference on Learning Representations , year=

UniMax: Fairer and More Effective Language Sampling for Large-Scale Multilingual Pretraining , author=. The Eleventh International Conference on Learning Representations , year=
[36]

Liu , title =

Colin Raffel and Noam Shazeer and Adam Roberts and Katherine Lee and Sharan Narang and Michael Matena and Yanqi Zhou and Wei Li and Peter J. Liu , title =. Journal of Machine Learning Research , year =
[37]

arXiv preprint , year=

Video Generation Models are Good Latent Reward Models , author=. arXiv preprint , year=
[38]

arXiv preprint arXiv:2406.03035 , year=

Towards multiple character image animation through enhancing implicit decoupling , author=. arXiv preprint arXiv:2406.03035 , year=

work page arXiv
[39]

2025 , eprint=

A General Framework for Inference-time Scaling and Steering of Diffusion Models , author=. 2025 , eprint=

2025
[40]

Advances in neural information processing systems , volume=

Training language models to follow instructions with human feedback , author=. Advances in neural information processing systems , volume=
[41]

Proximal Policy Optimization Algorithms

Proximal policy optimization algorithms , author=. arXiv preprint arXiv:1707.06347 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[42]

Llama 2: Open Foundation and Fine-Tuned Chat Models

Llama 2: Open foundation and fine-tuned chat models , author=. arXiv preprint arXiv:2307.09288 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[43]

arXiv preprint arXiv:2503.01103 , year=

Direct Discriminative Optimization: Your Likelihood-Based Visual Generative Model is Secretly a GAN Discriminator , author=. arXiv preprint arXiv:2503.01103 , year=

work page arXiv

[1] [1]

arXiv preprint arXiv:2502.01051 , year=

Diffusion model as a noise-aware latent reward model for step-level preference optimization , author=. arXiv preprint arXiv:2502.01051 , year=

work page arXiv

[2] [2]

2025 , eprint=

VideoScore2: Think before You Score in Generative Video Evaluation , author=. 2025 , eprint=

2025

[3] [3]

Unified Reward Model for Multimodal Understanding and Generation

Unified reward model for multimodal understanding and generation , author=. arXiv preprint arXiv:2503.05236 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[4] [4]

Improving Video Generation with Human Feedback

Improving video generation with human feedback , author=. arXiv preprint arXiv:2501.13918 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[5] [5]

Wan: Open and Advanced Large-Scale Video Generative Models

Wan: Open and Advanced Large-Scale Video Generative Models , author=. arXiv preprint arXiv:2503.20314 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[6] [6]

CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer

CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer , author=. arXiv preprint arXiv:2408.06072 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[7] [7]

Open-Sora: Democratizing Efficient Video Production for All

Open-sora: Democratizing efficient video production for all , author=. arXiv preprint arXiv:2412.20404 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[8] [8]

HunyuanVideo: A Systematic Framework For Large Video Generative Models

Hunyuanvideo: A systematic framework for large video generative models , author=. arXiv preprint arXiv:2412.03603 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[9] [9]

2025 , eprint=

SkyReels-V2: Infinite-length Film Generative Model , author=. 2025 , eprint=

2025

[10] [10]

P. V. Rao and L. L. Kupper , journal =. Ties in Paired-Comparison Experiments: A Generalization of the Bradley-Terry Model , urldate =

[11] [11]

International conference on machine learning , pages=

Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models , author=. International conference on machine learning , pages=. 2023 , organization=

2023

[12] [12]

International conference on machine learning , pages=

Learning transferable visual models from natural language supervision , author=. International conference on machine learning , pages=. 2021 , organization=

2021

[13] [13]

The Eleventh International Conference on Learning Representations , year=

Flow Matching for Generative Modeling , author=. The Eleventh International Conference on Learning Representations , year=

[14] [14]

International Conference on Medical image computing and computer-assisted intervention , pages=

U-net: Convolutional networks for biomedical image segmentation , author=. International Conference on Medical image computing and computer-assisted intervention , pages=. 2015 , organization=

2015

[15] [15]

Advances in neural information processing systems , volume=

Attention is all you need , author=. Advances in neural information processing systems , volume=

[16] [16]

2022 , journal=

Scalable Diffusion Models with Transformers , author=. 2022 , journal=

2022

[17] [17]

ArXiv , year =

VideoScore: Building Automatic Metrics to Simulate Fine-grained Human Feedback for Video Generation , author =. ArXiv , year =

[18] [18]

Huang, Ziqi and He, Yinan and Yu, Jiashuo and Zhang, Fan and Si, Chenyang and Jiang, Yuming and Zhang, Yuanhan and Wu, Tianxing and Jin, Qingyang and Chanpaisit, Nattapol and Wang, Yaohui and Chen, Xinyuan and Wang, Limin and Lin, Dahua and Qiao, Yu and Liu, Ziwei , booktitle=

[19] [19]

NeurIPS , year=

UniPC: A Unified Predictor-Corrector Framework for Fast Sampling of Diffusion Models , author=. NeurIPS , year=

[20] [20]

Decoupled Weight Decay Regularization

Decoupled weight decay regularization , author=. arXiv preprint arXiv:1711.05101 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[21] [21]

Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution , author=. arXiv preprint arXiv:2409.12191 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[22] [22]

Transactions on Machine Learning Research , year=

MANTIS: Interleaved Multi-Image Instruction Tuning , author=. Transactions on Machine Learning Research , year=

[23] [23]

Qwen2.5-VL Technical Report

Qwen2.5-VL Technical Report , author=. arXiv preprint arXiv:2502.13923 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[24] [24]

LLaVA-OneVision: Easy Visual Task Transfer

LLaVA-OneVision: Easy Visual Task Transfer , author=. arXiv preprint arXiv:2408.03326 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[25] [25]

arXiv preprint arXiv:2501.19252 , year =

Inference-Time Text-to-Video Alignment with Diffusion Latent Beam Search , author =. arXiv preprint arXiv:2501.19252 , year =

work page arXiv

[26] [26]

2025 , booktitle=

Representation Alignment for Generation: Training Diffusion Transformers Is Easier Than You Think , author=. 2025 , booktitle=

2025

[27] [27]

Seedance 1.0: Exploring the Boundaries of Video Generation Models

Seedance 1.0: Exploring the boundaries of video generation models , author=. arXiv preprint arXiv:2506.09113 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[28] [28]

Transactions on Machine Learning Research , year=

Latte: Latent Diffusion Transformer for Video Generation , author=. Transactions on Machine Learning Research , year=

[29] [29]

Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets

Stable video diffusion: Scaling latent video diffusion models to large datasets , author=. arXiv preprint arXiv:2311.15127 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[30] [30]

arXiv preprint arXiv:2503.18942 , year=

Video-T1: Test-Time Scaling for Video Generation , author=. arXiv preprint arXiv:2503.18942 , year=

work page arXiv

[31] [31]

Proceedings of the 37th International Conference on Neural Information Processing Systems , pages=

ImageReward: learning and evaluating human preferences for text-to-image generation , author=. Proceedings of the 37th International Conference on Neural Information Processing Systems , pages=

[32] [32]

Human Preference Score v2: A Solid Benchmark for Evaluating Human Preferences of Text-to-Image Synthesis

Human Preference Score v2: A Solid Benchmark for Evaluating Human Preferences of Text-to-Image Synthesis , author=. arXiv preprint arXiv:2306.09341 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[33] [33]

Inference-Time Scaling for Diffusion Models beyond Scaling Denoising Steps

Inference-time scaling for diffusion models beyond scaling denoising steps , author=. arXiv preprint arXiv:2501.09732 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[34] [34]

, author=

LiFT: Leveraging Human Feedback for Text-to-Video Model Alignment. , author=. arXiv preprint arXiv:2412.04814 , year=

work page arXiv

[35] [35]

The Eleventh International Conference on Learning Representations , year=

UniMax: Fairer and More Effective Language Sampling for Large-Scale Multilingual Pretraining , author=. The Eleventh International Conference on Learning Representations , year=

[36] [36]

Liu , title =

Colin Raffel and Noam Shazeer and Adam Roberts and Katherine Lee and Sharan Narang and Michael Matena and Yanqi Zhou and Wei Li and Peter J. Liu , title =. Journal of Machine Learning Research , year =

[37] [37]

arXiv preprint , year=

Video Generation Models are Good Latent Reward Models , author=. arXiv preprint , year=

[38] [38]

arXiv preprint arXiv:2406.03035 , year=

Towards multiple character image animation through enhancing implicit decoupling , author=. arXiv preprint arXiv:2406.03035 , year=

work page arXiv

[39] [39]

2025 , eprint=

A General Framework for Inference-time Scaling and Steering of Diffusion Models , author=. 2025 , eprint=

2025

[40] [40]

Advances in neural information processing systems , volume=

Training language models to follow instructions with human feedback , author=. Advances in neural information processing systems , volume=

[41] [41]

Proximal Policy Optimization Algorithms

Proximal policy optimization algorithms , author=. arXiv preprint arXiv:1707.06347 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[42] [42]

Llama 2: Open Foundation and Fine-Tuned Chat Models

Llama 2: Open foundation and fine-tuned chat models , author=. arXiv preprint arXiv:2307.09288 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[43] [43]

arXiv preprint arXiv:2503.01103 , year=

Direct Discriminative Optimization: Your Likelihood-Based Visual Generative Model is Secretly a GAN Discriminator , author=. arXiv preprint arXiv:2503.01103 , year=

work page arXiv