CaC: Advancing Video Reward Models via Hierarchical Spatiotemporal Concentrating

Boheng Zhang; Chunyu Lin; Dewen Fan; Fan Yang; Fei Zuo; Guosheng Lin; Haonan Fan; Honglie Wang; Huaiqing Wang; Huan Ouyang

arxiv: 2605.11723 · v2 · pith:2FXG3G62new · submitted 2026-05-12 · 💻 cs.CV · cs.AI

CaC: Advancing Video Reward Models via Hierarchical Spatiotemporal Concentrating

Jiyuan Wang , Huan Ouyang , Jiuzhou Lin , Chunyu Lin , Dewen Fan , Boheng Zhang , Haonan Fan , Fei Zuo

show 10 more authors

Jia Sun Huaiqing Wang Honglie Wang Yiyang Fan Zhenlong Yuan Zijun Li Yongrui Heng Guosheng Lin Fan Yang Tingting Gao

This is my paper

Pith reviewed 2026-06-30 22:36 UTC · model grok-4.3

classification 💻 cs.CV cs.AI

keywords video anomaly detectionreward modelsvision language modelsspatiotemporal reasoningreinforcement learninggenerated video qualitychain of thought

0 comments

The pith

CaC improves video reward models by first scanning time for anomalies then zooming spatially with structured reasoning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents CaC as a coarse-to-fine model that anchors anomalous time windows across an entire video, then grounds specific spatial regions inside those windows, and finally reaches judgments through spatiotemporal chain-of-thought steps. This structure is enabled by a new dataset of generated videos carrying per-frame boxes, temporal windows, and attribution labels, plus a three-stage training path that begins with supervised anchoring and ends with reinforcement learning that adds temporal and spatial IoU rewards. The approach is shown to raise accuracy on fine-grained anomaly benchmarks and to serve as a reward signal that lowers anomalies in newly generated videos. A sympathetic reader would care because current reward models often miss subtle, localized defects that degrade video quality. The central mechanism is therefore the staged concentration process itself rather than any single network layer.

Core claim

CaC is a Vision-Language Model that, during inference, first performs a global temporal scan to anchor anomalous time windows, then executes fine-grained spatial grounding inside the selected interval, and finally produces judgments via structured spatiotemporal Chain-of-Thought reasoning. The model acquires these abilities through a three-stage progressive training paradigm on a newly built large-scale generated-video anomaly dataset that supplies per-frame bounding boxes, temporal windows, and fine-grained attribution labels; the stages consist of single- and multi-frame supervised fine-tuning followed by reinforcement learning that employs Group Relative Policy Optimization together with

What carries the argument

Hierarchical spatiotemporal concentrating: a coarse-to-fine inference sequence that first anchors time windows globally, then refines spatial locations locally, and finally applies structured chain-of-thought reasoning.

If this is right

Accuracy on fine-grained anomaly benchmarks rises by 25.7 percent.
When used as a reward signal, generated-video anomalies drop by 11.7 percent while overall video quality improves.
The added Temporal and Spatial IoU rewards guide the model toward more grounded intermediate localization outputs.
Structured spatiotemporal Chain-of-Thought reasoning produces more interpretable judgments than direct classification.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same staged concentration pattern could be tested on real captured video rather than only generated video to check transfer.
The GRPO stage with explicit IoU rewards might be reusable in other video tasks that require precise localization before classification.
If the dataset's anomaly distribution is narrow, performance on rare or compound anomalies could degrade even if average metrics stay high.

Load-bearing premise

The newly constructed large-scale generated video anomaly dataset with per-frame bounding boxes, temporal windows, and attribution labels is representative enough of real generated-video anomalies to let the hierarchical training generalize.

What would settle it

Measure whether the reported accuracy gains and anomaly reductions persist when CaC is evaluated on videos produced by generator models whose outputs were never seen during dataset construction or training.

Figures

Figures reproduced from arXiv: 2605.11723 by Boheng Zhang, Chunyu Lin, Dewen Fan, Fan Yang, Fei Zuo, Guosheng Lin, Haonan Fan, Honglie Wang, Huaiqing Wang, Huan Ouyang, Jia Sun, Jiuzhou Lin, Jiyuan Wang, Tingting Gao, Yiyang Fan, Yongrui Heng, Zhenlong Yuan, Zijun Li.

**Figure 2.** Figure 2: The processing pipeline for our CaC dataset. Please refer to Sec. 3.1 for details. [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗

**Figure 3.** Figure 3: Overview of the CaC Reward Model’s Coarse-to-Fine three-stage training paradigm. Please [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗

**Figure 4.** Figure 4: Quantitative comparison on CaC-Bench-Hard across four difficulty splits (we report recall [PITH_FULL_IMAGE:figures/full_fig_p008_4.png] view at source ↗

read the original abstract

In this paper, we propose Concentrate and Concentrate (CaC), a coarse-to-fine anomaly reward model based on Vision-Language Models. During inference, it first conducts a global temporal scan to anchor anomalous time windows, then performs fine-grained spatial grounding within the localized interval, and finally derives robust judgments via structured spatiotemporal Chain-of-Thought reasoning. To equip the model with these capabilities, we construct the first large-scale generated video anomaly dataset with per-frame bounding-box annotations, temporal anomaly windows, and fine-grained attribution labels. Building on this dataset, we design a three-stage progressive training paradigm. The model initially learns spatial and temporal anchoring through single- and multi-frame supervised fine-tuning, and then is optimized by a reinforcement learning strategy based on two-turn Group Relative Policy Optimization (GRPO). Beyond conventional accuracy rewards, we introduce Temporal and Spatial IoU rewards to supervise the intermediate localization process, effectively guiding the model toward more grounded and interpretable spatiotemporal reasoning. Extensive experiments demonstrate that CaC can stably concentrate on subtle anomalies, achieving a 25.7% accuracy improvement on fine-grained anomaly benchmarks and, when used as a reward signal, CaC reduces generated-video anomalies by 11.7% while improving overall video quality.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

CaC adds a new generated-video anomaly dataset and staged GRPO training with IoU rewards, but the gains rest on unverified dataset representativeness.

read the letter

CaC adds the first large-scale generated video anomaly dataset with per-frame bounding boxes, temporal windows, and attribution labels, plus a three-stage training pipeline that starts with supervised fine-tuning on single and multi-frame data before moving to GRPO with added Temporal and Spatial IoU rewards.

The dataset construction and the IoU rewards on the localization steps are the clearest new pieces. The coarse-to-fine inference flow—global temporal scan, then spatial grounding inside the window, then structured CoT—gives the model an explicit way to focus on subtle issues rather than treating the whole video at once.

The main soft spot is the dataset. Both the 25.7% accuracy gain on fine-grained benchmarks and the 11.7% anomaly reduction when the model is used as a reward depend on the generated anomalies being close enough to real ones for the training to transfer. The abstract supplies no dataset size, diversity stats, or external validation against established real-world anomaly video collections, so it is possible the numbers are inflated by easier synthetic cases. There are also no ablations on the stages or rewards and no error bars, which makes it hard to judge how stable the improvements are.

This paper is for researchers building reward models for video generation or working on VLM-based anomaly detection. Someone in that niche could take the training recipe or the annotation scheme and adapt it.

It deserves peer review because the new data and concrete method are substantive enough to evaluate, even if the experiments will need more rigor.

Referee Report

2 major / 2 minor

Summary. The paper proposes CaC, a coarse-to-fine anomaly reward model for videos based on Vision-Language Models. During inference it performs a global temporal scan to anchor anomalous time windows, followed by fine-grained spatial grounding and structured spatiotemporal Chain-of-Thought reasoning. To train the model, the authors construct a new large-scale generated video anomaly dataset with per-frame bounding-box annotations, temporal windows, and attribution labels. Training proceeds in three stages: single- and multi-frame supervised fine-tuning, followed by reinforcement learning via two-turn Group Relative Policy Optimization (GRPO) that incorporates conventional accuracy rewards plus novel Temporal and Spatial IoU rewards. Experiments claim a 25.7% accuracy improvement on fine-grained anomaly benchmarks and an 11.7% reduction in generated-video anomalies when CaC is used as a reward signal, together with improved overall video quality.

Significance. If the empirical gains prove robust and the synthetic dataset transfers, the hierarchical concentrating paradigm plus IoU-supervised GRPO could meaningfully advance video reward models by enabling more interpretable and localized anomaly detection, thereby improving downstream video generation quality. The construction of the first large-scale annotated generated-video anomaly dataset and the explicit supervision of intermediate localization steps constitute concrete, potentially reusable contributions.

major comments (2)

[§3.1] §3.1 (Dataset Construction): the central generalization claims (25.7% accuracy gain and 11.7% anomaly reduction) rest on the assumption that the constructed generated-video anomaly dataset is distributionally representative of real-world anomalies. No comparison or validation against established real-world corpora (e.g., UCSD Pedestrian, CUHK Avenue) is described; if synthetic anomalies are systematically shorter, less contextually subtle, or easier to localize, the hierarchical training and IoU rewards may produce inflated numbers that fail to transfer.
[§4] §4 (Experiments): the reported improvements are presented without error bars, statistical significance tests, or ablation tables isolating the contribution of the coarse-to-fine stages, the GRPO procedure, and the Temporal/Spatial IoU rewards. This absence makes it impossible to verify that the hierarchical process and RL rewards causally produce the claimed gains rather than other factors such as dataset scale or base VLM choice.

minor comments (2)

[Abstract] Abstract: the specific fine-grained anomaly benchmarks on which the 25.7% improvement is measured are not named, reducing immediate clarity.
[§2] §2 (Related Work): the positioning relative to prior video anomaly detection and VLM-based reward models could be tightened by citing the most directly comparable recent works on spatiotemporal grounding.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their insightful comments on our manuscript. We provide point-by-point responses to the major concerns below and commit to incorporating necessary revisions to strengthen the paper.

read point-by-point responses

Referee: [§3.1] §3.1 (Dataset Construction): the central generalization claims (25.7% accuracy gain and 11.7% anomaly reduction) rest on the assumption that the constructed generated-video anomaly dataset is distributionally representative of real-world anomalies. No comparison or validation against established real-world corpora (e.g., UCSD Pedestrian, CUHK Avenue) is described; if synthetic anomalies are systematically shorter, less contextually subtle, or easier to localize, the hierarchical training and IoU rewards may produce inflated numbers that fail to transfer.

Authors: We thank the referee for highlighting this important point. Our dataset and method are specifically designed for detecting and mitigating anomalies in AI-generated videos, which is the primary application for using CaC as a reward model in video generation pipelines. The fine-grained anomaly benchmarks used in our experiments are standard in the field and include scenarios relevant to generated content. Nevertheless, to address the concern about transferability, in the revised version we will add a new subsection in §3.1 discussing the differences and similarities between our generated anomaly dataset and real-world anomaly datasets like UCSD Pedestrian and CUHK Avenue, including statistics on anomaly duration, context, and localization difficulty. We will also provide qualitative comparisons. We believe this will clarify the scope and limitations without altering the core claims, as our focus is on the generated video domain. revision: partial
Referee: [§4] §4 (Experiments): the reported improvements are presented without error bars, statistical significance tests, or ablation tables isolating the contribution of the coarse-to-fine stages, the GRPO procedure, and the Temporal/Spatial IoU rewards. This absence makes it impossible to verify that the hierarchical process and RL rewards causally produce the claimed gains rather than other factors such as dataset scale or base VLM choice.

Authors: We agree that the experimental section would benefit from additional statistical rigor and ablations. In the revised manuscript, we will include error bars (standard deviation over multiple runs) for all reported metrics, perform statistical significance tests (e.g., paired t-tests) where appropriate, and add comprehensive ablation studies in §4. These will isolate the contributions of the coarse-to-fine inference stages, the GRPO training, and the novel Temporal/Spatial IoU rewards. We will also compare against variants using only accuracy rewards to demonstrate the causal impact of the IoU components. This will strengthen the evidence for our claims. revision: yes

Circularity Check

0 steps flagged

No significant circularity; claims are empirical and self-contained.

full rationale

The paper presents no mathematical derivations, equations, or fitted parameters that reduce to inputs by construction. Central claims rest on constructing a new dataset and applying a three-stage training procedure (supervised fine-tuning followed by GRPO with IoU rewards), evaluated via accuracy and anomaly reduction metrics. No self-definitional steps, uniqueness theorems, or ansatzes smuggled via citation are present. The derivation chain is absent; results are benchmarked externally rather than forced by internal definitions.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no explicit free parameters, axioms, or invented entities; all technical details are absent.

pith-pipeline@v0.9.1-grok · 5812 in / 1137 out tokens · 27357 ms · 2026-06-30T22:36:18.731555+00:00 · methodology

Review history (2 revisions) →

discussion (0)

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

DetailAnywhere: Fashion Detail Generation via Cross-Modal Feature Alignment Distillation
cs.CV 2026-07 unverdicted novelty 6.0

Formalizes Fashion Detail Generation task, releases FDBench benchmark with 40K+ pairs, and proposes CFAD distillation method plus RL consistency reward that outperforms open-source baselines.
IndusAgent: Reinforcing Open-Vocabulary Industrial Anomaly Detection with Agentic Tools
cs.CV 2026-05 unverdicted novelty 5.0

IndusAgent achieves state-of-the-art zero-shot performance on industrial anomaly benchmarks by using a custom Indus-CoT dataset, dynamic tool orchestration, and gated RL to optimize anomaly classification, localizatio...

Reference graph

Works this paper leans on

83 extracted references · 53 canonical work pages · cited by 2 Pith papers · 28 internal anchors

[1]

A survey on video anomaly detection via deep learning: Human, vehicle, and environment.arXiv preprint arXiv:2508.14203, 2025

Ghazal Alinezhad Noghre, Armin Danesh Pazho, and Hamed Tabkhi. A survey on video anomaly detection via deep learning: Human, vehicle, and environment.arXiv preprint arXiv:2508.14203, 2025

work page arXiv 2025
[2]

Qwen3-VL Technical Report

Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. Qwen3-vl technical report.arXiv preprint arXiv:2511.21631, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[3]

arXiv preprint arXiv:2503.06800 (2025) 3, 4

Hritik Bansal, Clark Peng, Yonatan Bitton, Roman Goldenberg, Aditya Grover, and Kai-Wei Chang. Videophy-2: A challenging action-centric physical commonsense evaluation in video generation.arXiv preprint arXiv:2503.06800, 2025

work page arXiv 2025
[4]

Lumiere: A space-time diffusion model for video generation

Omer Bar-Tal, Hila Chefer, Omer Tov, Charles Herrmann, Roni Paiss, Shiran Zada, Ariel Ephrat, Junhwa Hur, Guanghui Liu, Amit Raj, et al. Lumiere: A space-time diffusion model for video generation. InSIGGRAPH Asia 2024 Conference Papers, pages 1–11, 2024

2024
[5]

Video generation models as world simulators

Tim Brooks, Bill Peebles, Connor Holmes, Will DePue, Yufei Guo, Leo Jing, David Schnurr, Joe Taylor, Troy Luhman, Eric Luhman, et al. Video generation models as world simulators. OpenAI Blog, 1(8):1, 2024

2024
[6]

PhyGDPO: Physics-Aware Groupwise Direct Preference Optimization for Physically Consistent Text-to-Video Generation

Yuanhao Cai, Kunpeng Li, Menglin Jia, Jialiang Wang, Junzhe Sun, Feng Liang, Weifeng Chen, Felix Juefei-Xu, Chu Wang, Ali Thabet, et al. Phygdpo: Physics-aware groupwise direct preference optimization for physically consistent text-to-video generation.arXiv preprint arXiv:2512.24551, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[7]

CropVLM: Learning to Zoom for Fine-Grained Vision-Language Perception

Miguel Carvalho, Helder Dias, and Bruno Martins. Cropvlm: Learning to zoom for fine-grained vision-language perception, 2026. URLhttps://arxiv.org/abs/2511.19820

work page internal anchor Pith review Pith/arXiv arXiv 2026
[8]

Zeyang Chen, Mingnan Hu, and Bo Chen. Thermal image-guided complementary masking with multiscale fusion for multi-spectral image semantic segmentation.Engineering Applications of Artificial Intelligence, 150:110569, 2025

2025
[9]

Unsupervised multi-modal domain adaptation for rgb-t semantic segmentation.Computer Vision and Image Understanding, page 104573, 2025

Zeyang Chen, Chunyu Lin, Yao Zhao, and Tammam Tillo. Unsupervised multi-modal domain adaptation for rgb-t semantic segmentation.Computer Vision and Image Understanding, page 104573, 2025

2025
[10]

Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks

Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, et al. Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 24185–24198, 2024

2024
[11]

Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities.arXiv preprint arXiv:2507.06261, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[12]

Textcrafter: Accurately rendering multiple texts in complex visual scenes.arXiv preprint arXiv:2503.23461, 2025

Nikai Du, Zhennan Chen, Shan Gao, Zhizhou Chen, Xi Chen, Zhengkai Jiang, Jian Yang, and Ying Tai. Textcrafter: Accurately rendering multiple texts in complex visual scenes.arXiv preprint arXiv:2503.23461, 2025

work page arXiv 2025
[13]

Sculpting features from noise: Reward-guided hierarchical diffusion for task-optimal feature transformation.arXiv preprint arXiv:2505.15152, 2025

Nanxu Gong, Zijun Li, Sixun Dong, Haoyue Bai, Wangyang Ying, Xinyuan Wang, and Yanjie Fu. Sculpting features from noise: Reward-guided hierarchical diffusion for task-optimal feature transformation.arXiv preprint arXiv:2505.15152, 2025

work page arXiv 2025
[14]

Gemini-3-pro.https://deepmind.google/models/gemini/pro/, 2025

Google. Gemini-3-pro.https://deepmind.google/models/gemini/pro/, 2025

2025
[15]

RePer-360: Releasing Perspective Priors for 360$^\circ$ Depth Estimation via Self-Modulation

Cheng Guan, Chunyu Lin, Zhijie Shen, Junsong Zhang, and Jiyuan Wang. Reper-360: Re- leasing perspective priors for 360 ◦ depth estimation via self-modulation.arXiv preprint arXiv:2603.05999, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[16]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948, 2025. 10

work page internal anchor Pith review Pith/arXiv arXiv 2025
[17]

AnimateDiff: Animate Your Personalized Text-to-Image Diffusion Models without Specific Tuning

Yuwei Guo, Ceyuan Yang, Anyi Rao, Zhengyang Liang, Yaohui Wang, Yu Qiao, Maneesh Agrawala, Dahua Lin, and Bo Dai. Animatediff: Animate your personalized text-to-image diffusion models without specific tuning.arXiv preprint arXiv:2307.04725, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[18]

LTX-2: Efficient Joint Audio-Visual Foundation Model

Yoav HaCohen, Benny Brazowski, Nisan Chiprut, Yaki Bitterman, Andrew Kvochko, Avishai Berkowitz, Daniel Shalem, Daphna Lifschitz, Dudu Moshe, Eitan Porat, et al. Ltx-2: Efficient joint audio-visual foundation model.arXiv preprint arXiv:2601.03233, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[19]

Videoscore: Building automatic metrics to simulate fine-grained human feedback for video generation

Xuan He, Dongfu Jiang, Ge Zhang, Max Ku, Achint Soni, Sherman Siu, Haonan Chen, Abhranil Chandra, Ziyan Jiang, Aaran Arulraj, et al. Videoscore: Building automatic metrics to simulate fine-grained human feedback for video generation. InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 2105–2123, 2024

2024
[20]

VideoScore2: Think before you score in generative video evaluation.arXiv preprint arXiv:2509.22799, 2025

Xuan He, Dongfu Jiang, Ping Nie, Minghao Liu, Zhengxuan Jiang, Mingyi Su, Wentao Ma, Junru Lin, Chun Ye, Yi Lu, et al. Videoscore2: Think before you score in generative video evaluation.arXiv preprint arXiv:2509.22799, 2025

work page arXiv 2025
[21]

EVE: Verifiable Self-Evolution of MLLMs via Executable Visual Transformations

Yongrui Heng, Chaoya Jiang, Han Yang, Shikun Zhang, and Wei Ye. Eve: Verifiable self- evolution of mllms via executable visual transformations.arXiv preprint arXiv:2604.18320, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[22]

Streamingt2v: Consistent, dynamic, and extendable long video generation from text

Roberto Henschel, Levon Khachatryan, Hayk Poghosyan, Daniil Hayrapetyan, Vahram Tade- vosyan, Zhangyang Wang, Shant Navasardyan, and Humphrey Shi. Streamingt2v: Consistent, dynamic, and extendable long video generation from text. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 2568–2577, 2025

2025
[23]

Vbench: Comprehensive benchmark suite for video generative models

Ziqi Huang, Yinan He, Jiashuo Yu, Fan Zhang, Chenyang Si, Yuming Jiang, Yuanhan Zhang, Tianxing Wu, Qingyang Jin, Nattapol Chanpaisit, et al. Vbench: Comprehensive benchmark suite for video generative models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21807–21818, 2024

2024
[24]

GPT-4o System Card

Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card.arXiv preprint arXiv:2410.21276, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[25]

Vlm-r3: Region recognition, reasoning, and refinement for enhanced multimodal chain-of-thought.Advances in Neural Information Processing Systems, 38:63841–63869, 2026

Chaoya Jiang, Yongrui Heng, Wei Ye, Haiyang Xu, Ming Yan, Ji Zhang, Fei Huang, and Shikun Zhang. Vlm-r3: Region recognition, reasoning, and refinement for enhanced multimodal chain-of-thought.Advances in Neural Information Processing Systems, 38:63841–63869, 2026

2026
[26]

Genai arena: An open evaluation platform for generative models.Advances in Neural Information Processing Systems, 37:79889–79908, 2024

Dongfu Jiang, Max Ku, Tianle Li, Yuansheng Ni, Shizhuo Sun, Rongqi Fan, and Wenhu Chen. Genai arena: An open evaluation platform for generative models.Advances in Neural Information Processing Systems, 37:79889–79908, 2024

2024
[27]

Pick-a-pic: An open dataset of user preferences for text-to-image generation.Advances in neural information processing systems, 36:36652–36663, 2023

Yuval Kirstain, Adam Polyak, Uriel Singer, Shahbuland Matiana, Joe Penna, and Omer Levy. Pick-a-pic: An open dataset of user preferences for text-to-image generation.Advances in neural information processing systems, 36:36652–36663, 2023

2023
[28]

HunyuanVideo: A Systematic Framework For Large Video Generative Models

Weijie Kong, Qi Tian, Zijian Zhang, Rox Min, Zuozhuo Dai, Jin Zhou, Jiangfeng Xiong, Xin Li, Bo Wu, Jianwei Zhang, et al. Hunyuanvideo: A systematic framework for large video generative models.arXiv preprint arXiv:2412.03603, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[29]

Subjective-aligned dataset and metric for text-to-video quality assessment

Tengchuan Kou, Xiaohong Liu, Zicheng Zhang, Chunyi Li, Haoning Wu, Xiongkuo Min, Guangtao Zhai, and Ning Liu. Subjective-aligned dataset and metric for text-to-video quality assessment. InProceedings of the 32nd ACM International Conference on Multimedia, pages 7793–7802, 2024

2024
[30]

Skyra: AI-Generated Video Detection via Grounded Artifact Reasoning

Yifei Li, Wenzhao Zheng, Yanran Zhang, Runze Sun, Yu Zheng, Lei Chen, Jie Zhou, and Jiwen Lu. Skyra: Ai-generated video detection via grounded artifact reasoning.arXiv preprint arXiv:2512.15693, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[31]

Diffpcn: Latent diffusion model based on multi-view depth images for point cloud completion.arXiv preprint arXiv:2509.23723, 2025

Zijun Li, Hongyu Yan, Shijie Li, Kunming Luo, Li Lu, Xulei Yang, and Weisi Lin. Diffpcn: Latent diffusion model based on multi-view depth images for point cloud completion.arXiv preprint arXiv:2509.23723, 2025. 11

work page arXiv 2025
[32]

Drl4aoi: A drl framework for semantic-aware aoi segmentation in location-based services.arXiv preprint arXiv:2412.05437, 2024

Youfang Lin, Jinji Fu, Haomin Wen, Jiyuan Wang, Zhenjie Wei, Yuting Qiang, Xiaowei Mao, Lixia Wu, Haoyuan Hu, Yuxuan Liang, et al. Drl4aoi: A drl framework for semantic-aware aoi segmentation in location-based services.arXiv preprint arXiv:2412.05437, 2024

work page arXiv 2024
[33]

Flow-GRPO: Training Flow Matching Models via Online RL

Jie Liu, Gongye Liu, Jiajun Liang, Yangguang Li, Jiaheng Liu, Xintao Wang, Pengfei Wan, Di Zhang, and Wanli Ouyang. Flow-grpo: Training flow matching models via online rl.arXiv preprint arXiv:2505.05470, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[34]

Improving Video Generation with Human Feedback

Jie Liu, Gongye Liu, Jiajun Liang, Ziyang Yuan, Xiaokun Liu, Mingwu Zheng, Xiele Wu, Qiulin Wang, Menghan Xia, Xintao Wang, et al. Improving video generation with human feedback.arXiv preprint arXiv:2501.13918, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[35]

Dynamic typography: Bringing text to life via video diffusion prior

Zichen Liu, Yihao Meng, Hao Ouyang, Yue Yu, Bolin Zhao, Daniel Cohen-Or, and Huamin Qu. Dynamic typography: Bringing text to life via video diffusion prior. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 14787–14797, 2025

2025
[36]

Ovi: Twin Backbone Cross-Modal Fusion for Audio-Video Generation

Chetwin Low, Weimin Wang, and Calder Katyal. Ovi: Twin backbone cross-modal fusion for audio-video generation.arXiv preprint arXiv:2510.01284, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[37]

Panosent: A panoptic sextuple extraction benchmark for multimodal conversational aspect-based sentiment analysis

Meng Luo, Hao Fei, Bobo Li, Shengqiong Wu, Qian Liu, Soujanya Poria, Erik Cambria, Mong- Li Lee, and Wynne Hsu. Panosent: A panoptic sextuple extraction benchmark for multimodal conversational aspect-based sentiment analysis. InProceedings of the 32nd ACM International Conference on Multimedia, pages 7667–7676, 2024

2024
[38]

Meng Luo, Shengqiong Wu, Liqiang Jing, Tianjie Ju, Li Zheng, Jinxiang Lai, Tianlong Wu, Xinya Du, Jian Li, Siyuan Yan, et al. Dr. v: A hierarchical perception-temporal-cognition framework to diagnose video hallucination by fine-grained spatial-temporal grounding.arXiv preprint arXiv:2509.11866, 2025

work page arXiv 2025
[39]

Unveiling the cognitive compass: Theory-of-mind-guided multimodal emotion reasoning, 2026

Meng Luo, Bobo Li, Shanqing Xu, Shize Zhang, Qiuchan Chen, Menglu Han, Wenhao Chen, Yanxiang Huang, Hao Fei, Mong-Li Lee, and Wynne Hsu. Unveiling the cognitive compass: Theory-of-mind-guided multimodal emotion reasoning, 2026. URL https://arxiv.org/ abs/2602.00971

work page arXiv 2026
[40]

Hpsv3: Towards wide-spectrum hu- man preference score

Yuhang Ma, Xiaoshi Wu, Keqiang Sun, and Hongsheng Li. Hpsv3: Towards wide-spectrum hu- man preference score. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 15086–15095, 2025

2025
[41]

Skywork r1v: Pioneering multimodal reasoning with chain-of-thought

Yi Peng, Peiyu Wang, Xiaokun Wang, Yichen Wei, Jiangbo Pei, Weijie Qiu, Ai Jian, Yunzhuo Hao, Jiachun Pan, Tianyidan Xie, et al. Skywork r1v: Pioneering multimodal reasoning with chain-of-thought.arXiv preprint arXiv:2504.05599, 2025

work page arXiv 2025
[42]

Ties in paired-comparison experiments: A general- ization of the bradley-terry model.Journal of the American Statistical Association, 62(317): 194–204, 1967

Pejaver V Rao and Lawrence L Kupper. Ties in paired-comparison experiments: A general- ization of the bradley-terry model.Journal of the American Statistical Association, 62(317): 194–204, 1967

1967
[43]

Consisti2v: Enhancing visual consistency for image-to-video generation.arXiv preprint arXiv:2402.04324, 2024

Weiming Ren, Huan Yang, Ge Zhang, Cong Wei, Xinrun Du, Wenhao Huang, and Wenhu Chen. Consisti2v: Enhancing visual consistency for image-to-video generation.arXiv preprint arXiv:2402.04324, 2024

work page arXiv 2024
[44]

Proximal Policy Optimization Algorithms

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms.arXiv preprint arXiv:1707.06347, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017
[45]

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models.arXiv preprint arXiv:2402.03300, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[46]

LookWise: Knowing When and Where to Look for Fine-Grained Visual Reasoning in Multimodal Large Language Models

Yuxiang Shen, Hailong Huang, Zhenkun Gao, Xueheng Li, Man Zhou, Chengjun Xie, Haoxuan Che, Xuanhua He, and Jie Zhang. Svfeye: A semantic-visual fusion framework with multi-scale visual context for multimodal reasoning, 2026. URL https://arxiv.org/abs/2603.00171

work page internal anchor Pith review Pith/arXiv arXiv 2026
[47]

Videoveritas: Ai-generated video detection via perception pretext reinforcement learning.arXiv preprint arXiv:2602.08828, 2026

Hao Tan, Jun Lan, Senyuan Shi, Zichang Tan, Zijian Yu, Huijia Zhu, Weiqiang Wang, Jun Wan, and Zhen Lei. Videoveritas: Ai-generated video detection via perception pretext reinforcement learning.arXiv preprint arXiv:2602.08828, 2026. 12

work page arXiv 2026
[48]

arXiv:2503.20752 , year=

Huajie Tan, Yuheng Ji, Xiaoshuai Hao, Xiansheng Chen, Pengwei Wang, Zhongyuan Wang, and Shanghang Zhang. Reason-rft: Reinforcement fine-tuning for visual reasoning of vision language models.arXiv preprint arXiv:2503.20752, 2025

work page arXiv 2025
[49]

Kimi K2: Open Agentic Intelligence

Kimi Team, Yifan Bai, Yiping Bao, Y Charles, Cheng Chen, Guanduo Chen, Haiting Chen, Huarong Chen, Jiahao Chen, Ningxin Chen, et al. Kimi k2: Open agentic intelligence.arXiv preprint arXiv:2507.20534, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[50]

Kling-Omni Technical Report

Kling Team, Jialu Chen, Yuanzheng Ci, Xiangyu Du, Zipeng Feng, Kun Gai, Sainan Guo, Feng Han, Jingbin He, Kang He, et al. Kling-omni technical report.arXiv preprint arXiv:2512.16776, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[51]

Bridging Time and Space: Decoupled Spatio-Temporal Alignment for Video Grounding

Xuezhen Tu, Jingyu Wu, Fangyu Kang, Qingpeng Nong, Kaijin Zhang, Chaoyue Niu, and Fan Wu. Bridging time and space: Decoupled spatio-temporal alignment for video grounding, 2026. URLhttps://arxiv.org/abs/2604.08014

work page internal anchor Pith review Pith/arXiv arXiv 2026
[52]

Wan: Open and Advanced Large-Scale Video Generative Models

Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, Jianyuan Zeng, Jiayu Wang, Jingfeng Zhang, Jingren Zhou, Jinkai Wang, Jixuan Chen, Kai Zhu, Kang Zhao, Keyu Yan, Lianghua Huang, Mengyang Feng, Ningyi Zhang, Pandeng Li, Pingyu Wu, Ruihang Chu, Ruili Feng, Shiwei Zhang, Siyang Sun, Tao Fang, T...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[53]

Learning to generate stylized handwritten text via a unified representation of style, content, and noise

Honglie Wang, Yan-Ming Zhang, Wangzi Yao, Fei Yin, and Cheng-Lin Liu. Learning to generate stylized handwritten text via a unified representation of style, content, and noise. In The Fourteenth International Conference on Learning Representations
[54]

Spacevllm: Endowing multimodal large language model with spatio-temporal video grounding capability, 2025

Jiankang Wang, Zhihan Zhang, Zhihang Liu, Yang Li, Jiannan Ge, Hongtao Xie, and Yongdong Zhang. Spacevllm: Endowing multimodal large language model with spatio-temporal video grounding capability, 2025. URLhttps://arxiv.org/abs/2503.13983

work page arXiv 2025
[55]

TAGRPO: Boosting GRPO on Image-to-Video Generation with Direct Trajectory Alignment

Jin Wang, Jianxiang Lu, Guangzheng Xu, Comi Chen, Haoyu Yang, Linqing Wang, Peng Chen, Mingtao Chen, Zhichao Hu, Longhuang Wu, et al. Tagrpo: Boosting grpo on image-to-video generation with direct trajectory alignment.arXiv preprint arXiv:2601.05729, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[56]

Weath- erdepth: Curriculum contrastive learning for self-supervised depth estimation under adverse weather conditions

Jiyuan Wang, Chunyu Lin, Lang Nie, Shujun Huang, Yao Zhao, Xing Pan, and Rui Ai. Weath- erdepth: Curriculum contrastive learning for self-supervised depth estimation under adverse weather conditions. In2024 IEEE International Conference on Robotics and Automation (ICRA), pages 4976–4982. IEEE, 2024

2024
[57]

Digging into contrastive learning for robust depth estimation with diffusion models

Jiyuan Wang, Chunyu Lin, Lang Nie, Kang Liao, Shuwei Shao, and Yao Zhao. Digging into contrastive learning for robust depth estimation with diffusion models. InProceedings of the 32nd ACM International Conference on Multimedia, pages 4129–4137. ACM, 2024

2024
[58]

Jasmine: Harnessing diffusion prior for self-supervised depth estimation.arXiv preprint arXiv:2503.15905, 2025

Jiyuan Wang, Chunyu Lin, Cheng Guan, Lang Nie, Jing He, Haodong Li, Kang Liao, and Yao Zhao. Jasmine: Harnessing diffusion prior for self-supervised depth estimation.arXiv preprint arXiv:2503.15905, 2025

work page arXiv 2025
[59]

From editor to dense geometry estimator.arXiv preprint arXiv:2509.04338, 2025

JiYuan Wang, Chunyu Lin, Lei Sun, Rongying Liu, Lang Nie, Mingxing Li, Kang Liao, Xiangxiang Chu, and Yao Zhao. From editor to dense geometry estimator.arXiv preprint arXiv:2509.04338, 2025

work page arXiv 2025
[60]

Edit in 2D, Verify in 3D: Reinforcement Learning for Multi-view Consistent Scene Editing

Jiyuan Wang, Chunyu Lin, Lei Sun, Zhi Cao, Yuyang Yin, Lang Nie, Zhenlong Yuan, Xi- angxiang Chu, Yunchao Wei, Kang Liao, et al. Geometry-guided reinforcement learning for multi-view consistent 3d scene editing.arXiv preprint arXiv:2603.03143, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[61]

Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, et al. Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution.arXiv preprint arXiv:2409.12191, 2024. 13

work page internal anchor Pith review Pith/arXiv arXiv 2024
[62]

arXiv preprint arXiv:2510.10518 , year=

Qunzhong Wang, Jie Liu, Jiajun Liang, Yilei Jiang, Yuanxing Zhang, Jinyuan Chen, Yaozhi Zheng, Xintao Wang, Pengfei Wan, Xiangyu Yue, et al. Vr-thinker: Boosting video reward models through thinking-with-image reasoning.arXiv preprint arXiv:2510.10518, 2025

work page arXiv 2025
[63]

Lift: Leveraging human feedback for text-to-video model alignment.arXiv preprint arXiv:2412.04814, 2024

Yibin Wang, Zhiyu Tan, Junyan Wang, Xiaomeng Yang, Cheng Jin, and Hao Li. Lift: Leveraging human feedback for text-to-video model alignment.arXiv preprint arXiv:2412.04814, 2024

work page arXiv 2024
[64]

Unified multimodal chain-of-thought reward model through reinforcement fine-tuning.arXiv preprint arXiv:2505.03318, 2025

Yibin Wang, Zhimin Li, Yuhang Zang, Chunyu Wang, Qinglin Lu, Cheng Jin, and Jiaqi Wang. Unified multimodal chain-of-thought reward model through reinforcement fine-tuning.arXiv preprint arXiv:2505.03318, 2025

work page arXiv 2025
[65]

Unified Reward Model for Multimodal Understanding and Generation

Yibin Wang, Yuhang Zang, Hao Li, Cheng Jin, and Jiaqi Wang. Unified reward model for multimodal understanding and generation.arXiv preprint arXiv:2503.05236, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[66]

Thinking with frames: Generative video distortion evaluation via frame reward model.arXiv preprint arXiv:2601.04033, 2026

Yuan Wang, Borui Liao, Huijuan Huang, Jinda Lu, Ouxiang Li, Kuien Liu, Meng Wang, and Xiang Wang. Thinking with frames: Generative video distortion evaluation via frame reward model.arXiv preprint arXiv:2601.04033, 2026

work page arXiv 2026
[67]

Tune-a-video: One-shot tuning of image diffusion models for text-to-video generation

Jay Zhangjie Wu, Yixiao Ge, Xintao Wang, Stan Weixian Lei, Yuchao Gu, Yufei Shi, Wynne Hsu, Ying Shan, Xiaohu Qie, and Mike Zheng Shou. Tune-a-video: One-shot tuning of image diffusion models for text-to-video generation. InProceedings of the IEEE/CVF international conference on computer vision, pages 7623–7633, 2023

2023
[68]

RewardDance: Reward scaling in visual generation.arXiv preprint arXiv:2509.08826, 2025

Jie Wu, Yu Gao, Zilyu Ye, Ming Li, Liang Li, Hanzhong Guo, Jie Liu, Zeyue Xue, Xiaoxia Hou, Wei Liu, et al. Rewarddance: Reward scaling in visual generation.arXiv preprint arXiv:2509.08826, 2025

work page arXiv 2025
[69]

Phyprompt: Rl-based prompt refinement for physically plausible text-to-video generation.arXiv preprint arXiv:2603.03505, 2026

Shang Wu, Chenwei Xu, Zhuofan Xia, Weijian Li, Lie Lu, Pranav Maneriker, Fan Du, Manling Li, and Han Liu. Phyprompt: Rl-based prompt refinement for physically plausible text-to-video generation.arXiv preprint arXiv:2603.03505, 2026

work page arXiv 2026
[70]

arXiv preprint arXiv:2505.14460 (2025)

Tianhe Wu, Jian Zou, Jie Liang, Lei Zhang, and Kede Ma. Visualquality-r1: Reasoning-induced image quality assessment via reinforcement learning to rank.arXiv preprint arXiv:2505.14460, 2025

work page arXiv 2025
[71]

Imagereward: Learning and evaluating human preferences for text-to-image generation

Jiazheng Xu, Xiao Liu, Yuchen Wu, Yuxuan Tong, Qinkai Li, Ming Ding, Jie Tang, and Yuxiao Dong. Imagereward: Learning and evaluating human preferences for text-to-image generation. Advances in Neural Information Processing Systems, 36:15903–15935, 2023

2023
[72]

VisionReward: Fine-Grained Multi-Dimensional Human Preference Learning for Image and Video Generation

Jiazheng Xu, Yu Huang, Jiale Cheng, Yuanming Yang, Jiajun Xu, Yuan Wang, Wenbo Duan, Shen Yang, Qunlin Jin, Shurun Li, et al. Visionreward: Fine-grained multi-dimensional human preference learning for image and video generation.arXiv preprint arXiv:2412.21059, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[73]

DanceGRPO: Unleashing GRPO on Visual Generation

Zeyue Xue, Jie Wu, Yu Gao, Fangyuan Kong, Lingting Zhu, Mengzhao Chen, Zhiheng Liu, Wei Liu, Qiushan Guo, Weilin Huang, et al. Dancegrpo: Unleashing grpo on visual generation. arXiv preprint arXiv:2505.07818, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[74]

Reason: Reinforced causal search with information bottleneck for video understanding, 2025

Yuan Zhou, Litao Hua, Shilong Jin, Wentao Huang, and Haoran Duan. Reason: Reinforced causal search with information bottleneck for video understanding, 2025. URL https:// arxiv.org/abs/2511.12530

work page arXiv 2025
[75]

arXiv preprint arXiv:2510.27280 (2025) 4

Zirui Zhu, Hailun Xu, Yang Luo, Yong Liu, Kanchan Sarkar, Zhenheng Yang, and Yang You. Focus: Efficient keyframe selection for long video understanding, 2025. URL https: //arxiv.org/abs/2510.27280. 14 Appendix In this appendix, we provide additional implementation details, experimental results, analyses, and discussions for a comprehensive evaluation and ...

work page arXiv 2025
[76]

no visible person

Human analysis(if there is no visible person, you must explicitly write “no visible person”), and the analysis must explicitly state whether an anomaly exists. Clothing: precisely describe the clothing type, color, clarity of patterns or textures, and all accessories. Posture: describe the overall pose and specifically describe the positions of the limbs....
[77]

Object identification: list all visible objects

Object analysis, and the analysis must explicitly state whether an anomaly exists. Object identification: list all visible objects. Object attributes: describe the shape, color, and texture clarity of each object, and evaluate whether its size proportion is reasonable. Spatial relationship: state the positions between objects and people, and between objec...
[78]

no anomaly

Environment analysis, and the analysis must explicitly state whether an anomaly exists. Scene: clearly specify the background type. Lighting: judge the direction of the main light source, and describe the position, length, and softness or hardness of shadows. Environment details: check whether the background contains blur, repeated textures, geometric dis...
[79]

The error type is exactly the same: the error category used and described by the AI model must be completely consistent with the human annotation
[80]

the erroneous object identified by the AI model must refer to the same entity or the same region as the human annotation

The erroneous object is essentially the same: the erroneous object identified by the AI model must refer to the same entity as the human annotation. the erroneous object identified by the AI model must refer to the same entity or the same region as the human annotation

Showing first 80 references.

[1] [1]

A survey on video anomaly detection via deep learning: Human, vehicle, and environment.arXiv preprint arXiv:2508.14203, 2025

Ghazal Alinezhad Noghre, Armin Danesh Pazho, and Hamed Tabkhi. A survey on video anomaly detection via deep learning: Human, vehicle, and environment.arXiv preprint arXiv:2508.14203, 2025

work page arXiv 2025

[2] [2]

Qwen3-VL Technical Report

Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. Qwen3-vl technical report.arXiv preprint arXiv:2511.21631, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[3] [3]

arXiv preprint arXiv:2503.06800 (2025) 3, 4

Hritik Bansal, Clark Peng, Yonatan Bitton, Roman Goldenberg, Aditya Grover, and Kai-Wei Chang. Videophy-2: A challenging action-centric physical commonsense evaluation in video generation.arXiv preprint arXiv:2503.06800, 2025

work page arXiv 2025

[4] [4]

Lumiere: A space-time diffusion model for video generation

Omer Bar-Tal, Hila Chefer, Omer Tov, Charles Herrmann, Roni Paiss, Shiran Zada, Ariel Ephrat, Junhwa Hur, Guanghui Liu, Amit Raj, et al. Lumiere: A space-time diffusion model for video generation. InSIGGRAPH Asia 2024 Conference Papers, pages 1–11, 2024

2024

[5] [5]

Video generation models as world simulators

Tim Brooks, Bill Peebles, Connor Holmes, Will DePue, Yufei Guo, Leo Jing, David Schnurr, Joe Taylor, Troy Luhman, Eric Luhman, et al. Video generation models as world simulators. OpenAI Blog, 1(8):1, 2024

2024

[6] [6]

PhyGDPO: Physics-Aware Groupwise Direct Preference Optimization for Physically Consistent Text-to-Video Generation

Yuanhao Cai, Kunpeng Li, Menglin Jia, Jialiang Wang, Junzhe Sun, Feng Liang, Weifeng Chen, Felix Juefei-Xu, Chu Wang, Ali Thabet, et al. Phygdpo: Physics-aware groupwise direct preference optimization for physically consistent text-to-video generation.arXiv preprint arXiv:2512.24551, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[7] [7]

CropVLM: Learning to Zoom for Fine-Grained Vision-Language Perception

Miguel Carvalho, Helder Dias, and Bruno Martins. Cropvlm: Learning to zoom for fine-grained vision-language perception, 2026. URLhttps://arxiv.org/abs/2511.19820

work page internal anchor Pith review Pith/arXiv arXiv 2026

[8] [8]

Zeyang Chen, Mingnan Hu, and Bo Chen. Thermal image-guided complementary masking with multiscale fusion for multi-spectral image semantic segmentation.Engineering Applications of Artificial Intelligence, 150:110569, 2025

2025

[9] [9]

Unsupervised multi-modal domain adaptation for rgb-t semantic segmentation.Computer Vision and Image Understanding, page 104573, 2025

Zeyang Chen, Chunyu Lin, Yao Zhao, and Tammam Tillo. Unsupervised multi-modal domain adaptation for rgb-t semantic segmentation.Computer Vision and Image Understanding, page 104573, 2025

2025

[10] [10]

Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks

Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, et al. Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 24185–24198, 2024

2024

[11] [11]

Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities.arXiv preprint arXiv:2507.06261, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[12] [12]

Textcrafter: Accurately rendering multiple texts in complex visual scenes.arXiv preprint arXiv:2503.23461, 2025

Nikai Du, Zhennan Chen, Shan Gao, Zhizhou Chen, Xi Chen, Zhengkai Jiang, Jian Yang, and Ying Tai. Textcrafter: Accurately rendering multiple texts in complex visual scenes.arXiv preprint arXiv:2503.23461, 2025

work page arXiv 2025

[13] [13]

Sculpting features from noise: Reward-guided hierarchical diffusion for task-optimal feature transformation.arXiv preprint arXiv:2505.15152, 2025

Nanxu Gong, Zijun Li, Sixun Dong, Haoyue Bai, Wangyang Ying, Xinyuan Wang, and Yanjie Fu. Sculpting features from noise: Reward-guided hierarchical diffusion for task-optimal feature transformation.arXiv preprint arXiv:2505.15152, 2025

work page arXiv 2025

[14] [14]

Gemini-3-pro.https://deepmind.google/models/gemini/pro/, 2025

Google. Gemini-3-pro.https://deepmind.google/models/gemini/pro/, 2025

2025

[15] [15]

RePer-360: Releasing Perspective Priors for 360$^\circ$ Depth Estimation via Self-Modulation

Cheng Guan, Chunyu Lin, Zhijie Shen, Junsong Zhang, and Jiyuan Wang. Reper-360: Re- leasing perspective priors for 360 ◦ depth estimation via self-modulation.arXiv preprint arXiv:2603.05999, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[16] [16]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948, 2025. 10

work page internal anchor Pith review Pith/arXiv arXiv 2025

[17] [17]

AnimateDiff: Animate Your Personalized Text-to-Image Diffusion Models without Specific Tuning

Yuwei Guo, Ceyuan Yang, Anyi Rao, Zhengyang Liang, Yaohui Wang, Yu Qiao, Maneesh Agrawala, Dahua Lin, and Bo Dai. Animatediff: Animate your personalized text-to-image diffusion models without specific tuning.arXiv preprint arXiv:2307.04725, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[18] [18]

LTX-2: Efficient Joint Audio-Visual Foundation Model

Yoav HaCohen, Benny Brazowski, Nisan Chiprut, Yaki Bitterman, Andrew Kvochko, Avishai Berkowitz, Daniel Shalem, Daphna Lifschitz, Dudu Moshe, Eitan Porat, et al. Ltx-2: Efficient joint audio-visual foundation model.arXiv preprint arXiv:2601.03233, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[19] [19]

Videoscore: Building automatic metrics to simulate fine-grained human feedback for video generation

Xuan He, Dongfu Jiang, Ge Zhang, Max Ku, Achint Soni, Sherman Siu, Haonan Chen, Abhranil Chandra, Ziyan Jiang, Aaran Arulraj, et al. Videoscore: Building automatic metrics to simulate fine-grained human feedback for video generation. InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 2105–2123, 2024

2024

[20] [20]

VideoScore2: Think before you score in generative video evaluation.arXiv preprint arXiv:2509.22799, 2025

Xuan He, Dongfu Jiang, Ping Nie, Minghao Liu, Zhengxuan Jiang, Mingyi Su, Wentao Ma, Junru Lin, Chun Ye, Yi Lu, et al. Videoscore2: Think before you score in generative video evaluation.arXiv preprint arXiv:2509.22799, 2025

work page arXiv 2025

[21] [21]

EVE: Verifiable Self-Evolution of MLLMs via Executable Visual Transformations

Yongrui Heng, Chaoya Jiang, Han Yang, Shikun Zhang, and Wei Ye. Eve: Verifiable self- evolution of mllms via executable visual transformations.arXiv preprint arXiv:2604.18320, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[22] [22]

Streamingt2v: Consistent, dynamic, and extendable long video generation from text

Roberto Henschel, Levon Khachatryan, Hayk Poghosyan, Daniil Hayrapetyan, Vahram Tade- vosyan, Zhangyang Wang, Shant Navasardyan, and Humphrey Shi. Streamingt2v: Consistent, dynamic, and extendable long video generation from text. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 2568–2577, 2025

2025

[23] [23]

Vbench: Comprehensive benchmark suite for video generative models

Ziqi Huang, Yinan He, Jiashuo Yu, Fan Zhang, Chenyang Si, Yuming Jiang, Yuanhan Zhang, Tianxing Wu, Qingyang Jin, Nattapol Chanpaisit, et al. Vbench: Comprehensive benchmark suite for video generative models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21807–21818, 2024

2024

[24] [24]

GPT-4o System Card

Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card.arXiv preprint arXiv:2410.21276, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[25] [25]

Vlm-r3: Region recognition, reasoning, and refinement for enhanced multimodal chain-of-thought.Advances in Neural Information Processing Systems, 38:63841–63869, 2026

Chaoya Jiang, Yongrui Heng, Wei Ye, Haiyang Xu, Ming Yan, Ji Zhang, Fei Huang, and Shikun Zhang. Vlm-r3: Region recognition, reasoning, and refinement for enhanced multimodal chain-of-thought.Advances in Neural Information Processing Systems, 38:63841–63869, 2026

2026

[26] [26]

Genai arena: An open evaluation platform for generative models.Advances in Neural Information Processing Systems, 37:79889–79908, 2024

Dongfu Jiang, Max Ku, Tianle Li, Yuansheng Ni, Shizhuo Sun, Rongqi Fan, and Wenhu Chen. Genai arena: An open evaluation platform for generative models.Advances in Neural Information Processing Systems, 37:79889–79908, 2024

2024

[27] [27]

Pick-a-pic: An open dataset of user preferences for text-to-image generation.Advances in neural information processing systems, 36:36652–36663, 2023

Yuval Kirstain, Adam Polyak, Uriel Singer, Shahbuland Matiana, Joe Penna, and Omer Levy. Pick-a-pic: An open dataset of user preferences for text-to-image generation.Advances in neural information processing systems, 36:36652–36663, 2023

2023

[28] [28]

HunyuanVideo: A Systematic Framework For Large Video Generative Models

Weijie Kong, Qi Tian, Zijian Zhang, Rox Min, Zuozhuo Dai, Jin Zhou, Jiangfeng Xiong, Xin Li, Bo Wu, Jianwei Zhang, et al. Hunyuanvideo: A systematic framework for large video generative models.arXiv preprint arXiv:2412.03603, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[29] [29]

Subjective-aligned dataset and metric for text-to-video quality assessment

Tengchuan Kou, Xiaohong Liu, Zicheng Zhang, Chunyi Li, Haoning Wu, Xiongkuo Min, Guangtao Zhai, and Ning Liu. Subjective-aligned dataset and metric for text-to-video quality assessment. InProceedings of the 32nd ACM International Conference on Multimedia, pages 7793–7802, 2024

2024

[30] [30]

Skyra: AI-Generated Video Detection via Grounded Artifact Reasoning

Yifei Li, Wenzhao Zheng, Yanran Zhang, Runze Sun, Yu Zheng, Lei Chen, Jie Zhou, and Jiwen Lu. Skyra: Ai-generated video detection via grounded artifact reasoning.arXiv preprint arXiv:2512.15693, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[31] [31]

Diffpcn: Latent diffusion model based on multi-view depth images for point cloud completion.arXiv preprint arXiv:2509.23723, 2025

Zijun Li, Hongyu Yan, Shijie Li, Kunming Luo, Li Lu, Xulei Yang, and Weisi Lin. Diffpcn: Latent diffusion model based on multi-view depth images for point cloud completion.arXiv preprint arXiv:2509.23723, 2025. 11

work page arXiv 2025

[32] [32]

Drl4aoi: A drl framework for semantic-aware aoi segmentation in location-based services.arXiv preprint arXiv:2412.05437, 2024

Youfang Lin, Jinji Fu, Haomin Wen, Jiyuan Wang, Zhenjie Wei, Yuting Qiang, Xiaowei Mao, Lixia Wu, Haoyuan Hu, Yuxuan Liang, et al. Drl4aoi: A drl framework for semantic-aware aoi segmentation in location-based services.arXiv preprint arXiv:2412.05437, 2024

work page arXiv 2024

[33] [33]

Flow-GRPO: Training Flow Matching Models via Online RL

Jie Liu, Gongye Liu, Jiajun Liang, Yangguang Li, Jiaheng Liu, Xintao Wang, Pengfei Wan, Di Zhang, and Wanli Ouyang. Flow-grpo: Training flow matching models via online rl.arXiv preprint arXiv:2505.05470, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[34] [34]

Improving Video Generation with Human Feedback

Jie Liu, Gongye Liu, Jiajun Liang, Ziyang Yuan, Xiaokun Liu, Mingwu Zheng, Xiele Wu, Qiulin Wang, Menghan Xia, Xintao Wang, et al. Improving video generation with human feedback.arXiv preprint arXiv:2501.13918, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[35] [35]

Dynamic typography: Bringing text to life via video diffusion prior

Zichen Liu, Yihao Meng, Hao Ouyang, Yue Yu, Bolin Zhao, Daniel Cohen-Or, and Huamin Qu. Dynamic typography: Bringing text to life via video diffusion prior. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 14787–14797, 2025

2025

[36] [36]

Ovi: Twin Backbone Cross-Modal Fusion for Audio-Video Generation

Chetwin Low, Weimin Wang, and Calder Katyal. Ovi: Twin backbone cross-modal fusion for audio-video generation.arXiv preprint arXiv:2510.01284, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[37] [37]

Panosent: A panoptic sextuple extraction benchmark for multimodal conversational aspect-based sentiment analysis

Meng Luo, Hao Fei, Bobo Li, Shengqiong Wu, Qian Liu, Soujanya Poria, Erik Cambria, Mong- Li Lee, and Wynne Hsu. Panosent: A panoptic sextuple extraction benchmark for multimodal conversational aspect-based sentiment analysis. InProceedings of the 32nd ACM International Conference on Multimedia, pages 7667–7676, 2024

2024

[38] [38]

Meng Luo, Shengqiong Wu, Liqiang Jing, Tianjie Ju, Li Zheng, Jinxiang Lai, Tianlong Wu, Xinya Du, Jian Li, Siyuan Yan, et al. Dr. v: A hierarchical perception-temporal-cognition framework to diagnose video hallucination by fine-grained spatial-temporal grounding.arXiv preprint arXiv:2509.11866, 2025

work page arXiv 2025

[39] [39]

Unveiling the cognitive compass: Theory-of-mind-guided multimodal emotion reasoning, 2026

Meng Luo, Bobo Li, Shanqing Xu, Shize Zhang, Qiuchan Chen, Menglu Han, Wenhao Chen, Yanxiang Huang, Hao Fei, Mong-Li Lee, and Wynne Hsu. Unveiling the cognitive compass: Theory-of-mind-guided multimodal emotion reasoning, 2026. URL https://arxiv.org/ abs/2602.00971

work page arXiv 2026

[40] [40]

Hpsv3: Towards wide-spectrum hu- man preference score

Yuhang Ma, Xiaoshi Wu, Keqiang Sun, and Hongsheng Li. Hpsv3: Towards wide-spectrum hu- man preference score. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 15086–15095, 2025

2025

[41] [41]

Skywork r1v: Pioneering multimodal reasoning with chain-of-thought

Yi Peng, Peiyu Wang, Xiaokun Wang, Yichen Wei, Jiangbo Pei, Weijie Qiu, Ai Jian, Yunzhuo Hao, Jiachun Pan, Tianyidan Xie, et al. Skywork r1v: Pioneering multimodal reasoning with chain-of-thought.arXiv preprint arXiv:2504.05599, 2025

work page arXiv 2025

[42] [42]

Ties in paired-comparison experiments: A general- ization of the bradley-terry model.Journal of the American Statistical Association, 62(317): 194–204, 1967

Pejaver V Rao and Lawrence L Kupper. Ties in paired-comparison experiments: A general- ization of the bradley-terry model.Journal of the American Statistical Association, 62(317): 194–204, 1967

1967

[43] [43]

Consisti2v: Enhancing visual consistency for image-to-video generation.arXiv preprint arXiv:2402.04324, 2024

Weiming Ren, Huan Yang, Ge Zhang, Cong Wei, Xinrun Du, Wenhao Huang, and Wenhu Chen. Consisti2v: Enhancing visual consistency for image-to-video generation.arXiv preprint arXiv:2402.04324, 2024

work page arXiv 2024

[44] [44]

Proximal Policy Optimization Algorithms

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms.arXiv preprint arXiv:1707.06347, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017

[45] [45]

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models.arXiv preprint arXiv:2402.03300, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[46] [46]

LookWise: Knowing When and Where to Look for Fine-Grained Visual Reasoning in Multimodal Large Language Models

Yuxiang Shen, Hailong Huang, Zhenkun Gao, Xueheng Li, Man Zhou, Chengjun Xie, Haoxuan Che, Xuanhua He, and Jie Zhang. Svfeye: A semantic-visual fusion framework with multi-scale visual context for multimodal reasoning, 2026. URL https://arxiv.org/abs/2603.00171

work page internal anchor Pith review Pith/arXiv arXiv 2026

[47] [47]

Videoveritas: Ai-generated video detection via perception pretext reinforcement learning.arXiv preprint arXiv:2602.08828, 2026

Hao Tan, Jun Lan, Senyuan Shi, Zichang Tan, Zijian Yu, Huijia Zhu, Weiqiang Wang, Jun Wan, and Zhen Lei. Videoveritas: Ai-generated video detection via perception pretext reinforcement learning.arXiv preprint arXiv:2602.08828, 2026. 12

work page arXiv 2026

[48] [48]

arXiv:2503.20752 , year=

Huajie Tan, Yuheng Ji, Xiaoshuai Hao, Xiansheng Chen, Pengwei Wang, Zhongyuan Wang, and Shanghang Zhang. Reason-rft: Reinforcement fine-tuning for visual reasoning of vision language models.arXiv preprint arXiv:2503.20752, 2025

work page arXiv 2025

[49] [49]

Kimi K2: Open Agentic Intelligence

Kimi Team, Yifan Bai, Yiping Bao, Y Charles, Cheng Chen, Guanduo Chen, Haiting Chen, Huarong Chen, Jiahao Chen, Ningxin Chen, et al. Kimi k2: Open agentic intelligence.arXiv preprint arXiv:2507.20534, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[50] [50]

Kling-Omni Technical Report

Kling Team, Jialu Chen, Yuanzheng Ci, Xiangyu Du, Zipeng Feng, Kun Gai, Sainan Guo, Feng Han, Jingbin He, Kang He, et al. Kling-omni technical report.arXiv preprint arXiv:2512.16776, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[51] [51]

Bridging Time and Space: Decoupled Spatio-Temporal Alignment for Video Grounding

Xuezhen Tu, Jingyu Wu, Fangyu Kang, Qingpeng Nong, Kaijin Zhang, Chaoyue Niu, and Fan Wu. Bridging time and space: Decoupled spatio-temporal alignment for video grounding, 2026. URLhttps://arxiv.org/abs/2604.08014

work page internal anchor Pith review Pith/arXiv arXiv 2026

[52] [52]

Wan: Open and Advanced Large-Scale Video Generative Models

Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, Jianyuan Zeng, Jiayu Wang, Jingfeng Zhang, Jingren Zhou, Jinkai Wang, Jixuan Chen, Kai Zhu, Kang Zhao, Keyu Yan, Lianghua Huang, Mengyang Feng, Ningyi Zhang, Pandeng Li, Pingyu Wu, Ruihang Chu, Ruili Feng, Shiwei Zhang, Siyang Sun, Tao Fang, T...

work page internal anchor Pith review Pith/arXiv arXiv 2025

[53] [53]

Learning to generate stylized handwritten text via a unified representation of style, content, and noise

Honglie Wang, Yan-Ming Zhang, Wangzi Yao, Fei Yin, and Cheng-Lin Liu. Learning to generate stylized handwritten text via a unified representation of style, content, and noise. In The Fourteenth International Conference on Learning Representations

[54] [54]

Spacevllm: Endowing multimodal large language model with spatio-temporal video grounding capability, 2025

Jiankang Wang, Zhihan Zhang, Zhihang Liu, Yang Li, Jiannan Ge, Hongtao Xie, and Yongdong Zhang. Spacevllm: Endowing multimodal large language model with spatio-temporal video grounding capability, 2025. URLhttps://arxiv.org/abs/2503.13983

work page arXiv 2025

[55] [55]

TAGRPO: Boosting GRPO on Image-to-Video Generation with Direct Trajectory Alignment

Jin Wang, Jianxiang Lu, Guangzheng Xu, Comi Chen, Haoyu Yang, Linqing Wang, Peng Chen, Mingtao Chen, Zhichao Hu, Longhuang Wu, et al. Tagrpo: Boosting grpo on image-to-video generation with direct trajectory alignment.arXiv preprint arXiv:2601.05729, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[56] [56]

Weath- erdepth: Curriculum contrastive learning for self-supervised depth estimation under adverse weather conditions

Jiyuan Wang, Chunyu Lin, Lang Nie, Shujun Huang, Yao Zhao, Xing Pan, and Rui Ai. Weath- erdepth: Curriculum contrastive learning for self-supervised depth estimation under adverse weather conditions. In2024 IEEE International Conference on Robotics and Automation (ICRA), pages 4976–4982. IEEE, 2024

2024

[57] [57]

Digging into contrastive learning for robust depth estimation with diffusion models

Jiyuan Wang, Chunyu Lin, Lang Nie, Kang Liao, Shuwei Shao, and Yao Zhao. Digging into contrastive learning for robust depth estimation with diffusion models. InProceedings of the 32nd ACM International Conference on Multimedia, pages 4129–4137. ACM, 2024

2024

[58] [58]

Jasmine: Harnessing diffusion prior for self-supervised depth estimation.arXiv preprint arXiv:2503.15905, 2025

Jiyuan Wang, Chunyu Lin, Cheng Guan, Lang Nie, Jing He, Haodong Li, Kang Liao, and Yao Zhao. Jasmine: Harnessing diffusion prior for self-supervised depth estimation.arXiv preprint arXiv:2503.15905, 2025

work page arXiv 2025

[59] [59]

From editor to dense geometry estimator.arXiv preprint arXiv:2509.04338, 2025

JiYuan Wang, Chunyu Lin, Lei Sun, Rongying Liu, Lang Nie, Mingxing Li, Kang Liao, Xiangxiang Chu, and Yao Zhao. From editor to dense geometry estimator.arXiv preprint arXiv:2509.04338, 2025

work page arXiv 2025

[60] [60]

Edit in 2D, Verify in 3D: Reinforcement Learning for Multi-view Consistent Scene Editing

Jiyuan Wang, Chunyu Lin, Lei Sun, Zhi Cao, Yuyang Yin, Lang Nie, Zhenlong Yuan, Xi- angxiang Chu, Yunchao Wei, Kang Liao, et al. Geometry-guided reinforcement learning for multi-view consistent 3d scene editing.arXiv preprint arXiv:2603.03143, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[61] [61]

Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, et al. Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution.arXiv preprint arXiv:2409.12191, 2024. 13

work page internal anchor Pith review Pith/arXiv arXiv 2024

[62] [62]

arXiv preprint arXiv:2510.10518 , year=

Qunzhong Wang, Jie Liu, Jiajun Liang, Yilei Jiang, Yuanxing Zhang, Jinyuan Chen, Yaozhi Zheng, Xintao Wang, Pengfei Wan, Xiangyu Yue, et al. Vr-thinker: Boosting video reward models through thinking-with-image reasoning.arXiv preprint arXiv:2510.10518, 2025

work page arXiv 2025

[63] [63]

Lift: Leveraging human feedback for text-to-video model alignment.arXiv preprint arXiv:2412.04814, 2024

Yibin Wang, Zhiyu Tan, Junyan Wang, Xiaomeng Yang, Cheng Jin, and Hao Li. Lift: Leveraging human feedback for text-to-video model alignment.arXiv preprint arXiv:2412.04814, 2024

work page arXiv 2024

[64] [64]

Unified multimodal chain-of-thought reward model through reinforcement fine-tuning.arXiv preprint arXiv:2505.03318, 2025

Yibin Wang, Zhimin Li, Yuhang Zang, Chunyu Wang, Qinglin Lu, Cheng Jin, and Jiaqi Wang. Unified multimodal chain-of-thought reward model through reinforcement fine-tuning.arXiv preprint arXiv:2505.03318, 2025

work page arXiv 2025

[65] [65]

Unified Reward Model for Multimodal Understanding and Generation

Yibin Wang, Yuhang Zang, Hao Li, Cheng Jin, and Jiaqi Wang. Unified reward model for multimodal understanding and generation.arXiv preprint arXiv:2503.05236, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[66] [66]

Thinking with frames: Generative video distortion evaluation via frame reward model.arXiv preprint arXiv:2601.04033, 2026

Yuan Wang, Borui Liao, Huijuan Huang, Jinda Lu, Ouxiang Li, Kuien Liu, Meng Wang, and Xiang Wang. Thinking with frames: Generative video distortion evaluation via frame reward model.arXiv preprint arXiv:2601.04033, 2026

work page arXiv 2026

[67] [67]

Tune-a-video: One-shot tuning of image diffusion models for text-to-video generation

Jay Zhangjie Wu, Yixiao Ge, Xintao Wang, Stan Weixian Lei, Yuchao Gu, Yufei Shi, Wynne Hsu, Ying Shan, Xiaohu Qie, and Mike Zheng Shou. Tune-a-video: One-shot tuning of image diffusion models for text-to-video generation. InProceedings of the IEEE/CVF international conference on computer vision, pages 7623–7633, 2023

2023

[68] [68]

RewardDance: Reward scaling in visual generation.arXiv preprint arXiv:2509.08826, 2025

Jie Wu, Yu Gao, Zilyu Ye, Ming Li, Liang Li, Hanzhong Guo, Jie Liu, Zeyue Xue, Xiaoxia Hou, Wei Liu, et al. Rewarddance: Reward scaling in visual generation.arXiv preprint arXiv:2509.08826, 2025

work page arXiv 2025

[69] [69]

Phyprompt: Rl-based prompt refinement for physically plausible text-to-video generation.arXiv preprint arXiv:2603.03505, 2026

Shang Wu, Chenwei Xu, Zhuofan Xia, Weijian Li, Lie Lu, Pranav Maneriker, Fan Du, Manling Li, and Han Liu. Phyprompt: Rl-based prompt refinement for physically plausible text-to-video generation.arXiv preprint arXiv:2603.03505, 2026

work page arXiv 2026

[70] [70]

arXiv preprint arXiv:2505.14460 (2025)

Tianhe Wu, Jian Zou, Jie Liang, Lei Zhang, and Kede Ma. Visualquality-r1: Reasoning-induced image quality assessment via reinforcement learning to rank.arXiv preprint arXiv:2505.14460, 2025

work page arXiv 2025

[71] [71]

Imagereward: Learning and evaluating human preferences for text-to-image generation

Jiazheng Xu, Xiao Liu, Yuchen Wu, Yuxuan Tong, Qinkai Li, Ming Ding, Jie Tang, and Yuxiao Dong. Imagereward: Learning and evaluating human preferences for text-to-image generation. Advances in Neural Information Processing Systems, 36:15903–15935, 2023

2023

[72] [72]

VisionReward: Fine-Grained Multi-Dimensional Human Preference Learning for Image and Video Generation

Jiazheng Xu, Yu Huang, Jiale Cheng, Yuanming Yang, Jiajun Xu, Yuan Wang, Wenbo Duan, Shen Yang, Qunlin Jin, Shurun Li, et al. Visionreward: Fine-grained multi-dimensional human preference learning for image and video generation.arXiv preprint arXiv:2412.21059, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[73] [73]

DanceGRPO: Unleashing GRPO on Visual Generation

Zeyue Xue, Jie Wu, Yu Gao, Fangyuan Kong, Lingting Zhu, Mengzhao Chen, Zhiheng Liu, Wei Liu, Qiushan Guo, Weilin Huang, et al. Dancegrpo: Unleashing grpo on visual generation. arXiv preprint arXiv:2505.07818, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[74] [74]

Reason: Reinforced causal search with information bottleneck for video understanding, 2025

Yuan Zhou, Litao Hua, Shilong Jin, Wentao Huang, and Haoran Duan. Reason: Reinforced causal search with information bottleneck for video understanding, 2025. URL https:// arxiv.org/abs/2511.12530

work page arXiv 2025

[75] [75]

arXiv preprint arXiv:2510.27280 (2025) 4

Zirui Zhu, Hailun Xu, Yang Luo, Yong Liu, Kanchan Sarkar, Zhenheng Yang, and Yang You. Focus: Efficient keyframe selection for long video understanding, 2025. URL https: //arxiv.org/abs/2510.27280. 14 Appendix In this appendix, we provide additional implementation details, experimental results, analyses, and discussions for a comprehensive evaluation and ...

work page arXiv 2025

[76] [76]

no visible person

Human analysis(if there is no visible person, you must explicitly write “no visible person”), and the analysis must explicitly state whether an anomaly exists. Clothing: precisely describe the clothing type, color, clarity of patterns or textures, and all accessories. Posture: describe the overall pose and specifically describe the positions of the limbs....

[77] [77]

Object identification: list all visible objects

Object analysis, and the analysis must explicitly state whether an anomaly exists. Object identification: list all visible objects. Object attributes: describe the shape, color, and texture clarity of each object, and evaluate whether its size proportion is reasonable. Spatial relationship: state the positions between objects and people, and between objec...

[78] [78]

no anomaly

Environment analysis, and the analysis must explicitly state whether an anomaly exists. Scene: clearly specify the background type. Lighting: judge the direction of the main light source, and describe the position, length, and softness or hardness of shadows. Environment details: check whether the background contains blur, repeated textures, geometric dis...

[79] [79]

The error type is exactly the same: the error category used and described by the AI model must be completely consistent with the human annotation

[80] [80]

the erroneous object identified by the AI model must refer to the same entity or the same region as the human annotation

The erroneous object is essentially the same: the erroneous object identified by the AI model must refer to the same entity as the human annotation. the erroneous object identified by the AI model must refer to the same entity or the same region as the human annotation