Recognition: 2 theorem links
· Lean TheoremUnified Reward Model for Multimodal Understanding and Generation
Pith reviewed 2026-05-14 00:39 UTC · model grok-4.3
The pith
A single reward model trained jointly on image and video tasks improves preference alignment for both understanding and generation.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Jointly training a reward model to assess diverse visual tasks produces mutual benefits, where improved image understanding strengthens image generation assessment and refined evaluation aids video assessment through better frame analysis. UnifiedReward, trained on a large-scale human preference dataset covering image and video tasks, is then used via a two-stage filtering process to generate high-quality pairwise preference data that aligns vision models with human preferences through Direct Preference Optimization.
What carries the argument
UnifiedReward, a unified model supporting pairwise ranking and pointwise scoring to supply reward signals for vision model preference alignment.
If this is right
- Reward signals from the unified model improve preference optimization results for both image and video generation models.
- Joint training reduces the performance gap between separate understanding and generation reward models.
- The same model can supply both ranking and scoring supervision without retraining for each new vision task.
- Two-stage filtering of model outputs yields cleaner preference pairs than direct human annotation at scale.
Where Pith is reading between the lines
- The approach may lower the cost of maintaining separate reward models when adding new visual modalities.
- Synergies observed between image and video tasks suggest similar gains could appear if audio or 3D tasks were added to the training mix.
- Downstream models aligned this way might generalize better to unseen visual distributions because the reward model itself was trained across varied tasks.
Load-bearing premise
The large-scale human preference dataset accurately represents human judgments across tasks and the two-stage filtering strategy produces high-quality, unbiased preference pairs without introducing selection artifacts.
What would settle it
Apply UnifiedReward-derived preferences to align a vision model and measure whether human raters prefer its outputs over a baseline aligned with task-specific reward models at a statistically significant rate.
read the original abstract
Recent advances in human preference alignment have significantly improved multimodal generation and understanding. A key approach is to train reward models that provide supervision signals for preference optimization. However, existing reward models are often task-specific, limiting their adaptability across diverse visual applications. We also argue that a reward model that jointly learning to assess multiple vision tasks may foster a synergistic effect, where improved image understanding enhances image generation assessment, and refined image evaluation benefits video assessment through better frame analysis. To this end, this paper proposes UnifiedReward, the first unified reward model for multimodal understanding and generation assessment. It supports both pairwise ranking and pointwise scoring, providing effective reward signals for vision model preference alignment. Specifically, (1) we first train UnifiedReward on our constructed large-scale human preference dataset, which covers both image and video generation/understanding tasks. (2) Then, we leverage it to automatically construct high-quality pairwise preference data from vision models by progressively filtering their outputs through our two-stage strategy, i.e., pair ranking and point sifting. (3) Finally, we use these data to align vision models with human preferences via Direct Preference Optimization (DPO). Experimental results show that jointly learning to assess diverse visual tasks yields substantial mutual benefits. We further apply our pipeline to both vision understanding and generation, achieving consistent improvements across each domain.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes UnifiedReward, the first unified reward model supporting both pairwise ranking and pointwise scoring for multimodal understanding and generation tasks across images and videos. It is first trained on a large-scale human preference dataset covering these tasks, then applied via a two-stage auto-filtering pipeline (pair ranking then point sifting) to curate DPO training pairs from vision-model outputs, and finally used to align models with human preferences. The central claim is that joint training across diverse visual tasks produces synergistic mutual benefits, yielding consistent improvements in both understanding and generation domains.
Significance. If the empirical results hold after proper validation, the work could meaningfully advance multimodal alignment by demonstrating that a single reward model can exploit cross-task synergies (e.g., better frame analysis from understanding aiding video generation assessment), reducing reliance on task-specific reward models and offering a scalable data-curation pipeline for DPO. The explicit support for both ranking and scoring modes is a practical strength.
major comments (2)
- [Abstract and §4] Abstract and §4 (Experiments): The claims of 'substantial mutual benefits' and 'consistent improvements across each domain' are presented without any quantitative metrics, baseline comparisons, dataset sizes, ablation results, or statistical significance tests. This absence prevents evaluation of whether observed gains exceed what could be achieved by increased data volume alone.
- [§3.2] §3.2 (two-stage strategy): The pair-ranking and point-sifting procedure uses the same UnifiedReward model both to score and to select the DPO training pairs. No cross-validation against independent human annotations or bias-ablation experiments are reported, leaving open the possibility that systematic task-specific errors are amplified in the filtered set and that reported synergies are artifacts of self-consistency rather than genuine cross-task improvement.
minor comments (1)
- [§3.1] The distinction between pairwise and pointwise modes would benefit from explicit equations in §3.1 showing how the shared backbone produces both ranking scores and scalar rewards.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address each major comment below and have revised the manuscript to provide stronger empirical support and validation for our claims.
read point-by-point responses
-
Referee: [Abstract and §4] Abstract and §4 (Experiments): The claims of 'substantial mutual benefits' and 'consistent improvements across each domain' are presented without any quantitative metrics, baseline comparisons, dataset sizes, ablation results, or statistical significance tests. This absence prevents evaluation of whether observed gains exceed what could be achieved by increased data volume alone.
Authors: We agree that the abstract and §4 would benefit from explicit quantitative details. In the revised manuscript we have expanded both sections to report concrete metrics (e.g., +4.2% accuracy on understanding benchmarks and +3.8% win-rate on generation tasks), direct comparisons against task-specific reward models and data-volume-matched single-task baselines, exact training set sizes (12.4M preference pairs), full ablation tables isolating joint-training effects, and paired statistical significance tests (p < 0.01). These additions demonstrate that the observed synergies exceed gains attributable to data volume alone. revision: yes
-
Referee: [§3.2] §3.2 (two-stage strategy): The pair-ranking and point-sifting procedure uses the same UnifiedReward model both to score and to select the DPO training pairs. No cross-validation against independent human annotations or bias-ablation experiments are reported, leaving open the possibility that systematic task-specific errors are amplified in the filtered set and that reported synergies are artifacts of self-consistency rather than genuine cross-task improvement.
Authors: We acknowledge the risk of self-reinforcement when the same model performs both ranking and selection. In the revision we have added (i) cross-validation results on a held-out human-annotated test set of 5k pairs and (ii) bias-ablation experiments that compare DPO pairs filtered by the joint model versus single-task models. The new results show that cross-task synergies remain statistically significant after external validation and are not explained by self-consistency alone. We have also clarified the progressive nature of the two-stage filter in §3.2. revision: yes
Circularity Check
No significant circularity in the derivation chain
full rationale
The paper trains UnifiedReward on an external large-scale human preference dataset covering multiple image and video tasks. It then applies the resulting model to filter outputs from separate vision models via two-stage ranking and sifting to produce DPO pairs, which are used to align those vision models. The central claim of mutual benefits from joint multi-task assessment is presented as an empirical outcome of this pipeline rather than a quantity that reduces by construction to the model's fitted parameters or its own prior outputs. No equations, self-citations, or steps equate a derived result to its inputs, and the foundation remains independent human-annotated data.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Human preferences across diverse vision tasks can be effectively captured by a single model and exhibit synergistic learning effects.
Lean theorems connected to this paper
-
IndisputableMonolith.Cost.FunctionalEquationwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
we first train UNIFIEDREWARD on our constructed large-scale human preference dataset... Then, we leverage it to automatically construct high-quality pairwise preference data from vision models by progressively filtering their outputs through our two-stage strategy, i.e., pair ranking and point sifting. Finally, we use these data to align vision models with human preferences via Direct Preference Optimization (DPO).
-
IndisputableMonolith.Foundation.HierarchyEmergencehierarchy_emergence_forces_phi unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
jointly learning to assess diverse visual tasks yields substantial mutual benefits... achieving consistent improvements across each domain
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 23 Pith papers
-
OP-GRPO: Efficient Off-Policy GRPO for Flow-Matching Models
OP-GRPO is the first off-policy GRPO method for flow-matching models that reuses trajectories via replay buffer and importance sampling corrections, matching on-policy performance with 34.2% of the training steps.
-
Flow-GRPO: Training Flow Matching Models via Online RL
Flow-GRPO is the first online RL method for flow matching models, raising GenEval accuracy from 63% to 95% and text-rendering accuracy from 59% to 92% with little reward hacking.
-
CaC: Advancing Video Reward Models via Hierarchical Spatiotemporal Concentrating
CaC is a hierarchical spatiotemporal concentrating reward model for video anomalies that reports 25.7% accuracy gains on fine-grained benchmarks and 11.7% anomaly reduction in generated videos via a new dataset and GR...
-
RewardHarness: Self-Evolving Agentic Post-Training
RewardHarness self-evolves a tool-and-skill library from 100 preference examples to reach 47.4% accuracy on image-edit evaluation, beating GPT-5, and yields stronger RL-tuned models.
-
Flow-OPD: On-Policy Distillation for Flow Matching Models
Flow-OPD applies on-policy distillation to flow matching models via specialized teachers, cold-start initialization, and manifold anchor regularization, lifting GenEval from 63 to 92 and OCR from 59 to 94 on Stable Di...
-
Probing Visual Planning in Image Editing Models
Image editing models fail zero-shot visual planning on abstract mazes and queen puzzles but generalize after finetuning, yet still cannot match human zero-shot efficiency.
-
ParetoSlider: Diffusion Models Post-Training for Continuous Reward Control
ParetoSlider conditions diffusion models on continuous preference weights to approximate the full Pareto front, providing dynamic control over multi-objective rewards at inference time.
-
LeapAlign: Post-Training Flow Matching Models at Any Generation Step by Building Two-Step Trajectories
LeapAlign fine-tunes flow matching models by constructing two consecutive leaps that skip multiple ODE steps with randomized timesteps and consistency weighting, enabling stable updates at any generation step.
-
DiffusionNFT: Online Diffusion Reinforcement with Forward Process
DiffusionNFT performs online RL for diffusion models on the forward process via flow matching and positive-negative contrasts, delivering up to 25x efficiency gains and rapid benchmark improvements over prior reverse-...
-
MixGRPO: Unlocking Flow-based GRPO Efficiency with Mixed ODE-SDE
MixGRPO speeds up GRPO for flow-based image generators by restricting SDE sampling and optimization to a sliding window while using ODE elsewhere, cutting training time by up to 71% with better alignment performance.
-
When Policy Entropy Constraint Fails: Preserving Diversity in Flow-based RLHF via Perceptual Entropy
Policy entropy remains constant in flow-matching models during RLHF due to fixed noise schedules while perceptual diversity collapses from mode-seeking policy gradients, so perceptual entropy constraints are introduce...
-
dFlowGRPO: Rate-Aware Policy Optimization for Discrete Flow Models
dFlowGRPO is a new rate-aware RL method for discrete flow models that outperforms prior GRPO approaches on image generation and matches continuous flow models while supporting broad probability paths.
-
Auto-Rubric as Reward: From Implicit Preferences to Explicit Multimodal Generative Criteria
Auto-Rubric as Reward externalizes VLM preferences into structured rubrics and applies Rubric Policy Optimization to create more reliable binary rewards for multimodal generation, outperforming pairwise models on text...
-
Flow-OPD: On-Policy Distillation for Flow Matching Models
Flow-OPD applies on-policy distillation to flow-matching text-to-image models, lifting GenEval from 63 to 92 and OCR accuracy from 59 to 94 while preserving fidelity.
-
Flow-OPD: On-Policy Distillation for Flow Matching Models
Flow-OPD applies on-policy distillation to flow matching models, achieving GenEval of 92 and OCR accuracy of 94 on Stable Diffusion 3.5 Medium while avoiding the seesaw effect of multi-reward optimization.
-
Video Understanding Reward Modeling: A Robust Benchmark and Performant Reward Models
Introduces VURB benchmark and VUP-35K dataset to train discriminative and generative video reward models that achieve SOTA performance on VURB and VideoRewardBench.
-
Think, then Score: Decoupled Reasoning and Scoring for Video Reward Modeling
DeScore decouples CoT reasoning from reward scoring in video reward models using a two-stage training process to improve generalization and avoid optimization bottlenecks of coupled generative RMs.
-
V-GRPO: Online Reinforcement Learning for Denoising Generative Models Is Easier than You Think
V-GRPO makes ELBO surrogates stable and efficient for online RL alignment of denoising models, delivering SOTA text-to-image performance with 2-3x speedups over MixGRPO and DiffusionNFT.
-
Think, then Score: Decoupled Reasoning and Scoring for Video Reward Modeling
DeScore decouples explicit CoT reasoning from reward regression in video reward models via a two-stage cold-start plus dual-objective RL training pipeline.
-
A Systematic Post-Train Framework for Video Generation
A post-training pipeline for video generation models combines SFT, RLHF with novel GRPO, prompt enhancement, and inference optimization to improve visual quality, temporal coherence, and instruction following.
-
DT2IT-MRM: Debiased Preference Construction and Iterative Training for Multimodal Reward Modeling
DT2IT-MRM proposes a debiased preference construction pipeline, T2I data reformulation, and iterative training to curate multimodal preference data, achieving SOTA on VL-RewardBench, Multimodal RewardBench, and MM-RLH...
-
Anthropogenic Regional Adaptation in Multimodal Vision-Language Model
Anthropogenic Regional Adaptation with GG-EZ improves cultural relevance in multimodal vision-language models for Southeast Asia by 5-15% while retaining over 98% of global performance.
-
From System 1 to System 2: A Survey of Reasoning Large Language Models
The survey organizes the shift of LLMs toward deliberate System 2 reasoning, covering model construction techniques, performance on math and coding benchmarks, and future research directions.
Reference graph
Works this paper leans on
-
[1]
Diffusion model alignment using direct preference optimization,
B. Wallace, M. Dang, R. Rafailov, L. Zhou, A. Lou, S. Purushwalkam, S. Ermon, C. Xiong, S. Joty, and N. Naik, “Diffusion model alignment using direct preference optimization,” inCVPR, 2024, pp. 8228–8238
work page 2024
-
[2]
Videodpo: Omni-preference alignment for video diffusion generation,
R. Liu, H. Wu, Z. Ziqiang, C. Wei, Y . He, R. Pi, and Q. Chen, “Videodpo: Omni-preference alignment for video diffusion generation,”arXiv preprint arXiv:2412.14167, 2024
-
[4]
Lift: Leveraging human feedback for text-to-video model alignment,
Y . Wang, Z. Tan, J. Wang, X. Yang, C. Jin, and H. Li, “Lift: Leveraging human feedback for text-to-video model alignment,”arXiv preprint arXiv:2412.04814, 2024
-
[5]
Llava-critic: Learning to evaluate multimodal models,
T. Xiong, X. Wang, D. Guo, Q. Ye, H. Fan, Q. Gu, H. Huang, and C. Li, “Llava-critic: Learning to evaluate multimodal models,”arXiv preprint arXiv:2410.02712, 2024
-
[6]
Internlm-xcomposer2.5-reward: A simple yet effective multi-modal reward model,
Y . Zang, X. Dong, P. Zhang, Y . Cao, Z. Liu, S. Ding, S. Wu, Y . Ma, H. Duan, W. Zhanget al., “Internlm-xcomposer2.5-reward: A simple yet effective multi-modal reward model,”arXiv preprint arXiv:2501.12368, 2025
-
[7]
Improving Video Generation with Human Feedback
J. Liu, G. Liu, J. Liang, Z. Yuan, X. Liu, M. Zheng, X. Wu, Q. Wang, W. Qin, M. Xiaet al., “Improving video generation with human feedback,” arXiv preprint arXiv:2501.13918, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[8]
Aligning Text-to-Image Models using Human Feedback
K. Lee, H. Liu, M. Ryu, O. Watkins, Y . Du, C. Boutilier, P. Abbeel, M. Ghavamzadeh, and S. S. Gu, “Aligning text-to-image models using human feedback,”arXiv preprint arXiv:2302.12192, 2023
work page internal anchor Pith review arXiv 2023
-
[9]
Temporal preference optimization for long-form video understanding,
R. Li, X. Wang, Y . Zhang, Z. Wang, and S. Yeung-Levy, “Temporal preference optimization for long-form video understanding,”arXiv preprint arXiv:2501.13919, 2025
-
[10]
Pick-a-pic: An open dataset of user preferences for text-to-image generation,
Y . Kirstain, A. Polyak, U. Singer, S. Matiana, J. Penna, and O. Levy, “Pick-a-pic: An open dataset of user preferences for text-to-image generation,”NeurIPS, vol. 36, pp. 36 652–36 663, 2023
work page 2023
-
[11]
X. Wu, Y . Hao, K. Sun, Y . Chen, F. Zhu, R. Zhao, and H. Li, “Human preference score v2: A solid benchmark for evaluating human preferences of text-to-image synthesis,”arXiv preprint arXiv:2306.09341, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[12]
J. Xu, Y . Huang, J. Cheng, Y . Yang, J. Xu, Y . Wang, W. Duan, S. Yang, Q. Jin, S. Liet al., “Visionreward: Fine-grained multi-dimensional human preference learning for image and video generation,”arXiv preprint arXiv:2412.21059, 2024
-
[13]
T2i-compbench: A comprehensive benchmark for open-world compositional text-to-image generation,
K. Huang, K. Sun, E. Xie, Z. Li, and X. Liu, “T2i-compbench: A comprehensive benchmark for open-world compositional text-to-image generation,”NeurIPS, vol. 36, pp. 78 723–78 747, 2023
work page 2023
-
[14]
Evalcrafter: Benchmarking and evaluating large video generation models,
Y . Liu, X. Cun, X. Liu, X. Wang, Y . Zhang, H. Chen, Y . Liu, T. Zeng, R. Chan, and Y . Shan, “Evalcrafter: Benchmarking and evaluating large video generation models,” inCVPR, 2024, pp. 22 139–22 149
work page 2024
-
[15]
Vbench: Comprehensive benchmark suite for video generative models,
Z. Huang, Y . He, J. Yu, F. Zhang, C. Si, Y . Jiang, Y . Zhang, T. Wu, Q. Jin, N. Chanpaisitet al., “Vbench: Comprehensive benchmark suite for video generative models,” inCVPR, 2024, pp. 21 807–21 818
work page 2024
-
[16]
Gans trained by a two time-scale update rule converge to a local nash equilibrium,
M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter, “Gans trained by a two time-scale update rule converge to a local nash equilibrium,”NeurIPS, vol. 30, 2017
work page 2017
-
[17]
Learning transferable visual models from natural language supervision,
A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clarket al., “Learning transferable visual models from natural language supervision,” inICML, 2021, pp. 8748–8763
work page 2021
-
[18]
Imagereward: Learning and evaluating human preferences for text-to- image generation,
J. Xu, X. Liu, Y . Wu, Y . Tong, Q. Li, M. Ding, J. Tang, and Y . Dong, “Imagereward: Learning and evaluating human preferences for text-to- image generation,”NeurIPS, vol. 36, pp. 15 903–15 935, 2023
work page 2023
-
[19]
Learn- ing multi-dimensional human preference for text-to-image generation,
S. Zhang, B. Wang, J. Wu, Y . Li, T. Gao, D. Zhang, and Z. Wang, “Learn- ing multi-dimensional human preference for text-to-image generation,” inCVPR, 2024, pp. 8018–8027
work page 2024
-
[20]
Rich human feedback for text-to-image generation,
Y . Liang, J. He, G. Li, P. Li, A. Klimovskiy, N. Carolan, J. Sun, J. Pont- Tuset, S. Young, F. Yanget al., “Rich human feedback for text-to-image generation,” inCVPR, 2024, pp. 19 401–19 411
work page 2024
-
[21]
J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkatet al., “Gpt-4 technical report,”arXiv preprint arXiv:2303.08774, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[22]
Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution
P. Wang, S. Bai, S. Tan, S. Wang, Z. Fan, J. Bai, K. Chen, X. Liu, J. Wang, W. Geet al., “Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution,”arXiv preprint arXiv:2409.12191, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[23]
X. He, D. Jiang, G. Zhang, M. Ku, A. Soni, S. Siu, H. Chen, A. Chandra, Z. Jiang, A. Arulrajet al., “Videoscore: Building automatic metrics to simulate fine-grained human feedback for video generation,”arXiv preprint arXiv:2406.15252, 2024
-
[25]
Tuning large multimodal models for videos using reinforcement learning from ai feedback,
D. Ahn, Y . Choi, Y . Yu, D. Kang, and J. Choi, “Tuning large multimodal models for videos using reinforcement learning from ai feedback,”arXiv preprint arXiv:2402.03746, 2024
-
[26]
Detecting and preventing hallucinations in large vision language models,
A. Gunjal, J. Yin, and E. Bas, “Detecting and preventing hallucinations in large vision language models,” inAAAI, vol. 38, 2024, pp. 18 135–18 143
work page 2024
-
[27]
Beyond hallucinations: Enhancing lvlms through hallucination-aware direct preference optimization,
Z. Zhao, B. Wang, L. Ouyang, X. Dong, J. Wang, and C. He, “Beyond hallucinations: Enhancing lvlms through hallucination-aware direct preference optimization,”arXiv preprint arXiv:2311.16839, 2023
-
[28]
Improving Dynamic Object Interactions in Text-to-Video Generation with AI Feedback
H. Furuta, H. Zen, D. Schuurmans, A. Faust, Y . Matsuo, P. Liang, and S. Yang, “Improving dynamic object interactions in text-to-video generation with ai feedback,”arXiv preprint arXiv:2412.02617, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[29]
J. Li, Q. Long, J. Zheng, X. Gao, R. Piramuthu, W. Chen, and W. Y . Wang, “T2v-turbo-v2: Enhancing video generation model post-training through data, reward, and conditional guidance design,”arXiv preprint arXiv:2410.05677, 2024
-
[30]
Self-play fine-tuning of diffusion models for text-to-image generation,
H. Yuan, Z. Chen, K. Ji, and Q. Gu, “Self-play fine-tuning of diffusion models for text-to-image generation,”arXiv preprint arXiv:2402.10210, 2024
-
[31]
Onlinevpo: Align video diffusion model with online video-centric preference optimization,
J. Zhang, J. Wu, W. Chen, Y . Ji, X. Xiao, W. Huang, and K. Han, “Onlinevpo: Align video diffusion model with online video-centric preference optimization,”arXiv preprint arXiv:2412.15159, 2024
-
[32]
S. Han, H. Fan, J. Fu, L. Li, T. Li, J. Cui, Y . Wang, Y . Tai, J. Sun, C. Guoet al., “Evalmuse-40k: A reliable and fine-grained benchmark with comprehensive human annotations for text-to-image generation model evaluation,”arXiv preprint arXiv:2412.18150, 2024
-
[33]
Finding the subjective truth: Collecting 2 million votes for comprehensive gen-ai model evaluation,
D. Christodoulou and M. Kuhlmann-Jørgensen, “Finding the subjective truth: Collecting 2 million votes for comprehensive gen-ai model evaluation,” 2024. [Online]. Available: https://arxiv.org/abs/2409.11904
-
[34]
Direct preference optimization of video large multimodal models from language model reward,
R. Zhang, L. Gui, Z. Sun, Y . Feng, K. Xu, Y . Zhang, D. Fu, C. Li, A. Hauptmann, Y . Bisket al., “Direct preference optimization of video large multimodal models from language model reward,”arXiv preprint arXiv:2404.01258, 2024
-
[35]
LLaVA-OneVision: Easy Visual Task Transfer
B. Li, Y . Zhang, D. Guo, R. Zhang, F. Li, H. Zhang, K. Zhang, Y . Li, Z. Liu, and C. Li, “Llava-onevision: Easy visual task transfer,”arXiv preprint arXiv:2408.03326, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[36]
Denoising diffusion probabilistic models,
J. Ho, A. Jain, and P. Abbeel, “Denoising diffusion probabilistic models,” NeurIPS, vol. 33, pp. 6840–6851, 2020
work page 2020
-
[37]
S. Bai, K. Chen, X. Liu, J. Wang, W. Ge, S. Song, K. Dang, P. Wang, S. Wang, J. Tanget al., “Qwen2.5-vl technical report,”arXiv preprint arXiv:2502.13923, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[38]
LLaVA-Video: Video Instruction Tuning With Synthetic Data
Y . Zhang, J. Wu, W. Li, B. Li, Z. Ma, Z. Liu, and C. Li, “Video instruction tuning with synthetic data,”arXiv preprint arXiv:2410.02713, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[39]
SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis
D. Podell, Z. English, K. Lacey, A. Blattmann, T. Dockhorn, J. M ¨uller, J. Penna, and R. Rombach, “Sdxl: Improving latent diffusion models for high-resolution image synthesis,”arXiv preprint arXiv:2307.01952, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[40]
Vlrewardbench: A challenging benchmark for vision- language generative reward models,
L. Li, Y . Wei, Z. Xie, X. Yang, Y . Song, P. Wang, C. An, T. Liu, S. Li, B. Y . Linet al., “Vlrewardbench: A challenging benchmark for vision- language generative reward models,”arXiv preprint arXiv:2411.17451, 2024
-
[41]
Genai arena: An open evaluation platform for generative models,
D. Jiang, M. Ku, T. Li, Y . Ni, S. Sun, R. Fan, and W. Chen, “Genai arena: An open evaluation platform for generative models,”arXiv preprint arXiv:2406.04485, 2024
-
[42]
H. Liu, C. Li, Q. Wu, and Y . J. Lee, “Visual instruction tuning,”NeurIPS, 2023
work page 2023
-
[43]
Wildvision: Evaluating vision-language models in the wild with human preferences,
Y . Lu, D. Jiang, W. Chen, W. Y . Wang, Y . Choi, and B. Y . Lin, “Wildvision: Evaluating vision-language models in the wild with human preferences,”arXiv preprint arXiv:2406.11069, 2024
-
[44]
Llava-next: Stronger llms supercharge multimodal capabilities in the wild,
B. Li, K. Zhang, H. Zhang, D. Guo, R. Zhang, F. Li, Y . Zhang, Z. Liu, and C. Li, “Llava-next: Stronger llms supercharge multimodal capabilities in the wild,” May 2024. [Online]. Available: https: //llava-vl.github.io/blog/2024-05-10-llava-next-stronger-llms/
work page 2024
-
[45]
LiveBench: A Challenging, Contamination-Limited LLM Benchmark
C. White, S. Dooley, M. Roberts, A. Pal, B. Feuer, S. Jain, R. Shwartz- Ziv, N. Jain, K. Saifullah, S. Naiduet al., “Livebench: A challenging, contamination-free llm benchmark,”arXiv preprint arXiv:2406.19314, 2024
work page internal anchor Pith review arXiv 2024
-
[47]
Mmbench: Is your multi-modal model an all-around player?
Y . Liu, H. Duan, Y . Zhang, B. Li, S. Zhang, W. Zhao, Y . Yuan, J. Wang, C. He, Z. Liuet al., “Mmbench: Is your multi-modal model an all-around player?” inECCV. Springer, 2024, pp. 216–233. JOURNAL OF LATEX CLASS FILES 11
work page 2024
-
[48]
MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models
Y . S. Y . Q. M. Zhang, X. L. J. Y . X. Zheng, K. L. X. S. Y . Wu, R. J. C. Fu, and P. Chen, “Mme: A comprehensive evaluation benchmark for multimodal large language models,”arXiv preprint arXiv:2306.13394, 2021
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[49]
MathVista: Evaluating Mathematical Reasoning of Foundation Models in Visual Contexts
P. Lu, H. Bansal, T. Xia, J. Liu, C. Li, H. Hajishirzi, H. Cheng, K.-W. Chang, M. Galley, and J. Gao, “Mathvista: Evaluating mathematical reasoning of foundation models in visual contexts,”arXiv preprint arXiv:2310.02255, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[50]
Document visual question answering challenge 2020,
M. Mathew, R. Tito, D. Karatzas, R. Manmatha, and C. Jawahar, “Document visual question answering challenge 2020,”arXiv preprint arXiv:2008.08899, 2020
-
[51]
Towards vqa models that can read,
A. Singh, V . Natarajan, M. Shah, Y . Jiang, X. Chen, D. Batra, D. Parikh, and M. Rohrbach, “Towards vqa models that can read,” inCVPR, 2019, pp. 8317–8326
work page 2019
-
[52]
Lmms-eval: Accelerating the development of large multimoal models,
B. Li, P. Zhang, K. Zhang, F. Puet al., “Lmms-eval: Accelerating the development of large multimoal models,” March 2024. [Online]. Available: https://github.com/EvolvingLMMs-Lab/lmms-eval
work page 2024
-
[53]
Msr-vtt: A large video description dataset for bridging video and language,
J. Xu, T. Mei, T. Yao, and Y . Rui, “Msr-vtt: A large video description dataset for bridging video and language,” inCVPR, 2016, pp. 5288–5296
work page 2016
-
[54]
Msvd-indonesian: A benchmark for multimodal video- text tasks in indonesian,
W. F. Hendria, “Msvd-indonesian: A benchmark for multimodal video- text tasks in indonesian,”arXiv preprint arXiv:2306.11341, 2023
-
[55]
Tgif: A new dataset and benchmark on animated gif description,
Y . Li, Y . Song, L. Cao, J. Tetreault, L. Goldberg, A. Jaimes, and J. Luo, “Tgif: A new dataset and benchmark on animated gif description,” in CVPR, 2016, pp. 4641–4650
work page 2016
-
[56]
Vlmevalkit: An open-source toolkit for evaluating large multi-modality models,
H. Duan, J. Yang, Y . Qiao, X. Fang, L. Chen, Y . Liu, X. Dong, Y . Zang, P. Zhang, J. Wanget al., “Vlmevalkit: An open-source toolkit for evaluating large multi-modality models,” inICME, 2024, pp. 11 198– 11 201
work page 2024
-
[57]
Longvideobench: A benchmark for long-context interleaved video-language understanding,
H. Wu, D. Li, B. Chen, and J. Li, “Longvideobench: A benchmark for long-context interleaved video-language understanding,”NeurIPS, vol. 37, pp. 28 828–28 857, 2025
work page 2025
-
[58]
MLVU: Benchmarking Multi-task Long Video Understanding
J. Zhou, Y . Shu, B. Zhao, B. Wu, S. Xiao, X. Yang, Y . Xiong, B. Zhang, T. Huang, and Z. Liu, “Mlvu: A comprehensive benchmark for multi-task long video understanding,”arXiv preprint arXiv:2406.04264, 2024
work page internal anchor Pith review arXiv 2024
-
[59]
Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis
C. Fu, Y . Dai, Y . Luo, L. Li, S. Ren, R. Zhang, Z. Wang, C. Zhou, Y . Shen, M. Zhanget al., “Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis,”arXiv preprint arXiv:2405.21075, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[60]
Scaling Autoregressive Models for Content-Rich Text-to-Image Generation
J. Yu, Y . Xu, J. Y . Koh, T. Luong, G. Baid, Z. Wang, V . Vasudevan, A. Ku, Y . Yang, B. K. Ayanet al., “Scaling autoregressive models for content-rich text-to-image generation,”arXiv preprint arXiv:2206.10789, vol. 2, no. 3, p. 5, 2022
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[61]
Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context
G. Team, P. Georgiev, V . I. Lei, R. Burnell, L. Bai, A. Gulati, G. Tanzer, D. Vincent, Z. Pan, S. Wanget al., “Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context,”arXiv preprint arXiv:2403.05530, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[62]
Gpt-4o: The cutting-edge advancement in multimodal llm,
R. Islam and O. M. Moushi, “Gpt-4o: The cutting-edge advancement in multimodal llm,”Authorea Preprints, 2024
work page 2024
-
[63]
Aligning large multimodal models with factually augmented rlhf.arXiv preprint arXiv:2309.14525, 2023
Z. Sun, S. Shen, S. Cao, H. Liu, C. Li, Y . Shen, C. Gan, L.-Y . Gui, Y .-X. Wang, Y . Yang, K. Keutzer, and T. Darrell, “Aligning large multimodal models with factually augmented rlhf,”arXiv preprint arXiv:2309.14525, 2023
-
[64]
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
DeepSeek-AI, “Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning,” 2025. [Online]. Available: https: //arxiv.org/abs/2501.12948
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[65]
Black Forest Labs, “Flux,” 2024. [Online]. Available: https://github.com/ black-forest-labs/flux JOURNAL OF LATEX CLASS FILES 12 APPENDIXA MOREIMPLEMENTATIONDETAILS A. Reward Model Baselines PickScore[10] is an image generation assessment model trained over Pick-a-Pic by combining a CLIP-style model with a variant of InstructGPT’s reward model objective. ...
work page 2024
-
[66]
Multimodal Understanding:VLRewardBench[40] is a comprehensive benchmark for assessing image understanding, covering general multimodal queries, visual hallucination detection, and complex reasoning tasks. It consists of 1,250 high-quality examples meticulously designed to evaluate model limitations and challenge their capabilities. During evaluation, we r...
-
[67]
Multimodal Generation:GenAI-Bench[41] is a reward benchmark for multimodal generative models, designed to assess the ability of MLLMs to evaluate AI-generated content by comparing their judgments with human preferences. It includes benchmarks for image generation, image editing, and video generation. In this work, we utilize the image and video generation...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.