Pith · machine review for the scientific record

arXiv: 2503.12575 · v2 · submitted 2025-03-16 · cs.CV · cs.AI

Recognition: unknown

BalancedDPO: Adaptive Multi-Metric Alignment

Authors on Pith: no claims yet
classification: cs.CV, cs.AI
keywords: preference, alignment, balanceddpo, across, diffusion, stable, aggregation, dynamic
0 comments
Original abstract:

Diffusion models have achieved remarkable progress in text-to-image generation, yet aligning them with human preference remains challenging due to the presence of multiple, sometimes conflicting, evaluation metrics (e.g., semantic consistency, aesthetics, and human preference scores). Existing alignment methods typically optimize for a single metric or rely on scalarized reward aggregation, which can bias the model toward specific evaluation criteria. To address this challenge, we propose BalancedDPO, a framework that achieves multi-metric preference alignment within the Direct Preference Optimization (DPO) paradigm. Unlike prior DPO variants that rely on a single metric, BalancedDPO introduces a majority-vote consensus over multiple preference scorers and integrates it directly into the DPO training loop with dynamic reference model updates. This consensus-based formulation avoids reward-scale conflicts and ensures more stable gradient directions across heterogeneous metrics. Experiments on Pick-a-Pic, PartiPrompt, and HPD datasets demonstrate that BalancedDPO consistently improves preference win rates over the baselines across Stable Diffusion 1.5, Stable Diffusion 2.1 and SDXL backbones. Comprehensive ablations further validate the benefits of majority-vote aggregation and dynamic reference updating, highlighting the method's robustness and generalizability across diverse alignment settings.
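The abstract describes two mechanisms: a majority vote over several preference scorers that replaces scalarized reward aggregation, and a reference model that is periodically refreshed during DPO training. A minimal sketch of how these could fit together is below. Everything in it is an assumption for illustration: the scorer interface, the `model.log_prob` likelihood hook (a real Diffusion-DPO-style loss works on per-step denoising errors, not a single scalar log-likelihood), the update interval `ref_update_every`, and all hyperparameters; none of it is taken from the paper's implementation.

```python
import copy

import torch
import torch.nn.functional as F


def majority_vote_preference(scores_a, scores_b):
    """Each metric casts one vote for the image it scores higher.

    Only pairwise rankings are used, never raw reward values, so
    heterogeneous reward scales cannot bias the consensus.
    Returns True if image A wins the majority vote.
    """
    votes_for_a = sum(sa > sb for sa, sb in zip(scores_a, scores_b))
    return votes_for_a > len(scores_a) / 2


def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO objective on log-likelihoods of the chosen (w)
    and rejected (l) samples under the policy and a frozen reference."""
    margin = (logp_w - logp_l) - (ref_logp_w - ref_logp_l)
    return -F.logsigmoid(beta * margin).mean()


def train(model, dataloader, scorers, num_steps,
          ref_update_every=500, beta=0.1, lr=1e-5):
    """Hypothetical loop: consensus labeling + DPO update
    + periodic ('dynamic') reference refresh."""
    ref_model = copy.deepcopy(model).eval()  # frozen reference policy
    opt = torch.optim.AdamW(model.parameters(), lr=lr)

    for step, (prompt, img_a, img_b) in enumerate(dataloader):
        # 1) Consensus label across heterogeneous metrics
        #    (e.g., a semantic-consistency scorer, an aesthetic
        #    predictor, a human-preference model).
        scores_a = [score(prompt, img_a) for score in scorers]
        scores_b = [score(prompt, img_b) for score in scorers]
        if majority_vote_preference(scores_a, scores_b):
            chosen, rejected = img_a, img_b
        else:  # ties fall through to B; a real system needs a tie rule
            chosen, rejected = img_b, img_a

        # 2) DPO update against the current reference. `log_prob` is a
        #    hypothetical hook standing in for the diffusion likelihood.
        with torch.no_grad():
            ref_w = ref_model.log_prob(prompt, chosen)
            ref_l = ref_model.log_prob(prompt, rejected)
        loss = dpo_loss(model.log_prob(prompt, chosen),
                        model.log_prob(prompt, rejected),
                        ref_w, ref_l, beta)
        opt.zero_grad()
        loss.backward()
        opt.step()

        # 3) Dynamic reference update: periodically re-anchor the
        #    reference to the improving policy so the KL-style
        #    regularizer does not pin training to the initial model.
        if (step + 1) % ref_update_every == 0:
            ref_model = copy.deepcopy(model).eval()
        if step + 1 >= num_steps:
            break
```

One consequence of voting on rankings rather than summing rewards: a scorer with a wildly different numeric scale still contributes exactly one vote, so no single metric can dominate the preference label, which is presumably what the abstract means by avoiding reward-scale conflicts.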

This paper has not been read by Pith yet.

discussion (0)


Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Pareto-Guided Optimal Transport for Multi-Reward Alignment

    cs.CV 2026-05 unverdicted novelty 7.0

    PG-OT builds prompt-specific Pareto frontiers and applies distribution-aware optimal transport to improve multi-reward alignment, while introducing JDR and JCR metrics to measure reward synergy and reward hacking.

  2. HP-Edit: A Human-Preference Post-Training Framework for Image Editing

    cs.CV 2026-04 unverdicted novelty 7.0

    HP-Edit introduces a post-training framework and the RealPref-50K dataset, using a VLM-based HP-Scorer to align diffusion image-editing models with human preferences and improving outputs on Qwen-Image-Edit-2509.

  3. LeapAlign: Post-Training Flow Matching Models at Any Generation Step by Building Two-Step Trajectories

    cs.CV 2026-04 unverdicted novelty 7.0

    LeapAlign fine-tunes flow matching models by constructing two consecutive leaps that skip multiple ODE steps with randomized timesteps and consistency weighting, enabling stable updates at any generation step.