MoDoMoDo: Multi-domain data mixtures for multimodal LLM reinforcement learning. CoRR, abs/2505.24871, 2025.
2 Pith papers cite this work. Polarity classification is still indexing.
Fields: cs.CL (2)
Verdicts: UNVERDICTED (2)
Citing papers
- Harmony in Diversity: Multi-domain Contrastive Policy Optimization for Large Reasoning Models
MCPO applies contrastive learning to GRPO-style RL by treating cross-domain correct rollouts as positives and incorrect ones as negatives to improve multi-domain reasoning performance in LRMs.
- Perception-Aware Policy Optimization for Multimodal Reasoning
PAPO integrates perception-aware supervision via a KL-based loss into RLVR methods like GRPO, yielding 4.4-17.5% gains on multimodal benchmarks and 30.5% fewer perception errors, with larger gains on vision-heavy tasks.