A multi-agent binary reward system with unbiased GRPO post-training on ICLR-320 data outperforms baselines on expert-rated novelty, feasibility, and effectiveness for scientific idea generation.
Configuration [11] (2 Analysts + 1 Evaluator) represents the optimal trade-off chosen for the main pipeline, achieving perfect precision (1.0) with robust recall (0.300).
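To make the reported trade-off concrete, here is a minimal sketch of how precision and recall are computed from confusion counts. The source does not give the underlying counts, so the values below (3 true positives, 0 false positives, 7 false negatives) are hypothetical, chosen only because they reproduce the reported ratios. The function name `precision_recall` is likewise an illustrative assumption.

```python
def precision_recall(tp: int, fp: int, fn: int) -> tuple[float, float]:
    """Return (precision, recall) for a binary classifier or reward.

    precision = TP / (TP + FP): share of accepted items that are truly positive.
    recall    = TP / (TP + FN): share of truly positive items that were accepted.
    A zero denominator yields 0.0 instead of raising an error.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical counts (NOT from the paper): no false positives,
# but 7 of the 10 true positives are missed.
precision, recall = precision_recall(tp=3, fp=0, fn=7)
print(f"precision={precision:.3f} recall={recall:.3f}")
```

Under these assumed counts, every item the system accepts is correct, while most valid items are rejected. A reward signal with that profile is conservative: it rarely rewards a bad idea, at the cost of withholding reward from many good ones.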
1 Pith paper cites this work. Polarity classification is still indexing.
fields: cs.AI (1)
years: 2026 (1)
verdicts: UNVERDICTED (1)
representative citing papers
- Debate as Reward: A Multi-Agent Reward System for Scientific Ideation via RL Post-Training