JudgmentBench supplies the first public paired rubric and preference annotations from legal experts on the same LLM outputs, showing comparative judgments outperform rubrics in recovering quality orderings.
Pairwise or pointwise? evaluating feedback protocols for bias in llm-based evaluation
3 Pith papers cite this work. Polarity classification is still indexing.
years
2026 3verdicts
UNVERDICTED 3representative citing papers
This survey introduces the Generate-Filter-Control-Replay (GFCR) taxonomy to structure rollout pipelines for RL-based post-training of reasoning LLMs.
TrOPD stabilizes on-policy distillation for LLMs with trust-region learning, outlier estimation, and off-policy guidance, outperforming prior OPD methods on reasoning and code benchmarks.
citing papers explorer
-
JudgmentBench: Comparing Rubric and Preference Evaluation for Quality Assessment
JudgmentBench supplies the first public paired rubric and preference annotations from legal experts on the same LLM outputs, showing comparative judgments outperform rubrics in recovering quality orderings.
-
Generate, Filter, Control, Replay: A Comprehensive Survey of Rollout Strategies for LLM Reinforcement Learning
This survey introduces the Generate-Filter-Control-Replay (GFCR) taxonomy to structure rollout pipelines for RL-based post-training of reasoning LLMs.
-
Trust Region On-Policy Distillation
TrOPD stabilizes on-policy distillation for LLMs with trust-region learning, outlier estimation, and off-policy guidance, outperforming prior OPD methods on reasoning and code benchmarks.