AMARIS augments rubric-based RL with long-term evaluation memory and dual retrieval to update rubrics, outperforming baselines across domains with ~5% overhead.
13025–13048 (2024)
3 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
years
2026 3roles
background 1polarities
background 1representative citing papers
Prosa demonstrates that rubric-based binary scoring with multi-judge filtering yields full agreement on 16 LLM rankings across judges on Brazilian Portuguese chats, compared to only 7/16 under holistic scoring, while widening score gaps by 47%.
SOB benchmark shows LLMs achieve near-perfect schema compliance but value accuracy of only 83% on text, 67% on images, and 24% on audio.
citing papers explorer
-
AMARIS: A Memory-Augmented Rubric Improvement System for Rubric-Based Reinforcement Learning
AMARIS augments rubric-based RL with long-term evaluation memory and dual retrieval to update rubrics, outperforming baselines across domains with ~5% overhead.
-
Prosa: Rubric-Based Evaluation of LLMs on Real User Chats in Brazilian Portuguese
Prosa demonstrates that rubric-based binary scoring with multi-judge filtering yields full agreement on 16 LLM rankings across judges on Brazilian Portuguese chats, compared to only 7/16 under holistic scoring, while widening score gaps by 47%.
-
The Structured Output Benchmark: A Multi-Source Benchmark for Evaluating Structured Output Quality in Large Language Models
SOB benchmark shows LLMs achieve near-perfect schema compliance but value accuracy of only 83% on text, 67% on images, and 24% on audio.