13025–13048 (2024)

Qin, Yiwei, Song, Kaiqiang, Hu, Yebowen, Yao, Wenlin, Cho, Sangwoo, Wang, Xiaoyang · 2024 · DOI 10.18653/v1/2024.findings-acl.772

3 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background 1

citation-polarity summary

background 1

representative citing papers

AMARIS: A Memory-Augmented Rubric Improvement System for Rubric-Based Reinforcement Learning

cs.LG · 2026-05-18 · unverdicted · novelty 7.0

AMARIS augments rubric-based RL with long-term evaluation memory and dual retrieval to update rubrics, outperforming baselines across domains with ~5% overhead.

Prosa: Rubric-Based Evaluation of LLMs on Real User Chats in Brazilian Portuguese

cs.CL · 2026-05-02 · conditional · novelty 7.0

Prosa demonstrates that rubric-based binary scoring with multi-judge filtering yields full agreement on 16 LLM rankings across judges on Brazilian Portuguese chats, compared to only 7/16 under holistic scoring, while widening score gaps by 47%.

The Structured Output Benchmark: A Multi-Source Benchmark for Evaluating Structured Output Quality in Large Language Models

cs.CL · 2026-04-28 · accept · novelty 7.0

SOB benchmark shows LLMs achieve near-perfect schema compliance but value accuracy of only 83% on text, 67% on images, and 24% on audio.

citing papers explorer

Showing 3 of 3 citing papers.

AMARIS: A Memory-Augmented Rubric Improvement System for Rubric-Based Reinforcement Learning
cs.LG · 2026-05-18 · unverdicted · none · ref 16

Prosa: Rubric-Based Evaluation of LLMs on Real User Chats in Brazilian Portuguese
cs.CL · 2026-05-02 · conditional · none · ref 15

The Structured Output Benchmark: A Multi-Source Benchmark for Evaluating Structured Output Quality in Large Language Models
cs.CL · 2026-04-28 · accept · none · ref 15
