Prosa: Rubric-Based Evaluation of LLMs on Real User Chats in Brazilian Portuguese
Pith reviewed 2026-05-09 13:59 UTC · model grok-4.3
The pith
Binary rubric scoring with multi-judge filtering produces consistent LLM rankings independent of the judge model chosen.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Rankings produced by holistic LLM-as-a-judge scoring are sensitive to the bias of the chosen judge model. Switching to binary rubric scoring with multi-judge filtering removes this sensitivity: decomposing the judgement matters more than the judge model itself. On the Prosa benchmark of 1,000 WildChat conversations, scored by three judges from three model families across 16 models, filtered rubric scoring produces full agreement on all 16 ranks, holistic scoring agrees on only 7 of the 16, and the filtering step increases the average score gap between adjacent models by 47 percent.
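As a rough illustration of what those two headline numbers measure, a minimal sketch follows; it assumes one aggregate score per judge per model, and the function names and data layout are invented here, not taken from the paper's released code.

```python
# Sketch: count the rank positions on which all judges place the same model.
# Assumes each judge yields one aggregate score per evaluated model; the data
# layout is illustrative, not the paper's.
from typing import Dict, List


def rank_order(scores: Dict[str, float]) -> List[str]:
    """Return model names sorted from best to worst score."""
    return sorted(scores, key=scores.get, reverse=True)


def rank_agreement(per_judge_scores: List[Dict[str, float]]) -> int:
    """Number of rank positions where every judge agrees on the model."""
    orders = [rank_order(s) for s in per_judge_scores]
    n = len(orders[0])
    return sum(
        1 for i in range(n)
        if all(order[i] == orders[0][i] for order in orders)
    )
```

With real score tables for the three judges, this count would be 16 of 16 under filtered rubric scoring and 7 of 16 under holistic scoring, per the paper's figures.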
What carries the argument
Binary rubric scoring combined with multi-judge filtering, which breaks each judgment into specific yes-no criteria and retains only scores where the judges align.
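A minimal sketch of that mechanism, under the assumption that every judge answers the same yes-no criteria for every conversation and that a verdict is kept only when all judges give the same answer; the names, data layout, and aggregation into a per-model pass rate are illustrative choices, not the paper's released pipeline.

```python
# Sketch: binary rubric scoring with multi-judge filtering.
# Each judge answers each yes/no criterion for each conversation (True/False).
# A verdict is retained only when all judges agree; a model's score is the
# fraction of retained verdicts it passes.
from typing import Dict

# verdicts[judge][conversation_id][criterion] -> bool
Verdicts = Dict[str, Dict[str, Dict[str, bool]]]


def filtered_rubric_score(verdicts: Verdicts) -> float:
    judges = list(verdicts)
    reference = verdicts[judges[0]]        # assumes all judges cover the same items
    kept, passed = 0, 0
    for conv_id, criteria in reference.items():
        for criterion in criteria:
            answers = [verdicts[j][conv_id][criterion] for j in judges]
            if len(set(answers)) == 1:     # unanimous verdict -> keep it
                kept += 1
                passed += int(answers[0])  # the agreed yes/no answer
    return passed / kept if kept else 0.0
```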
If this is right
- LLM rankings remain the same regardless of which judge model performs the evaluation.
- Score gaps between successive models grow by 47 percent on average, sharpening distinctions among close performers.
- New models can be evaluated on the benchmark at low, roughly fixed cost (about $2.1 per model with Gemini 3 Flash as the judge) using the released benchmark and filtering code.
- The same rubric-plus-filter approach can be reused for other open-ended generation tasks beyond Brazilian Portuguese.
Where Pith is reading between the lines
- The approach could be ported to chat benchmarks in additional languages to test whether judge independence generalizes.
- Standardized rubrics might eventually replace ad-hoc holistic prompts in public leaderboards.
- Testing the rubrics on newer model families would reveal whether the agreement holds as judge capabilities advance.
- The method might surface subtle quality differences that holistic scoring currently masks.
Load-bearing premise
The chosen binary rubrics measure the quality dimensions that matter in real chats without being influenced by which judge model applies them, and the 1,000 conversations represent typical Brazilian Portuguese user interactions.
What would settle it
A new judge model producing a different ordering of the 16 models under the same filtered rubric pipeline would show the method has not eliminated judge sensitivity.
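If someone wanted to run that check, a sketch along these lines would do: score the 16 models with the new judge under the filtered rubric pipeline, then compare its ordering to the consensus one. The use of Kendall's tau and the exact-match test are assumptions made here for illustration, not part of the paper.

```python
# Sketch: does a new judge reproduce the consensus ordering of the 16 models
# under the filtered rubric pipeline? Anything short of an exact match would
# indicate residual judge sensitivity.
from scipy.stats import kendalltau


def judge_sensitivity_check(consensus_scores: dict, new_judge_scores: dict):
    models = sorted(consensus_scores)                  # fixed model order
    tau, _ = kendalltau(
        [consensus_scores[m] for m in models],
        [new_judge_scores[m] for m in models],
    )
    exact_match = (
        sorted(models, key=consensus_scores.get, reverse=True)
        == sorted(models, key=new_judge_scores.get, reverse=True)
    )
    return tau, exact_match
```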
Original abstract
Rankings produced by holistic LLM-as-a-judge scoring are sensitive to the bias of the chosen judge model. We show that switching to binary rubric scoring with multi-judge filtering removes this sensitivity: decomposing the judgement matters more than the judge model itself. To support this claim, we introduce Prosa, the first real user multi-turn Brazilian Portuguese chat benchmark: 1,000 WildChat conversations scored by three judges from three model families on 16 models. Under filtered rubric scoring the three judges agree on every one of the 16 ranks, whereas under holistic scoring they agree on only 7 of 16. Additionally, the rubric filtering pipeline increases the average score gap between neighbouring models by 47%, thereby improving Prosa's discriminative power. Evaluating a new model on Prosa costs approximately $2.1 when using Gemini 3 Flash as the judge. We release the benchmark and the filtering code to ensure that future models can be assessed under identical conditions. These artifacts also make our rubric-based scoring method reusable beyond Prosa, supporting other open-ended evaluation settings.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces Prosa, a benchmark of 1,000 real-user multi-turn Brazilian Portuguese conversations sourced from WildChat. It compares holistic LLM-as-a-judge scoring against binary rubric scoring with a multi-judge filtering pipeline across three judges and 16 evaluated models. The key results are that filtered rubric scoring yields perfect agreement (16/16) on model rankings among the three judges, versus only 7/16 for holistic scoring, and that the filtering increases the average neighboring-model score gap by 47%. The benchmark and filtering code are released.
Significance. If the central claim holds after clarifying the role of filtering, the work would be significant for the field of LLM evaluation. It offers a practical approach to reduce judge-model bias in open-ended chat assessment, particularly valuable for Brazilian Portuguese where such benchmarks are scarce. The release of the dataset, rubrics, and code supports reproducibility and reuse in other settings, which is a clear strength.
Major comments (2)
- [§4 and §3.3] §4 (Results) and §3.3 (Filtering pipeline): The reported perfect rank agreement (16/16) under filtered rubric scoring is presented without the corresponding inter-judge rank agreement figures for the unfiltered rubric scores on the full set of 1,000 conversations. This omission makes it impossible to isolate whether the improvement stems from rubric decomposition or from the filter retaining only low-disagreement conversations, which is load-bearing for the abstract claim that 'decomposing the judgement matters more than the judge model itself'.
- [§5] §5 (Gap analysis) and associated table: The 47% increase in average score gap between neighbouring models is reported for the rubric filtering pipeline, but it is unclear whether the comparison uses the same filtered conversation subset for both rubric and holistic methods or whether the number of retained conversations (and any statistical test for the gap change) is provided. This detail is needed to assess the scope and robustness of the discriminability claim.
Minor comments (2)
- [§3.2] The exact binary rubric definitions for each quality dimension (e.g., the specific criteria and binary thresholds) are referenced but not fully enumerated in the main text; moving the complete rubric table to the appendix or a dedicated figure would improve clarity and reusability.
- [§3.3] The paper should explicitly state the number of conversations retained after each stage of the multi-judge filtering pipeline to allow readers to evaluate selection effects.
Simulated Author's Rebuttal
We thank the referee for the thoughtful and constructive comments, which help strengthen the clarity of our claims regarding the contributions of rubric decomposition versus filtering. We address each major comment below and will revise the manuscript accordingly to improve transparency and reproducibility.
Point-by-point responses
Referee: [§4 and §3.3] §4 (Results) and §3.3 (Filtering pipeline): The reported perfect rank agreement (16/16) under filtered rubric scoring is presented without the corresponding inter-judge rank agreement figures for the unfiltered rubric scores on the full set of 1,000 conversations. This omission makes it impossible to isolate whether the improvement stems from rubric decomposition or from the filter retaining only low-disagreement conversations, which is load-bearing for the abstract claim that 'decomposing the judgement matters more than the judge model itself'.
Authors: We agree that reporting the unfiltered rubric inter-judge rank agreement is essential for isolating the effect of rubric decomposition from the filtering step. In the revised manuscript, we will add these results to Section 4, including the rank agreement figures for unfiltered rubric scores across the three judges on the full set of 1,000 conversations. This addition will allow direct comparison with both the filtered rubric results and the holistic baseline, thereby supporting a clearer evaluation of the claim. revision: yes
Referee: [§5] §5 (Gap analysis) and associated table: The 47% increase in average score gap between neighbouring models is reported for the rubric filtering pipeline, but it is unclear whether the comparison uses the same filtered conversation subset for both rubric and holistic methods or whether the number of retained conversations (and any statistical test for the gap change) is provided. This detail is needed to assess the scope and robustness of the discriminability claim.
Authors: We appreciate this observation on the need for precise methodological details. The 47% increase compares the average neighboring-model score gap under the full rubric filtering pipeline (applied to the retained conversations) against the gap under holistic scoring on the complete set of 1,000 conversations, as the filtering is an integral component of the proposed rubric-based method rather than an optional post-processing step. In the revision, we will explicitly state the number of conversations retained after filtering, add a statistical test (such as a paired t-test on the per-model-pair gaps) to assess the significance of the increase, and include a note clarifying the subsets used for each scoring approach. If space permits, we will also report the holistic gap on the filtered subset as a supplementary comparison. revision: yes
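A minimal sketch of the proposed test, assuming one gap value per adjacent model pair (15 pairs for 16 models) under each scoring method; the paired t-test is the test the rebuttal names, but the data shapes and function here are illustrative.

```python
# Sketch: paired t-test on per-adjacent-pair score gaps, holistic vs. filtered
# rubric. Each entry is the score difference between a model and its immediate
# neighbour in the ranking (15 pairs for 16 models). Inputs are illustrative.
from scipy.stats import ttest_rel


def gap_increase_test(holistic_gaps, rubric_gaps):
    """Return the relative gap increase and the paired t-test result."""
    mean_h = sum(holistic_gaps) / len(holistic_gaps)
    mean_r = sum(rubric_gaps) / len(rubric_gaps)
    increase = (mean_r - mean_h) / mean_h       # ~0.47 for a 47% increase
    t_stat, p_value = ttest_rel(rubric_gaps, holistic_gaps)
    return increase, t_stat, p_value
```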
Circularity Check
No significant circularity in empirical benchmark evaluation
Full rationale
The paper reports direct empirical measurements of inter-judge rank agreement and score gaps on a fixed set of 1,000 conversations under two scoring regimes (holistic vs. filtered rubric). These quantities are computed from the observed judge outputs rather than derived from any self-referential equations, fitted parameters that are then renamed as predictions, or load-bearing self-citations whose validity depends on the present work. The filtering pipeline is presented as an explicit methodological step whose effects are measured on the data; no derivation chain reduces the headline agreement numbers (16/16 vs. 7/16) or the 47% gap improvement to a tautology by construction. This is a standard empirical evaluation paper whose central claims rest on side-by-side comparison rather than mathematical reduction.
Axiom & Free-Parameter Ledger
Axioms (1)
- Domain assumption: Binary rubric scoring decomposes judgment into aspects that are less sensitive to judge-model bias than holistic scoring.