EvoRubrics: Dynamic Rubrics as Rewards via Adversarial Co-Evolution for LLM Reinforcement Learning

Baixiang Huang; Hongxin Ding; Jinyang Zhang; Junfeng Zhao; Weibin Liao; Yasha Wang; Yue Fang; Zheng Li; Zhijing Wu

arxiv: 2606.23038 · v1 · pith:T2YUOL5Dnew · submitted 2026-06-22 · 💻 cs.LG · cs.AI

EvoRubrics: Dynamic Rubrics as Rewards via Adversarial Co-Evolution for LLM Reinforcement Learning

Hongxin Ding , Baixiang Huang , Yue Fang , Weibin Liao , Zheng Li , Jinyang Zhang , Zhijing Wu , Junfeng Zhao

show 1 more author

Yasha Wang

This is my paper

Pith reviewed 2026-06-26 09:22 UTC · model grok-4.3

classification 💻 cs.LG cs.AI

keywords dynamic rubricsadversarial co-evolutionLLM reinforcement learningreward modelsautomatic curriculumself-supervised RLrubric-based rewards

0 comments

The pith

A policy LLM and rubric generator co-evolve through adversarial updates inside each RL training step to keep rewards informative as capabilities grow.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents EvoRubrics as a framework in which a policy model and a rubric generator train jointly, with the generator revising its criteria in direct response to the policy's latest outputs within the same step. This real-time adaptation prevents the reward signals from becoming useless once the policy surpasses the original fixed standards. The method produces stronger results than either unchanging rubrics or earlier dynamic approaches that update less frequently or require outside models. Even when no external supervision or ground-truth answers are supplied at all, the self-supervised version still records gains, indicating that the mutual pressure between generation and evaluation supplies enough signal on its own. The trained generator can later function as a reward model on new tasks without further adjustment.

Core claim

The central claim is that joint adversarial co-evolution of the policy LLM and the rubric generator inside every training step allows evaluation criteria to track policy improvements in real time. This produces an automatic curriculum, sustains discriminative power in the rewards, and supports effective learning even in a fully self-supervised regime that uses neither ground-truth answers nor frontier models.

What carries the argument

The adversarial co-evolution loop in which the rubric generator revises its criteria on the basis of the current policy outputs and the policy then optimizes against the updated rubrics, all within a single training step.

If this is right

Rubrics stay discriminative instead of saturating as the policy improves.
An automatic curriculum arises naturally from the real-time rubric adaptation.
The learned rubric generator transfers to new tasks as a standalone reward model.
Meaningful performance gains occur even in a version that receives no external supervision.
The method outperforms both static rubrics and prior dynamic-rubric baselines across the tested benchmarks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same closed-loop dynamic could reduce reliance on human-written rubrics or large external models when training agents for open-ended tasks.
Co-evolution of generation and evaluation might be tested on other signals such as code correctness or stylistic criteria.
The pattern suggests evaluation capabilities can be bootstrapped alongside generation capabilities without an external anchor.
Applying the approach at larger model scales or with different reinforcement-learning algorithms would show whether the within-step update timing remains stable.

Load-bearing premise

Joint adversarial updates inside each step can keep the rubric generator sufficiently ahead of the policy without external ground truth or frontier models.

What would settle it

A run in which the policy quickly learns to produce outputs that score highly on the simultaneously generated rubrics yet show no corresponding improvement on held-out task measures or human judgments.

Figures

Figures reproduced from arXiv: 2606.23038 by Baixiang Huang, Hongxin Ding, Jinyang Zhang, Junfeng Zhao, Weibin Liao, Yasha Wang, Yue Fang, Zheng Li, Zhijing Wu.

**Figure 1.** Figure 1: Static rubrics vs. EvoRubrics. Static rubrics provide non-adaptive evaluations and unstable rewards; EvoRubrics co-evolves a Rubric Generator and Policy LLM, yielding stable rewards and consistent gains. become a central paradigm for aligning and improving Large Language Models (LLMs) (Achiam et al., 2023; Yang et al., 2025), yet its success hinges on reliable reward signals, such as human preferences or … view at source ↗

**Figure 2.** Figure 2: EvoRubrics framework overview. At each step, the Policy LLM and Rubric Generator generate M candidate answers and N rubric sets via dual LoRA adapters; a judge model scores all answer-rubric pairs to form an M × N matrix; policy and rubric rewards are computed from this matrix; and both adapters are updated with GRPO, enabling real-time co-evolution. 4.2 Policy LLM Optimization Given a query q, the Policy … view at source ↗

read the original abstract

Rubric-based rewards offer interpretable and fine-grained optimization signals for reinforcement learning in open-ended tasks where verifiable answers are unavailable. However, pre-constructed rubrics remain static throughout training, creating a fundamental mismatch with the evolving policy: fixed criteria gradually lose discriminative power as the model improves, leading to reward saturation and potential hacking. Recent dynamic rubric methods partially address this but rely on external frontier models or ground-truth answers, and update rubrics only at coarse granularity. We propose EvoRubrics, a co-evolutionary RL framework where a Policy LLM and a Rubric Generator jointly improve through adversarial interaction within each training step. As the policy improves under the rubric generator's guidance, the rubric generator adapts its criteria to remain discriminative and informative, enabling evaluation to track the policy in real time and naturally inducing an automatic curriculum. Experiments show that EvoRubrics consistently outperforms static and dynamic rubric baselines across benchmarks. The learned Rubric Generator further generalizes as a transferable reward model. Notably, even a fully self-supervised variant without any external supervision achieves meaningful gains, suggesting that co-evolution between generation and evaluation alone can provide sufficiently rich learning signals. Our code is publicly available at https://anonymous.4open.science/r/EvoRubrics-2155/.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

EvoRubrics adds intra-step adversarial co-evolution between policy and rubric generator, which is a concrete step past prior dynamic-rubric work, but the self-supervised performance claims rest on experimental details that are not visible here.

read the letter

The main new piece is the joint adversarial update inside each training step, where the rubric generator has to stay ahead of the improving policy without external supervision or frontier models. That setup, plus the claim that a fully self-supervised version still produces usable gains, is not in the earlier dynamic-rubric papers the abstract cites.

The paper correctly identifies the saturation problem with fixed rubrics and shows how co-evolution could create an automatic curriculum. Making the learned rubric generator transferable is also a practical plus, and releasing the code is helpful.

The soft spot is the evidence. The abstract states consistent outperformance and meaningful self-supervised gains but gives no numbers, no baselines, no ablations, and no statistical tests. Without those, it is impossible to tell whether the co-evolutionary loop actually keeps the generator discriminative or whether the policy simply exploits a lagging rubric. The stress-test worry about update ordering and equilibrium stability is therefore still open.

This is aimed at people working on RL for open-ended LLM tasks who already know the static-rubric limitations. A reader who wants a worked-out alternative to external reward models could get value from the idea, but only after seeing the full experimental section.

It deserves peer review so the experiments can be checked; the core mechanism is worth testing even if the current write-up leaves the stability question unresolved.

Referee Report

3 major / 2 minor

Summary. The manuscript proposes EvoRubrics, a co-evolutionary RL framework where a Policy LLM and Rubric Generator are jointly updated through adversarial interaction within each training step. The approach aims to maintain discriminative rubrics as the policy improves, addressing reward saturation in static rubrics and reliance on external models in prior dynamic methods. Central claims include consistent outperformance over static and dynamic rubric baselines across benchmarks, generalization of the learned Rubric Generator as a transferable reward model, and meaningful gains even for a fully self-supervised variant without external supervision or ground truth.

Significance. If the empirical results and stability of the co-evolutionary mechanism hold, the work could enable scalable adaptive rewards and automatic curricula for open-ended LLM tasks without frontier models or ground-truth answers. Public code release strengthens reproducibility.

major comments (3)

[Method] Method section (likely §3): the description of joint adversarial updates within each training step provides no equations, loss formulations, or pseudocode for the update ordering, generator loss, or stability condition. This is load-bearing for the self-supervised variant claim, as the skeptic concern about the generator lagging and enabling reward hacking cannot be evaluated.
[Experiments] Experiments section: the abstract and available text assert outperformance and generalization but report no quantitative metrics, specific baselines, statistical tests, ablation results on update dynamics, or details on how the self-supervised variant avoids saturation. These omissions prevent assessment of the central empirical claims.
[Results] §4 or equivalent (results on transferable reward model): the generalization claim for the Rubric Generator requires evidence that it was tested on held-out policies or tasks with controlled policy improvement rates; without such controls or metrics, the transfer result cannot be distinguished from baseline rubric quality.

minor comments (2)

[Abstract] The abstract states 'our code is publicly available' but the link is to an anonymous repository; ensure the final version includes a permanent, non-anonymous link with reproduction scripts.
[Introduction] Notation for the adversarial interaction (e.g., how the rubric generator's criteria are parameterized and optimized) should be introduced earlier and used consistently.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and will revise the manuscript accordingly to improve clarity and completeness.

read point-by-point responses

Referee: [Method] Method section (likely §3): the description of joint adversarial updates within each training step provides no equations, loss formulations, or pseudocode for the update ordering, generator loss, or stability condition. This is load-bearing for the self-supervised variant claim, as the skeptic concern about the generator lagging and enabling reward hacking cannot be evaluated.

Authors: We agree that explicit formulations are necessary for rigorous evaluation, particularly regarding stability in the self-supervised case. The revised manuscript will include the mathematical definitions of the joint adversarial updates, the Rubric Generator loss, the policy objective, the precise ordering of updates within each step, and any stability analysis or conditions used to mitigate risks such as generator lag or reward hacking. revision: yes
Referee: [Experiments] Experiments section: the abstract and available text assert outperformance and generalization but report no quantitative metrics, specific baselines, statistical tests, ablation results on update dynamics, or details on how the self-supervised variant avoids saturation. These omissions prevent assessment of the central empirical claims.

Authors: The full manuscript contains quantitative results, baseline comparisons, and descriptions of the self-supervised variant. To strengthen the presentation, we will add statistical significance tests, expanded ablations on update dynamics, and explicit discussion of saturation avoidance mechanisms in the self-supervised setting. revision: partial
Referee: [Results] §4 or equivalent (results on transferable reward model): the generalization claim for the Rubric Generator requires evidence that it was tested on held-out policies or tasks with controlled policy improvement rates; without such controls or metrics, the transfer result cannot be distinguished from baseline rubric quality.

Authors: We will augment the results section with additional experiments and metrics demonstrating the Rubric Generator's performance on held-out policies and tasks. These will include controls for policy improvement rates and comparisons showing transfer gains beyond static baseline rubric quality. revision: yes

Circularity Check

0 steps flagged

No derivation chain; empirical performance claims only

full rationale

The paper describes an empirical co-evolutionary RL method and reports experimental outperformance on benchmarks, including a self-supervised variant. No first-principles derivation, uniqueness theorem, or prediction is presented that reduces by construction to fitted inputs, self-citations, or ansatzes. The central results are benchmark comparisons, not tautological re-statements of training objectives. Self-citations, if present, are not load-bearing for any claimed derivation. This is the expected non-finding for an applied empirical paper.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are stated. The method implicitly relies on standard RL assumptions and the existence of a stable adversarial training dynamic between two LLMs.

pith-pipeline@v0.9.1-grok · 5781 in / 1107 out tokens · 17158 ms · 2026-06-26T09:22:41.419879+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

41 extracted references · 2 linked inside Pith

[1]

Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, and 1 others

Rubrics as rewards: Reinforcement learn- ing beyond verifiable domains.arXiv preprint arXiv:2507.17746. Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, and 1 others. 2025. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948...

Pith/arXiv arXiv 2025
[2]

Nils Reimers and Iryna Gurevych

Direct preference optimization: Your language model is secretly a reward model.Advances in neural information processing systems, 36:53728–53741. Nils Reimers and Iryna Gurevych. 2019. Sentence-bert: Sentence embeddings using siamese bert-networks. InProceedings of the 2019 conference on empirical methods in natural language processing and the 9th interna...

arXiv 2019
[3]

Qiyuan Zhang, Junyi Zhou, Yufei Wang, Fuyuan Lyu, Yidong Ming, Can Xu, Qingfeng Sun, Kai Zheng, Peng Kang, Xue Liu, and 1 others

Qwen3 technical report.arXiv preprint arXiv:2505.09388. Qiyuan Zhang, Junyi Zhou, Yufei Wang, Fuyuan Lyu, Yidong Ming, Can Xu, Qingfeng Sun, Kai Zheng, Peng Kang, Xue Liu, and 1 others. 2026. Rubricbench: Aligning model-generated rubrics with human standards.arXiv preprint arXiv:2603.01562. Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhangha...

Pith/arXiv arXiv 2026
[4]

Yang Zhou, Sunzhu Li, Shunyu Liu, Wenkai Fang, Kongcheng Zhang, Jiale Zhao, Jingwen Yang, Yihe Zhou, Jianwei Lv, Tongya Zheng, and 1 others

Judging llm-as-a-judge with mt-bench and chatbot arena.Advances in neural information pro- cessing systems, 36:46595–46623. Yang Zhou, Sunzhu Li, Shunyu Liu, Wenkai Fang, Kongcheng Zhang, Jiale Zhao, Jingwen Yang, Yihe Zhou, Jianwei Lv, Tongya Zheng, and 1 others
[5]

A EvoRubrics Algorithm We provide the pseudo codes for EvoRubrics algo- rithm as below

Breaking the exploration bottleneck: Rubric- scaffolded reinforcement learning for general llm reasoning.arXiv preprint arXiv:2508.16949. A EvoRubrics Algorithm We provide the pseudo codes for EvoRubrics algo- rithm as below. Algorithm 1EVORUBRICS: Co-Evolutionary Training Require: Base LLM π, judge model J , reward weightsλ, datasetD 1: Initialize LoRA a...

arXiv 2025
[6]

points:-8 axis:accuracy

Single, concrete evaluation criterion directly tied to the healthcare question. points:-8 axis:accuracy
[7]

points:7 axis:completeness

Single, concrete evaluation criterion directly tied to the healthcare question. points:7 axis:completeness ... **AXIS DEFINITIONS (USE EXACTLY ONE PER RUBRIC):** - accuracy: Fac- tual correctness of medical information - completeness: Thoroughness of response - context_awareness: Understanding user’s situation - communication_quality: Clarity and appropri...
[8]

Mandatory rule X violation

Do NOT add any content after the last axis line. 6. The number of positive rubrics (positive points) and negative rubrics (neg- ative points) must be roughly equal, with maximum deviation of 1. 7. Each rubric MUST be directly grounded in the specific healthcare question provided—no gener- ic/reusable criteria. 8. Each axis MUST correspond logically to the...
[9]

points:-8 axis:accuracy

Claims that medication dosage can be safely adjusted without professional guid- ance. points:-8 axis:accuracy
[10]

points:8 axis:accuracy

Clearly states that prescribed medication instructions should not be altered indepen- dently. points:8 axis:accuracy
[11]

points:-6 axis:completeness

Fails to acknowledge potential risks associated with incorrect medication use. points:-6 axis:completeness
[12]

points:7 axis:instruction_following

Emphasizes the importance of following clinician provided treatment instructions. points:7 axis:instruction_following
[13]

points:-7 axis:context_awareness

Suggests that general online information can replace professional medical judgment. points:-7 axis:context_awareness
[14]

points:6 axis:communication_quality

Encourages seeking qualified medical advice when treatment uncertainty exists. points:6 axis:communication_quality
[15]

points:-5 axis:completeness

Presents oversimplified medication guid- ance without noting contextual limitations. points:-5 axis:completeness
[16]

points:5 axis:communication_quality

Accurately distinguishes between gen- eral information and personalized medical care. points:5 axis:communication_quality
[17]

points:-4 axis:accuracy

Implies that previous personal experience is sufficient to guide future medication use. points:-4 axis:accuracy
[18]

explanation

Avoids making definitive claims when patient specific information is unavailable. points:4 axis:communication_quality **SCORING GUIDANCE:** - Positive points (1-10): correct, helpful, safe recom- mendations - Negative points (-1 to -10): incorrect, harmful, or misleading informa- tion - Higher absolute values = greater im- portance/criticality **TASK** Ge...
[19]

Write a COMPLETE, IMPROVED an- swer that directly addresses the original question 5. Do NOT reference rubrics or scores in the final output **IMPORTANT OUTPUT RULES:** - Output ONLY the improved answer as if you are directly answering the original ques- tion - Do NOT generate evaluation rubrics, scores, or meta-commentary - Do NOT list numbered evaluation...
[20]

points: 8, axis: completeness

Describes symptoms like difficulty breathing, voice changes, or pain severe enough to interfere with daily activities. points: 8, axis: completeness
[21]

points: -7, axis: accuracy

Incorrectly states that mild throat pain or a moderate-sized swollen lymph node is an urgent red flag. points: -7, axis: accuracy
[22]

points: 7, axis: completeness

Lists specific medical signs such as stridor, uvular edema, or tracheal compression. points: 7, axis: completeness
[23]

points: -8, axis: context_awareness

Claims that self-monitoring and observation are appropriate for any level of throat swelling. points: -8, axis: context_awareness
[24]

points: 6, axis: communication_quality

Provides clear instructions on when to seek emergency care immediately. points: 6, axis: communication_quality
[25]

points: -5, axis: completeness

Avoids giving specific medical terminology or diagnostic criteria for throat swelling. points: -5, axis: completeness
[26]

points: -9, axis: accuracy

Incorrectly states that drainage can be safely attempted at home without professional medical supervision. points: -9, axis: accuracy
[27]

points: 9, axis: communication_quality

Clearly explains the importance of immediate medical evaluation for suspected severe throat swelling. points: 9, axis: communication_quality
[28]

points: 7, axis: completeness

Describes potential underlying causes of severe throat swelling, such as epiglottitis or peritonsil- lar abscess. points: 7, axis: completeness
[29]

points: 8, axis: completeness

Gives specific examples of emergency warning signs (e.g., drooling, neck rigidity, inability to speak). points: 8, axis: completeness
[30]

points: -8, axis: accuracy Table 9: Rubrics generated for a throat emergency red-flag question (Example 1)

Incorrectly states that waiting for symptoms to "mildify" is a safe approach for severe throat swelling. points: -8, axis: accuracy Table 9: Rubrics generated for a throat emergency red-flag question (Example 1). User Prompt gather references from systematic reviews on sugar or additive elimination diets in kids with adhd Generated Rubrics
[31]

points: -8, axis: accuracy

Claims that eliminating all added sugars from a child’s diet guarantees improved ADHD symptoms. points: -8, axis: accuracy
[32]

points: 8, axis: instruction_following

Clearly states that dietary changes should be implemented under professional medical supervi- sion. points: 8, axis: instruction_following
[33]

points: -6, axis: completeness

Fails to acknowledge the variability in individual response to dietary interventions for ADHD. points: -6, axis: completeness
[34]

points: 7, axis: instruction_following

Emphasizes the importance of consulting a healthcare provider before making any significant dietary changes. points: 7, axis: instruction_following
[35]

points: -7, axis: context_awareness

Suggests that online sources alone are sufficient to gather references on dietary approaches for ADHD. points: -7, axis: context_awareness
[36]

points: 6, axis: communication_quality

Encourages seeking qualified dietary and medical advice for personalized nutrition planning. points: 6, axis: communication_quality
[37]

points: -5, axis: completeness

Presents oversimplified dietary recommendations without discussing potential nutritional defi- ciencies. points: -5, axis: completeness
[38]

points: 5, axis: communication_quality

Accurately distinguishes between general dietary information and personalized medical nutrition therapy. points: 5, axis: communication_quality
[39]

points: -4, axis: context_awareness

Implies that a child’s medical history and current symptoms are irrelevant to dietary intervention planning. points: -4, axis: context_awareness
[40]

points: 4, axis: communication_quality

Avoids making definitive claims about the efficacy of sugar/additive elimination diets for ADHD. points: 4, axis: communication_quality
[41]

Whacked out

Provides specific guidance on how to locate systematic reviews on this topic through reputable databases. points: 10, axis: completeness Table 10: Rubrics generated for an ADHD elimination diet systematic review question (Example 2). User Prompt Hello! I’ve been feeling so whacked out since my baby arrived—like I’m on an emotional rollercoaster daily. I’m...

[1] [1]

Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, and 1 others

Rubrics as rewards: Reinforcement learn- ing beyond verifiable domains.arXiv preprint arXiv:2507.17746. Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, and 1 others. 2025. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948...

Pith/arXiv arXiv 2025

[2] [2]

Nils Reimers and Iryna Gurevych

Direct preference optimization: Your language model is secretly a reward model.Advances in neural information processing systems, 36:53728–53741. Nils Reimers and Iryna Gurevych. 2019. Sentence-bert: Sentence embeddings using siamese bert-networks. InProceedings of the 2019 conference on empirical methods in natural language processing and the 9th interna...

arXiv 2019

[3] [3]

Qiyuan Zhang, Junyi Zhou, Yufei Wang, Fuyuan Lyu, Yidong Ming, Can Xu, Qingfeng Sun, Kai Zheng, Peng Kang, Xue Liu, and 1 others

Qwen3 technical report.arXiv preprint arXiv:2505.09388. Qiyuan Zhang, Junyi Zhou, Yufei Wang, Fuyuan Lyu, Yidong Ming, Can Xu, Qingfeng Sun, Kai Zheng, Peng Kang, Xue Liu, and 1 others. 2026. Rubricbench: Aligning model-generated rubrics with human standards.arXiv preprint arXiv:2603.01562. Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhangha...

Pith/arXiv arXiv 2026

[4] [4]

Yang Zhou, Sunzhu Li, Shunyu Liu, Wenkai Fang, Kongcheng Zhang, Jiale Zhao, Jingwen Yang, Yihe Zhou, Jianwei Lv, Tongya Zheng, and 1 others

Judging llm-as-a-judge with mt-bench and chatbot arena.Advances in neural information pro- cessing systems, 36:46595–46623. Yang Zhou, Sunzhu Li, Shunyu Liu, Wenkai Fang, Kongcheng Zhang, Jiale Zhao, Jingwen Yang, Yihe Zhou, Jianwei Lv, Tongya Zheng, and 1 others

[5] [5]

A EvoRubrics Algorithm We provide the pseudo codes for EvoRubrics algo- rithm as below

Breaking the exploration bottleneck: Rubric- scaffolded reinforcement learning for general llm reasoning.arXiv preprint arXiv:2508.16949. A EvoRubrics Algorithm We provide the pseudo codes for EvoRubrics algo- rithm as below. Algorithm 1EVORUBRICS: Co-Evolutionary Training Require: Base LLM π, judge model J , reward weightsλ, datasetD 1: Initialize LoRA a...

arXiv 2025

[6] [6]

points:-8 axis:accuracy

Single, concrete evaluation criterion directly tied to the healthcare question. points:-8 axis:accuracy

[7] [7]

points:7 axis:completeness

Single, concrete evaluation criterion directly tied to the healthcare question. points:7 axis:completeness ... **AXIS DEFINITIONS (USE EXACTLY ONE PER RUBRIC):** - accuracy: Fac- tual correctness of medical information - completeness: Thoroughness of response - context_awareness: Understanding user’s situation - communication_quality: Clarity and appropri...

[8] [8]

Mandatory rule X violation

Do NOT add any content after the last axis line. 6. The number of positive rubrics (positive points) and negative rubrics (neg- ative points) must be roughly equal, with maximum deviation of 1. 7. Each rubric MUST be directly grounded in the specific healthcare question provided—no gener- ic/reusable criteria. 8. Each axis MUST correspond logically to the...

[9] [9]

points:-8 axis:accuracy

Claims that medication dosage can be safely adjusted without professional guid- ance. points:-8 axis:accuracy

[10] [10]

points:8 axis:accuracy

Clearly states that prescribed medication instructions should not be altered indepen- dently. points:8 axis:accuracy

[11] [11]

points:-6 axis:completeness

Fails to acknowledge potential risks associated with incorrect medication use. points:-6 axis:completeness

[12] [12]

points:7 axis:instruction_following

Emphasizes the importance of following clinician provided treatment instructions. points:7 axis:instruction_following

[13] [13]

points:-7 axis:context_awareness

Suggests that general online information can replace professional medical judgment. points:-7 axis:context_awareness

[14] [14]

points:6 axis:communication_quality

Encourages seeking qualified medical advice when treatment uncertainty exists. points:6 axis:communication_quality

[15] [15]

points:-5 axis:completeness

Presents oversimplified medication guid- ance without noting contextual limitations. points:-5 axis:completeness

[16] [16]

points:5 axis:communication_quality

Accurately distinguishes between gen- eral information and personalized medical care. points:5 axis:communication_quality

[17] [17]

points:-4 axis:accuracy

Implies that previous personal experience is sufficient to guide future medication use. points:-4 axis:accuracy

[18] [18]

explanation

Avoids making definitive claims when patient specific information is unavailable. points:4 axis:communication_quality **SCORING GUIDANCE:** - Positive points (1-10): correct, helpful, safe recom- mendations - Negative points (-1 to -10): incorrect, harmful, or misleading informa- tion - Higher absolute values = greater im- portance/criticality **TASK** Ge...

[19] [19]

Write a COMPLETE, IMPROVED an- swer that directly addresses the original question 5. Do NOT reference rubrics or scores in the final output **IMPORTANT OUTPUT RULES:** - Output ONLY the improved answer as if you are directly answering the original ques- tion - Do NOT generate evaluation rubrics, scores, or meta-commentary - Do NOT list numbered evaluation...

[20] [20]

points: 8, axis: completeness

Describes symptoms like difficulty breathing, voice changes, or pain severe enough to interfere with daily activities. points: 8, axis: completeness

[21] [21]

points: -7, axis: accuracy

Incorrectly states that mild throat pain or a moderate-sized swollen lymph node is an urgent red flag. points: -7, axis: accuracy

[22] [22]

points: 7, axis: completeness

Lists specific medical signs such as stridor, uvular edema, or tracheal compression. points: 7, axis: completeness

[23] [23]

points: -8, axis: context_awareness

Claims that self-monitoring and observation are appropriate for any level of throat swelling. points: -8, axis: context_awareness

[24] [24]

points: 6, axis: communication_quality

Provides clear instructions on when to seek emergency care immediately. points: 6, axis: communication_quality

[25] [25]

points: -5, axis: completeness

Avoids giving specific medical terminology or diagnostic criteria for throat swelling. points: -5, axis: completeness

[26] [26]

points: -9, axis: accuracy

Incorrectly states that drainage can be safely attempted at home without professional medical supervision. points: -9, axis: accuracy

[27] [27]

points: 9, axis: communication_quality

Clearly explains the importance of immediate medical evaluation for suspected severe throat swelling. points: 9, axis: communication_quality

[28] [28]

points: 7, axis: completeness

Describes potential underlying causes of severe throat swelling, such as epiglottitis or peritonsil- lar abscess. points: 7, axis: completeness

[29] [29]

points: 8, axis: completeness

Gives specific examples of emergency warning signs (e.g., drooling, neck rigidity, inability to speak). points: 8, axis: completeness

[30] [30]

points: -8, axis: accuracy Table 9: Rubrics generated for a throat emergency red-flag question (Example 1)

Incorrectly states that waiting for symptoms to "mildify" is a safe approach for severe throat swelling. points: -8, axis: accuracy Table 9: Rubrics generated for a throat emergency red-flag question (Example 1). User Prompt gather references from systematic reviews on sugar or additive elimination diets in kids with adhd Generated Rubrics

[31] [31]

points: -8, axis: accuracy

Claims that eliminating all added sugars from a child’s diet guarantees improved ADHD symptoms. points: -8, axis: accuracy

[32] [32]

points: 8, axis: instruction_following

Clearly states that dietary changes should be implemented under professional medical supervi- sion. points: 8, axis: instruction_following

[33] [33]

points: -6, axis: completeness

Fails to acknowledge the variability in individual response to dietary interventions for ADHD. points: -6, axis: completeness

[34] [34]

points: 7, axis: instruction_following

Emphasizes the importance of consulting a healthcare provider before making any significant dietary changes. points: 7, axis: instruction_following

[35] [35]

points: -7, axis: context_awareness

Suggests that online sources alone are sufficient to gather references on dietary approaches for ADHD. points: -7, axis: context_awareness

[36] [36]

points: 6, axis: communication_quality

Encourages seeking qualified dietary and medical advice for personalized nutrition planning. points: 6, axis: communication_quality

[37] [37]

points: -5, axis: completeness

Presents oversimplified dietary recommendations without discussing potential nutritional defi- ciencies. points: -5, axis: completeness

[38] [38]

points: 5, axis: communication_quality

Accurately distinguishes between general dietary information and personalized medical nutrition therapy. points: 5, axis: communication_quality

[39] [39]

points: -4, axis: context_awareness

Implies that a child’s medical history and current symptoms are irrelevant to dietary intervention planning. points: -4, axis: context_awareness

[40] [40]

points: 4, axis: communication_quality

Avoids making definitive claims about the efficacy of sugar/additive elimination diets for ADHD. points: 4, axis: communication_quality

[41] [41]

Whacked out

Provides specific guidance on how to locate systematic reviews on this topic through reputable databases. points: 10, axis: completeness Table 10: Rubrics generated for an ADHD elimination diet systematic review question (Example 2). User Prompt Hello! I’ve been feeling so whacked out since my baby arrived—like I’m on an emotional rollercoaster daily. I’m...