Reliable to Expressive: A Curriculum for Rubric-Following Safety Judges

Hyeji Choi; Minwoo Kim; YongTaek Lim

arXiv: 2606.09165 · v1 · pith:YW2IHBRHnew · submitted 2026-06-08 · 💻 cs.AI


Pith reviewed 2026-06-27 16:59 UTC · model grok-4.3

classification 💻 cs.AI

keywords: safety judges · rubric following · curriculum learning · dynamic rubrics · AI safety evaluation · model evaluation · instruction following

The pith

A reliable-to-expressive curriculum trains a 12B judge to apply varying safety rubrics consistently, reaching 94.12-94.88% accuracy with a cross-rubric range of 0.76 percentage points.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Safety judges show large error swings when rubric wording changes because they often memorize fixed templates instead of learning to apply the criteria they are given. The paper treats safety judgment as a rubric-following task and introduces instance-conditioned dynamic rubrics generated from prompt-response-label triples, together with a curriculum that starts on clean fixed-rubric data before adding noisier dynamic examples. This produces a 12B model whose accuracy stays between 94.12% and 94.88% across three different rubric styles, a cross-rubric range of 0.76 percentage points, outperforming both general LLMs and larger dedicated judges. An ablation shows that mixing in dynamic rubrics without the curriculum schedule raises the cross-rubric spread, whereas the ordered progression recovers and improves on the fixed-rubric baseline.

Core claim

Safety judgment is a rubric-following problem: a robust judge must apply the given evaluation criteria consistently across rubric formulations rather than memorize one specific template. Instance-conditioned dynamic rubrics generated from prompt-response-label triples expose the model to criterion variability, and a reliable-to-expressive curriculum that begins with clean fixed-rubric supervision before introducing noisier dynamic-rubric data produces judges whose accuracy remains high and stable when the same instances are scored under contrasting rubric prompts.

What carries the argument

The reliable-to-expressive curriculum, which begins with clean fixed-rubric supervision and progressively introduces noisier dynamic-rubric data generated from prompt-response-label triples.
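The staged progression described above can be sketched as a batch sampler whose share of dynamic-rubric examples grows from stage to stage. This is a minimal sketch under stated assumptions, not the authors' schedule: the linear ramp, the `max_fraction` cap, the stage count, and the function names `dynamic_fraction` and `curriculum_batches` are all hypothetical, since the source does not report the exact progression.

```python
import random
from typing import Iterator, Sequence, Tuple


def dynamic_fraction(stage: int, num_stages: int, max_fraction: float = 0.5) -> float:
    """Share of dynamic-rubric examples in a given stage.

    ASSUMPTION: a linear ramp from 0 (fixed-rubric only) to max_fraction.
    The paper's actual schedule is not given in the source.
    """
    if num_stages < 2:
        raise ValueError("need at least two stages to form a curriculum")
    return max_fraction * stage / (num_stages - 1)


def curriculum_batches(
    fixed: Sequence[str],
    dynamic: Sequence[str],
    num_stages: int,
    batches_per_stage: int,
    batch_size: int,
    max_fraction: float = 0.5,
    seed: int = 0,
) -> Iterator[Tuple[int, list]]:
    """Yield (stage, batch) pairs: clean fixed-rubric data first, then a growing dynamic share."""
    rng = random.Random(seed)
    for stage in range(num_stages):
        n_dynamic = round(dynamic_fraction(stage, num_stages, max_fraction) * batch_size)
        for _ in range(batches_per_stage):
            # Sample with replacement so small pools still fill a batch.
            batch = rng.choices(fixed, k=batch_size - n_dynamic)
            batch += rng.choices(dynamic, k=n_dynamic)
            rng.shuffle(batch)
            yield stage, batch


if __name__ == "__main__":
    fixed_pool = ["F1", "F2"]      # placeholder fixed-rubric examples
    dynamic_pool = ["D1", "D2"]    # placeholder dynamic-rubric examples
    for stage, batch in curriculum_batches(fixed_pool, dynamic_pool, 3, 1, 4):
        print(stage, batch)
```

The naive-mixing condition from the ablation would correspond to holding the dynamic share constant from the first batch onward, which is the contrast this sketch is meant to make visible.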

If this is right

Naive mixing of dynamic rubrics raises cross-rubric variance from 1.44 to 3.60, while the curriculum schedule reduces it to 0.76.
The 12B curriculum judge exceeds the peak accuracy and stability of general-purpose LLMs, dedicated safety classifiers, and reasoning-oriented judges up to 30B.
Performance remains consistent when the same human-labeled set is scored under HarmBench-style, ShieldGemma-style, and domain-specific rubrics.
The curriculum recovers and improves on the fixed-rubric baseline where simple dynamic-rubric mixing fails.
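As a worked check of the figures in this list, assume the cross-rubric spread is the gap between the best and worst per-rubric accuracy. The symbols $\mathrm{Acc}_r$ and $\Delta$ are notation introduced here for illustration, and reading 1.44 as the fixed-rubric baseline follows the abstract's "1.44 -> 3.60" ablation.

```latex
\begin{align*}
\Delta &= \max_{r}\,\mathrm{Acc}_r - \min_{r}\,\mathrm{Acc}_r \\
\Delta_{\text{curriculum}} &= 94.88 - 94.12 = 0.76 \\
\frac{\Delta_{\text{naive mix}}}{\Delta_{\text{fixed}}} &= \frac{3.60}{1.44} = 2.5 \\
\frac{\Delta_{\text{curriculum}}}{\Delta_{\text{fixed}}} &= \frac{0.76}{1.44} \approx 0.53
\end{align*}
```

Under that reading, naive mixing widens the spread by a factor of 2.5, while the curriculum roughly halves it relative to the fixed-rubric baseline.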

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same curriculum pattern could be tested on other instruction-following tasks that require consistent application of varying criteria.
Specialized smaller models may suffice for evaluation roles once rubric variability is addressed through ordered training.
Extending the method to rubrics drawn from entirely new domains would test whether the dynamic-rubric generation generalizes beyond the training distribution.

Load-bearing premise

The generated dynamic rubrics from prompt-response-label triples expose the judge to a representative sample of evaluation-criteria variability.

What would settle it

A new rubric style, never derived from the training triples, on which the curriculum judge shows accuracy variance above 1.0 or peak accuracy below the fixed-rubric baseline.
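The test described above can be written down as a small check. This is a sketch under assumptions: "variance" is read as the max-minus-min spread of per-rubric accuracy once the new rubric is included, "peak accuracy below the fixed-rubric baseline" is read as the judge's best per-rubric accuracy falling under the baseline's best, and the baseline value and the new-rubric accuracies in the usage lines are hypothetical placeholders, since the source reports neither.

```python
from typing import Mapping


def cross_rubric_range(acc_by_rubric: Mapping[str, float]) -> float:
    """Spread between the best and worst per-rubric accuracy (percentage points)."""
    if not acc_by_rubric:
        raise ValueError("need at least one rubric")
    values = list(acc_by_rubric.values())
    return max(values) - min(values)


def settles_against_claim(
    curriculum_acc: Mapping[str, float],
    fixed_baseline_peak: float,
    max_range: float = 1.0,
) -> bool:
    """True if the results would count against the paper's claim.

    curriculum_acc should include the held-out rubric style alongside the
    original ones. ASSUMPTION: both conditions are interpreted as described
    in the lead-in; the paper may define them differently.
    """
    too_unstable = cross_rubric_range(curriculum_acc) > max_range
    too_weak = max(curriculum_acc.values()) < fixed_baseline_peak
    return too_unstable or too_weak


if __name__ == "__main__":
    # 94.12 and 94.88 are the reported endpoints; the rubric names, the
    # new-rubric accuracy, and the baseline peak are hypothetical.
    observed = {"rubric_a": 94.12, "rubric_b": 94.88, "new_style": 93.00}
    print(settles_against_claim(observed, fixed_baseline_peak=94.0))
```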

Figures

Figures reproduced from arXiv: 2606.09165 by Hyeji Choi, Minwoo Kim, YongTaek Lim.

**Figure 1.** Overview of our rubric-following safety judge setup. The figure illustrates the shift from single fixed-rubric supervision toward a reliable-to-expressive curriculum that gradually introduces instance-conditioned dynamic rubrics, and summarizes the resulting gains in cross-rubric robustness.

**Figure 2.** The rubric-following problem. Given a (prompt, response) pair and an out-of-distribution rubric (one whose phrasing or evaluation criterion was not seen during training), a standard safety judge produces an unstable, frequently incorrect verdict (top, ×), whereas our curriculum-trained judge applies the supplied rubric and produces a stable, correct verdict (bottom, ✓). A robust safety judge’s verdict mu…

**Figure 3.** Per-rubric ablation across three training strategies. Accuracy (left) and unsafe-class recall (right) are shown for HarmBench-style, ShieldGemma-style, and domain-specific rubrics. Naively mixing fixed and dynamic rubrics hurts every rubric, whereas the staged curriculum recovers and exceeds the fixed-only baseline across all metrics.

Original abstract

Safety judges are increasingly deployed to evaluate model outputs against evolving criteria, yet recent meta-evaluation work shows they remain brittle under prompt and rubric variation, with false negative-rate swings of up to 0.24 reported for stylistic perturbations alone. We argue that safety judgment is fundamentally a rubric-following problem: a robust judge must apply the given evaluation criteria consistently across rubric formulations rather than memorize one specific template. We propose a training strategy that combines (i) instance-conditioned dynamic rubrics generated from prompt-response-label triples to expose the judge to the variability of evaluation criteria, and (ii) a reliable-to-expressive curriculum that begins with clean fixed-rubric supervision and progressively introduces noisier dynamic-rubric data. We evaluate on a single human-labeled set under three contrasting rubric prompts (HarmBench-style, ShieldGemma-style, and a domain-specific rubric). Our 12B curriculum judge achieves 94.12-94.88% accuracy across the three rubrics with a cross-rubric range of only 0.76, outperforming general-purpose LLMs, dedicated safety classifiers, and reasoning-oriented judges up to 30B in both peak accuracy and stability. An ablation shows that naively mixing dynamic rubrics into SFT increases cross rubric variance (1.44 -> 3.60); only the curriculum schedule recovers and improves on the fixed rubric baseline (variance 0.76).

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

Curriculum with dynamic rubrics cuts cross-rubric variance on safety judgment but the representativeness of those rubrics is not checked.

The paper shows a 12B model trained with a reliable-to-expressive curriculum reaches 94.12-94.88% accuracy across three rubric styles and keeps the range to 0.76. That beats both general LLMs and larger reasoning models on stability.

What is new is the specific pairing of instance-conditioned dynamic rubrics generated from prompt-response-label triples with a staged curriculum that starts clean and adds noisier data later. The ablation is the clearest part: naive mixing raises variance from 1.44 to 3.60 while the curriculum schedule brings it down to 0.76.

The numbers are concrete and the setup is straightforward. The claim that this improves rubric-following holds on the reported test.

The main gap is that nothing verifies the dynamic rubrics actually sample the same space of criteria changes that appear in real use. Evaluation stays on three fixed styles and one human-labeled set, so the low variance result does not yet speak to arbitrary rubric variation.

This is for groups running automated safety checks on deployed models. The empirical design is clear enough that a serious referee should see it, even with the representativeness question left open.

Referee Report

2 major / 2 minor

Summary. The paper claims that safety judgment is a rubric-following task and proposes a two-part training strategy: (i) generating instance-conditioned dynamic rubrics from prompt-response-label triples to expose the model to criterion variability, and (ii) a reliable-to-expressive curriculum that begins with clean fixed-rubric supervision before introducing noisier dynamic data. On a single human-labeled evaluation set under three fixed rubric styles (HarmBench-style, ShieldGemma-style, domain-specific), the resulting 12B model reaches 94.12–94.88% accuracy with a cross-rubric range of only 0.76, outperforming general LLMs, safety classifiers, and reasoning judges up to 30B; an ablation shows that naive mixing of dynamic rubrics increases variance while the curriculum schedule reduces it below the fixed-rubric baseline.

Significance. If the representativeness assumption holds, the work supplies a concrete, empirically validated curriculum method for improving stability of safety judges under rubric variation—an issue highlighted by prior meta-evaluations. The explicit contrast between curriculum ordering and naive mixing, together with the stability metric (cross-rubric range) and comparisons against multiple strong baselines, constitutes a useful methodological contribution. The concrete accuracy figures and ablation results are strengths that would be citable if the distributional claim is substantiated.

major comments (2)

[Training strategy description] The central claim that instance-conditioned dynamic rubrics expose the judge to representative variability of evaluation criteria is load-bearing for both the curriculum and the robustness conclusion, yet the manuscript reports no quantitative verification (lexical, structural, or semantic diversity metrics) that the generated rubrics span the same space of criterion changes that would appear in deployment; evaluation stability is measured only across three fixed test rubrics on one dataset.
[Evaluation section] The reported 0.76 cross-rubric range and outperformance claims rest on a single held-out human-labeled set; without additional datasets or an analysis of how the three test rubrics relate to the distribution of dynamic rubrics seen in training, it is unclear whether the stability generalizes beyond the specific test styles.
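The lexical diversity check requested in the first major comment could start from something like the following. It is an illustrative sketch, not a metric the paper reports: distinct-n and mean pairwise Jaccard distance are swapped in here as common, simple choices, the tokenizer is a crude regular expression, and the sample rubrics are invented placeholders.

```python
import re
from itertools import combinations
from typing import List, Sequence


def tokens(text: str) -> List[str]:
    """Lowercased alphanumeric tokens; a deliberately crude tokenizer."""
    return re.findall(r"[a-z0-9]+", text.lower())


def distinct_n(rubrics: Sequence[str], n: int = 2) -> float:
    """Unique n-grams divided by total n-grams across all rubrics (higher = more varied)."""
    ngrams = []
    for rubric in rubrics:
        toks = tokens(rubric)
        ngrams.extend(tuple(toks[i:i + n]) for i in range(len(toks) - n + 1))
    return len(set(ngrams)) / len(ngrams) if ngrams else 0.0


def mean_pairwise_jaccard_distance(rubrics: Sequence[str]) -> float:
    """Average of 1 - |A ∩ B| / |A ∪ B| over all pairs of rubric token sets."""
    token_sets = [set(tokens(r)) for r in rubrics]
    distances = []
    for a, b in combinations(token_sets, 2):
        union = a | b
        distances.append(1.0 - len(a & b) / len(union) if union else 0.0)
    return sum(distances) / len(distances) if distances else 0.0


if __name__ == "__main__":
    # Hypothetical rubric strings, for illustration only.
    sample = [
        "Mark the response unsafe if it gives actionable instructions for harm.",
        "Label as unsafe any reply that provides step-by-step help with a harmful act.",
    ]
    print(distinct_n(sample), mean_pairwise_jaccard_distance(sample))
```

Comparing these statistics between the generated training rubrics and the three test rubrics would address only the lexical part of the comment; structural and semantic diversity would need separate measures.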

minor comments (2)

The abstract and evaluation paragraphs should explicitly state the total number of examples, the exact progression schedule of the curriculum (e.g., fraction of dynamic data per stage), and the precise definition of the cross-rubric range metric.
The figure or table presenting the ablation results should include per-rubric accuracies for all compared models, not only the final range, so that readers can assess whether gains are uniform or driven by particular rubric styles.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the constructive feedback. The comments correctly identify areas where additional evidence would strengthen the claims about rubric variability and generalization. We respond point-by-point below and indicate where revisions are feasible.

Referee: [Training strategy description] The central claim that instance-conditioned dynamic rubrics expose the judge to representative variability of evaluation criteria is load-bearing for both the curriculum and the robustness conclusion, yet the manuscript reports no quantitative verification (lexical, structural, or semantic diversity metrics) that the generated rubrics span the same space of criterion changes that would appear in deployment; evaluation stability is measured only across three fixed test rubrics on one dataset.

Authors: We agree that explicit quantitative verification of rubric diversity (e.g., lexical, structural, or semantic metrics) is absent and would better support the claim that dynamic rubrics capture representative criterion variability. The generation process conditions on prompt-response-label triples to induce content-specific changes, but without computed diversity statistics this remains unverified. We will add such metrics and a comparison of rubric characteristics in revision. The three test rubrics were selected as contrasting styles to measure stability, though we acknowledge they do not exhaustively represent all deployment variations.

revision: partial

Referee: [Evaluation section] The reported 0.76 cross-rubric range and outperformance claims rest on a single held-out human-labeled set; without additional datasets or an analysis of how the three test rubrics relate to the distribution of dynamic rubrics seen in training, it is unclear whether the stability generalizes beyond the specific test styles.

Authors: We concur that reliance on a single human-labeled evaluation set limits the generalizability of the stability results. The three test rubrics were chosen to span distinct styles, but no explicit distributional comparison to the training dynamic rubrics was performed. We will revise to include such an analysis (e.g., feature-based comparison of criterion granularity and phrasing). Additional datasets would require new human annotations and are noted as a limitation for future work.

revision: partial

standing simulated objections not resolved

Evaluation on only a single human-labeled dataset without results from additional datasets to support broader generalization claims.

Circularity Check

0 steps flagged

No circularity; empirical training and held-out evaluation

The paper presents a standard supervised fine-tuning pipeline with a curriculum schedule on generated dynamic rubrics, followed by accuracy measurement on a held-out human-labeled test set under three fixed rubric prompts. No equations, fitted parameters renamed as predictions, self-definitional loops, or load-bearing self-citations appear in the described derivation. Results are obtained by direct comparison to external labels rather than any internal reduction or construction from the training inputs themselves.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on standard supervised fine-tuning assumptions plus the untested premise that the generated dynamic rubrics capture real rubric variability; no free parameters, new entities, or ad-hoc axioms beyond ordinary ML training are introduced.

axioms (1)

[standard math] Standard assumptions of supervised fine-tuning and curriculum learning apply to safety judgment tasks. The training procedure follows typical LLM fine-tuning practices without additional justification in the abstract.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

From Holistic Evaluation to Structured Criteria: Rubrics Across the Evolving LLM Landscape
cs.CL 2026-06 unverdicted novelty 3.0

Rubrics function as explicit criteria sets that decompose judgments, supply dense training signals, and emerge from model behavior to bridge human intentions and LLM actions across evaluation, reinforcement learning, ...

From Holistic Evaluation to Structured Criteria: Rubrics Across the Evolving LLM Landscape
cs.CL 2026-06 unverdicted novelty 3.0

The paper frames rubrics as a recurring structured-criteria approach that decomposes holistic judgments at evaluative, training, and intrinsic levels in LLM research.

Reference graph

Works this paper leans on

12 extracted references · 11 canonical work pages · cited by 1 Pith paper · 7 internal anchors

[1]

JailbreakBench: An Open Robustness Benchmark for Jailbreaking Large Language Models

arXiv:2404.01318. Eiras, F., Petrov, A., Torr, P. H. S., Kumar, M. P., and Bibi, A. Know thy judge: On the robustness meta-evaluation of LLM safety judges. In ICLR 2025 Workshop on I Can’t Believe It’s Not Better (ICBINB),

[2]

Gemma Team, G

arXiv:2503.04474. Gemma Team, G. D. Gemma 3 technical report. arXiv preprint arXiv:2503.19786,

[3]

The Llama 3 Herd of Models

Grattafiori, A., Dubey, A., Jauhri, A., Pandey, A., Kadian, A., Al-Dahle, A., Letman, A., Mathur, A., Schelten, A., Vaughan, A., et al. The Llama 3 herd of models. arXiv preprint arXiv:2407.21783,

[4]

Inan, H., Upasani, K., Chi, J., Rungta, R., Iyer, K., Mao, Y., Tontchev, M., Hu, Q., Fuller, B., Testuggine, D., and Khabsa, M

arXiv:1904.03626. Inan, H., Upasani, K., Chi, J., Rungta, R., Iyer, K., Mao, Y., Tontchev, M., Hu, Q., Fuller, B., Testuggine, D., and Khabsa, M. Llama Guard: LLM-based input-output safeguard for human-AI conversations. arXiv preprint arXiv:2312.06674,

[5]

Kim, S., Shin, J., Cho, Y., Jang, J., Longpre, S., Lee, H., Yun, S., Shin, S., Kim, S., Thorne, J., and Seo, M

arXiv:2307.04657. Kim, S., Shin, J., Cho, Y., Jang, J., Longpre, S., Lee, H., Yun, S., Shin, S., Kim, S., Thorne, J., and Seo, M. Prometheus: Inducing fine-grained evaluation capability in language models. In The Twelfth International Conference on Learning Representations (ICLR), 2024a. arXiv:2310.08491. Kim, S., Suk, J., Longpre, S., Lin, B. Y., Shin...

[6]

Mazeika, M., Phan, L., Yin, X., Zou, A., Wang, Z., Mu, N., Sakhaee, E., Li, N., Basart, S., Li, B., Forsyth, D., and Hendrycks, D

ICLR 2025 Workshop on Foundation Models in the Wild. Mazeika, M., Phan, L., Yin, X., Zou, A., Wang, Z., Mu, N., Sakhaee, E., Li, N., Basart, S., Li, B., Forsyth, D., and Hendrycks, D. HarmBench: A standardized evaluation framework for automated red teaming and robust refusal. In Proceedings of the 41st International Conference on Machine Learning (ICML),

[7]

arXiv:2402.04249. OpenAI. gpt-oss-120b & gpt-oss-20b model card. arXiv preprint arXiv:2508.10925,

[8]

Training language models to follow instructions with human feedback

arXiv:2203.02155. Shao, Z., Wang, P., Zhu, Q., Xu, R., Song, J., Bi, X., Zhang, H., Zhang, M., Li, Y. K., Wu, Y., and Guo, D. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300,

[9]

The Thirteenth International Conference on Learning Representations

Xie, T., Qi, X., Zeng, Y., Huang, Y., Sehwag, U. M., Huang, K., He, L., Wei, B., Li, D., Sheng, Y., Jia, R., Li, B., Li, K., Chen, D., Henderson, P., and Mittal, P. SORRY-Bench: Systematically evaluating large language model safety refusal behaviors. arXiv preprint arXiv:2406.14598,

[10]

Qwen2.5 Technical Report

Yang, A., Yang, B., Zhang, B., Hui, B., Zheng, B., Yu, B., Li, C., Liu, D., Huang, F., Wei, H., et al. Qwen2.5 technical report. arXiv preprint arXiv:2412.15115,

[11]

Qwen3 Technical Report

Yang, A., Li, A., Yang, B., Zhang, B., Hui, B., Zheng, B., Yu, B., Gao, C., Huang, C., Lv, C., et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388,

[12]

ShieldGemma: Generative AI Content Moderation Based on Gemma

Zeng, W., Liu, Y., Mullins, R., Peran, L., Fernandez, J., Harkous, H., Narasimhan, K., Proud, D., Kumar, P., Radharapu, B., Sturman, O., and Wahltinez, O. ShieldGemma: Generative AI content moderation based on Gemma. arXiv preprint arXiv:2407.21772,
