pith. machine review for the scientific record.

arxiv: 2603.21362 · v3 · submitted 2026-03-22 · 💻 cs.AI · cs.CL

Recognition: 2 theorem links · Lean Theorem

AdaRubric: Task-Adaptive Rubrics for Reliable LLM Agent Evaluation and Reward Learning

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 06:41 UTC · model grok-4.3

classification 💻 cs.AI cs.CL
keywords LLM agents · task-adaptive rubrics · agent evaluation · preference learning · DPO · human correlation · WebArena · ToolBench

The pith

Task-adaptive rubrics generated by LLMs produce agent evaluations that match human judgments more closely and yield better preference data for DPO training.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

AdaRubric creates rubrics whose dimensions are generated fresh from each task description instead of applying one fixed set of criteria to every agent trajectory. On WebArena, ToolBench, and AgentBench this produces step-by-step scores whose aggregate correlates with human raters at Pearson r = 0.79, 0.15 higher than the best static-rubric baseline, and whose Krippendorff alpha reaches 0.83. The same scored trajectories are filtered into preference pairs that, when used for DPO, raise downstream task success rates by 6.8 to 8.5 percent. The method runs unchanged on unseen domains and on multimodal agents.
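Read as a pipeline, the loop above is compact enough to sketch. The Python below is a reconstruction from this summary, not the paper's released code: every function name, the prompt contents, and the confidence-weighted aggregation rule are assumptions of the sketch.

```python
# Reconstruction of the three-stage loop described above; names, prompts,
# and the aggregation rule are assumptions of this sketch, not the paper's API.
from dataclasses import dataclass

@dataclass
class Rubric:
    dimensions: list[str]   # generated fresh from each task description
    weights: list[float]    # per-dimension weights, assumed to sum to 1

def generate_rubric(llm, task_description: str, n_dims: int = 5) -> Rubric:
    """Stage 1: one LLM call that returns n_dims task-specific dimensions."""
    raise NotImplementedError  # rubric-generation prompt goes here

def score_steps(llm, rubric: Rubric, trajectory: list[str]) -> list[dict]:
    """Stage 2: per step, a dict {dim: (score 1-5, confidence 0-1)}."""
    raise NotImplementedError  # per-step judging prompt goes here

def aggregate(step_scores: list[dict], rubric: Rubric) -> float:
    """Confidence-weighted mean over steps and dimensions (assumed rule)."""
    total = norm = 0.0
    for step in step_scores:
        for dim, weight in zip(rubric.dimensions, rubric.weights):
            score, conf = step[dim]
            total += weight * conf * score
            norm += weight * conf
    return total / max(norm, 1e-9)
```

Stage 3, the filtering of scored trajectories into DPO pairs, is sketched under "What carries the argument" below.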

Core claim

AdaRubric adaptively generates task-specific evaluation rubrics from task descriptions via LLM, evaluates agent trajectories step-by-step with confidence-weighted, per-dimension scoring, and produces dense reward signals for preference learning. Three composable filtering strategies, including the novel DimensionAwareFilter, yield high-quality DPO preference pairs that improve both evaluation reliability and trained-agent performance.

What carries the argument

Task-adaptive rubric generation, together with the DimensionAwareFilter, which prevents any single dimension from masking low quality in the others.
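The paper places the formal version of the filter in its Section 3.2; the sketch below is one plausible reading of "prevents masking", with illustrative thresholds that are not the paper's settings: a pair qualifies only if the chosen trajectory wins the weighted aggregate by a margin and clears a floor on every dimension.

```python
# One plausible reading of the DimensionAwareFilter, reconstructed from the
# description above; margin and floor values are illustrative, not the
# paper's settings (its formal version is in Section 3.2 / Appendix B).
def dimension_aware_filter(
    chosen: dict[str, float],    # per-dimension mean scores, 1-5 scale
    rejected: dict[str, float],
    weights: dict[str, float],
    margin: float = 0.5,
    floor: float = 3.0,
) -> bool:
    """Admit a (chosen, rejected) DPO pair only if the preference is clear
    and no weak dimension hides behind a strong one."""
    agg = lambda s: sum(weights[d] * s[d] for d in weights)
    if agg(chosen) - agg(rejected) < margin:
        return False  # margin gate: the preference is not clear enough
    # masking gate: every dimension of the winner must clear the floor, so
    # a 5 on Fluency cannot carry a 1 on Correctness into the pair set
    return all(chosen[d] >= floor for d in weights)
```

Under this reading, a trajectory that aces one dimension while failing another never becomes a "chosen" example, whatever its weighted mean; that is exactly the masking failure the filter is named for.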

If this is right

  • Evaluation reliability rises to Pearson r = 0.79 and Krippendorff alpha = 0.83 across WebArena, ToolBench, and AgentBench (both statistics sketched in code after this list).
  • DPO models trained on the filtered pairs improve task success by 6.8 to 8.5 percent over the strongest baseline.
  • The same rubric pipeline transfers zero-shot to SWE-bench and to multimodal settings such as VisualWebArena and OSWorld.
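For concreteness, a sketch of how the two reliability statistics in the first bullet are conventionally computed. The toy arrays are placeholders, not the paper's data, and the third-party krippendorff package (with the API shown) is this sketch's assumption, not the paper's tooling.

```python
# Conventional computation of the quoted reliability statistics on toy data.
import numpy as np
from scipy.stats import pearsonr, spearmanr
import krippendorff  # pip install krippendorff (third-party; an assumption)

model = np.array([4.5, 3.0, 2.5, 4.0, 1.5])   # aggregate rubric scores
human = np.array([4.0, 3.5, 2.0, 4.5, 1.0])   # human ratings, same items

r, _ = pearsonr(model, human)      # linear agreement (paper reports 0.79)
rho, _ = spearmanr(model, human)   # rank agreement (cf. Figure 8)

# Krippendorff's alpha: rows = raters (or repeated judge runs), columns =
# items; np.nan would mark missing ratings. "interval" treats 1-5 as a scale.
ratings = np.array([[4, 3, 2, 4, 1],
                    [4, 3, 3, 5, 1],
                    [5, 3, 2, 4, 2]], dtype=float)
alpha = krippendorff.alpha(reliability_data=ratings,
                           level_of_measurement="interval")
print(f"r={r:.2f}  rho={rho:.2f}  alpha={alpha:.2f}")
```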

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Static rubrics appear to be a systematic source of mis-evaluation that limits the quality of reward signals available for agent training.
  • If rubric adaptation proves stable across model families, it could reduce reliance on human-written rubrics for large-scale preference collection.
  • The per-dimension scoring and filtering approach may generalize to other structured evaluation settings where quality varies sharply across axes.

Load-bearing premise

An LLM can produce task-specific rubric dimensions and per-step scores reliable enough to survive the filtering steps, and the resulting advantage is not an artifact of using the same model family for both rubric generation and judging.

What would settle it

A controlled experiment in which human raters score the same set of trajectories once with AdaRubric-generated rubrics and once with static rubrics. The core claim falls if the human-correlation advantage disappears under that control, or if DPO models trained on the resulting pairs fail to outperform static-rubric baselines.

Figures

Figures reproduced from arXiv: 2603.21362 by Liang Ding.

Figure 1
Figure 1. Static evaluation vs. AdaRubric. Static LLM-as-Judge applies identical dimensions to all tasks, yielding weak human correlation (r ≈ 0.46); AdaRubric synthesises task-specific rubrics from the task description, achieving r ≈ 0.77. Pearson r averaged over 300 held-out trajectory pairs per benchmark.
Figure 2
Figure 2. AdaRubric pipeline. Stage 1 synthesises a task-adaptive rubric. Stage 2 evaluates trajectories step-by-step with confidence weights. Stage 3 applies composable filters. The reward-synthesis branch generates margin-gated DPO preference pairs.
Figure 3
Figure 3. Complete AdaRubric pipeline. All three stages are modular; any LLM can serve as M. Rubric validation: generated rubrics pass three automated checks: (i) dimension names are non-overlapping (>0.3 cosine distance); (ii) weights sum to 1 within 1%; (iii) all five scoring levels are populated. Rubrics failing validation trigger a single retry; persistent failures fall back to a domain-specific template rubric. (A code sketch of these checks follows the figure list.)
Figure 4
Figure 4. Human correlation comparison. AdaRubric-DA achieves r = 0.79 / 0.74 on WebArena / ToolBench (highlighted row; large dot markers at bar tips). Dashed line = GPT-4 Direct baseline.
Figure 5
Figure 5. DPO training quality vs. number of pairs. AdaRubric-DA consistently outperforms all baselines across data regimes; diminishing returns appear beyond 6K pairs. Ada-WM uses weighted-mean aggregation; Random and Prometheus are baselines.
Figure 6
Figure 6. Per-dimension reliability (lollipop plot). Circles = AdaRubric; triangles = G-Eval (static). All AdaRubric dimensions reach or closely approach α = 0.80 (dashed); Correctness (α = 0.79) marginally falls short, reflecting the inherent difficulty of holistic correctness judgment. Task-specific dimensions substantially reduce run-to-run ambiguity.
Figure 7
Figure 7. Effect of number of dimensions N. Performance peaks at N* = 5 (dashed guide line) for both benchmarks. Too few dimensions under-cover task criteria; too many introduce redundancy and evaluator confusion.
Figure 8
Figure 8. Score calibration. Buckets 1–5 correlate strongly (r = 0.79) with human percentile rankings; dashed line = perfect calibration; bars show ±1 std. The near-linear relationship (Spearman ρ = 0.98, p < 0.001) confirms that the 1–5 scale is meaningfully calibrated: trajectories rated 5 fall in the 91st human percentile on average.
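Figure 3's validation checks are concrete enough to sketch. The version below is minimal: embed stands in for any sentence-embedding model (the paper does not say which it uses), and the thresholds follow the caption.

```python
# Minimal sketch of the three rubric checks in Figure 3's caption. `embed`
# is a placeholder for any sentence-embedding model; thresholds per caption.
import numpy as np

def validate_rubric(dimensions: list[str],
                    weights: list[float],
                    levels: dict[str, list[str]],
                    embed) -> bool:
    # (i) dimension names pairwise non-overlapping: cosine distance > 0.3
    vecs = [np.asarray(embed(name), dtype=float) for name in dimensions]
    for i in range(len(vecs)):
        for j in range(i + 1, len(vecs)):
            cos = vecs[i] @ vecs[j] / (
                np.linalg.norm(vecs[i]) * np.linalg.norm(vecs[j]))
            if 1.0 - cos <= 0.3:
                return False
    # (ii) weights sum to 1 within 1%
    if abs(sum(weights) - 1.0) > 0.01:
        return False
    # (iii) all five scoring levels populated for every dimension
    return all(len(levels.get(d, [])) == 5 for d in dimensions)
```

Per the caption, a rubric failing any check triggers a single retry; a second failure falls back to a domain-specific template.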
read the original abstract

Evaluating LLM agent trajectories is fundamentally task-specific: a code-debugging agent should be judged on Correctness and Error Handling, not on Fluency or Safety. Yet the dominant paradigm -- LLM-as-Judge with a fixed rubric -- applies the same static dimensions regardless of task, producing systematic mis-evaluation. We present AdaRubric, a framework that (i) adaptively generates task-specific evaluation rubrics from task descriptions via LLM, (ii) evaluates agent trajectories step-by-step with confidence-weighted, per-dimension scoring, and (iii) produces dense reward signals for preference learning. Three composable filtering strategies, including the novel DimensionAwareFilter that provably prevents dimension-level quality masking, yield high-quality DPO preference pairs. On WebArena, ToolBench, and AgentBench, AdaRubric achieves Pearson r = 0.79 human correlation (+0.15 over the strongest baseline), with strong reliability (Krippendorff's alpha = 0.83). DPO models trained on AdaRubric-generated pairs improve task success by +6.8-8.5% over the best baseline. AdaRubric also generalises zero-shot to unseen domains (SWE-bench) and extends to multimodal agents (VisualWebArena, OSWorld) without modification. Our code is available at: github.com/alphadl/AdaRubrics

Editorial analysis

A structured set of objections, weighed in public.

A referee report, a simulated author's rebuttal, a circularity audit, and an axiom and free-parameter ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

4 major / 1 minor

Summary. The paper introduces AdaRubric, a framework that uses LLMs to adaptively generate task-specific evaluation rubrics from task descriptions, scores agent trajectories step-by-step with confidence-weighted per-dimension scores, and applies three composable filters (including the novel DimensionAwareFilter) to produce high-quality preference pairs for DPO. On WebArena, ToolBench, and AgentBench it reports Pearson r=0.79 human correlation (+0.15 over strongest baseline) and Krippendorff's alpha=0.83; DPO models trained on the resulting pairs yield +6.8-8.5% task success gains. It also claims zero-shot generalization to SWE-bench and extension to multimodal agents on VisualWebArena and OSWorld.

Significance. If the reported human correlation and DPO gains prove robust after addressing model-family overlap and providing ablations, the work would meaningfully advance LLM-agent evaluation by replacing fixed rubrics with task-adaptive ones and supplying denser, filterable reward signals for preference learning.

major comments (4)
  1. [Abstract] The headline Pearson r = 0.79 and +6.8-8.5% DPO gains are presented without error bars, confidence intervals, or any description of how human correlation was measured (number of annotators, exact protocol, or whether the judging LLM family was disjoint from the rubric-generation family).
  2. [Abstract] No ablation is reported on the three filtering strategies, so it is impossible to determine whether the DimensionAwareFilter (or any single component) is load-bearing for the +0.15 correlation lift or the DPO improvements.
  3. [Abstract] The claim that the DimensionAwareFilter 'provably prevents dimension-level quality masking' is presented as a core technical contribution, yet the abstract supplies neither the formal argument nor the section reference where the proof appears; without it the filtering advantage over baselines cannot be verified.
  4. [Abstract] Because rubric synthesis, per-step scoring, and all three filters are performed by the same LLM, the observed gains risk being artifacts of consistent model-family biases rather than the adaptivity mechanism; explicit cross-family experiments (generation vs. judging) are required to substantiate the central claim.
minor comments (1)
  1. The abstract states code is available at github.com/alphadl/AdaRubrics but provides no commit hash, environment details, or reproducibility instructions.

Simulated Author's Rebuttal

4 responses · 0 unresolved

We are grateful for the referee's feedback, which highlights important areas for improvement in clarity and robustness. We respond to each major comment below and commit to the indicated revisions.

read point-by-point responses
  1. Referee: [Abstract] The headline Pearson r = 0.79 and +6.8-8.5% DPO gains are presented without error bars, confidence intervals, or any description of how human correlation was measured (number of annotators, exact protocol, or whether the judging LLM family was disjoint from the rubric-generation family).

    Authors: We agree that the abstract should convey more methodological transparency. We will revise it to include error bars and confidence intervals for the reported Pearson r and DPO gains. We will also add a concise description of the human correlation protocol (annotator count, pairwise judgment with majority vote) and explicitly note the use of disjoint LLM families for rubric generation versus judging, as already detailed in Section 4.2 of the manuscript. revision: yes

  2. Referee: [Abstract] No ablation is reported on the three filtering strategies, so it is impossible to determine whether the DimensionAwareFilter (or any single component) is load-bearing for the +0.15 correlation lift or the DPO improvements.

    Authors: We acknowledge the value of component-wise ablations. We will add a dedicated ablation study (new Section 5.3) that isolates each of the three filters, including the DimensionAwareFilter, and quantifies their individual contributions to both human correlation and downstream DPO task success. revision: yes

  3. Referee: [Abstract] The claim that the DimensionAwareFilter 'provably prevents dimension-level quality masking' is presented as a core technical contribution, yet the abstract supplies neither the formal argument nor the section reference where the proof appears; without it the filtering advantage over baselines cannot be verified.

    Authors: The formal argument and proof appear in Section 3.2 (with full derivation in Appendix B). We will revise the abstract to include an explicit pointer: 'including the novel DimensionAwareFilter that provably prevents dimension-level quality masking (Section 3.2)'. revision: yes

  4. Referee: [Abstract] Because rubric synthesis, per-step scoring, and all three filters are performed by the same LLM, the observed gains risk being artifacts of consistent model-family biases rather than the adaptivity mechanism; explicit cross-family experiments (generation vs. judging) are required to substantiate the central claim.

    Authors: This concern about model-family bias is valid. While the main experiments prioritize reproducibility with a single family, we will add new cross-family experiments in the revision (rubric generation with one family, scoring and filtering with another) to demonstrate that the adaptivity benefits persist independently of consistent model bias. revision: yes
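The control promised in response 4 reduces to a grid: cross rubric-generation and judging model families and compare on- and off-diagonal results. A sketch with placeholder family names; run_condition is whatever end-to-end evaluation the pipeline exposes.

```python
# Sketch of the promised cross-family control. Family names are placeholders,
# not the paper's model list; run_condition is supplied by the experimenter.
from itertools import product
from statistics import mean
from typing import Callable

FAMILIES = ["family_a", "family_b", "family_c"]

def cross_family_audit(
    run_condition: Callable[[str, str], float],  # (generator, judge) -> human r
) -> tuple[float, float]:
    """Return (mean same-family r, mean cross-family r). The adaptivity claim
    survives only if the cross-family mean stays close to the same-family
    mean; a large gap points to shared model-family bias, not adaptivity."""
    results = {(g, j): run_condition(g, j)
               for g, j in product(FAMILIES, FAMILIES)}
    same = mean(r for (g, j), r in results.items() if g == j)
    cross = mean(r for (g, j), r in results.items() if g != j)
    return same, cross
```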

Circularity Check

0 steps flagged

No significant circularity in AdaRubric derivation

full rationale

The paper describes an empirical LLM-based framework for generating task-adaptive rubrics, step-wise scoring, and filtering to produce preference pairs, validated via human correlation metrics (Pearson r = 0.79) and downstream DPO success rates on WebArena, ToolBench, and AgentBench. No derivation chain, equation, or prediction reduces by construction to fitted inputs, self-definitions, or load-bearing self-citations. The DimensionAwareFilter is described as 'provably' preventing masking via its design logic, but this is a stated property of the filter rather than a tautological reduction. Results rest on external human judgments and benchmark outcomes, not internal re-labeling of the same signals, so the claims stand on independent evidence.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

The framework rests on the assumption that LLM-generated rubrics and scores are sufficiently reliable after filtering; no free parameters are explicitly fitted in the abstract, but prompt engineering for rubric generation is implicit.

axioms (2)
  • domain assumption LLM-generated task-specific rubrics are more accurate than fixed rubrics for agent evaluation
    Invoked in the opening motivation and in the claim that adaptive rubrics solve systematic mis-evaluation.
  • ad hoc to paper The DimensionAwareFilter provably prevents dimension-level quality masking
    Stated as a novel contribution without proof details in the abstract.
invented entities (1)
  • DimensionAwareFilter no independent evidence
    purpose: Prevent one high-scoring dimension from masking low quality in other dimensions when creating DPO pairs
    Introduced as a new composable filtering strategy; independent evidence would require the promised proof or ablation showing it outperforms standard filters.

pith-pipeline@v0.9.0 · 5537 in / 1597 out tokens · 38311 ms · 2026-05-15T06:41:59.494416+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

15 extracted references · 15 canonical work pages · 8 internal anchors

  1. [1]

    Training Verifiers to Solve Math Word Problems

    K. Cobbe, V. Kosaraju, M. Bavarian, M. Chen, H. Jun, L. Kaiser, M. Plappert, J. Tworek, J. Hilton, R. Nakano, et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021.

  2. [2]

    The Llama 3 Herd of Models

    A. Grattafiori, A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al-Dahle, A. Leshem, A. Menon, A. Wallingford, A. Wray, et al. The Llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024.

  3. [3]

    B. Hui, J. Yang, Z. Cui, J. Yang, D. Liu, L. Zhang, T. Liu, J. Zhang, B. Yu, K. Lu, et al. Qwen2.5 technical report. arXiv preprint arXiv:2412.15115, 2024.

  4. [4]

    Camels in a Changing Climate: Enhancing LM Adaptation with Tülu 2

    H. Ivison, Y. Wang, V. Pyatkin, N. Lambert, M. Peters, P. Dasigi, J. Jang, D. Wadden, N. A. Smith, I. Beltagy, et al. Camels in a changing climate: Enhancing LM adaptation with Tülu 2. arXiv preprint arXiv:2311.10702, 2023.

  5. [5]

    J. Y. Koh, R. Lo, L. Jang, V. Duvvur, M. C. Lim, P.-Y. Huang, G. Neubig, S. Zhou, R. Salakhutdinov, and D. Fried. VisualWebArena: Evaluating multimodal agents on realistic visual web tasks. arXiv preprint arXiv:2401.13649, 2024.

  6. [6]

    C.-Y. Lin. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out (ACL 2004 Workshop), pages 74–81, 2004.

  7. [7]

    Y. Liu, D. Iter, Y. Xu, S. Wang, R. Xu, and C. Zhu. G-Eval: NLG evaluation using GPT-4 with better human alignment. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 2511–2522, 2023.

  8. [8]

    Q. Lu, L. Ding, S. Cao, X. Liu, K. Zhang, J. Zhang, and D. Tao. Runaway is ashamed, but helpful: On the early-exit behavior of large language model-based agents in embodied environments. In Findings of the Association for Computational Linguistics: EMNLP 2025, pages 24014–24027, 2025.

  9. [9]

    Proximal Policy Optimization Algorithms

    J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.

  10. [10]

    T. Xie, D. Zhang, J. Chen, X. Li, S. Zhao, R. Cao, T. J. Hua, Z. Cheng, D. Shi, F. Liu, et al. OSWorld: Benchmarking multimodal agents for open-ended tasks in real computer environments. arXiv preprint arXiv:2404.07972, 2024.

  11. [11]

    W. Yuan, R. Y. Pang, K. Cho, S. Sukhbaatar, J. Xu, and J. Weston. Self-rewarding language models. arXiv preprint arXiv:2401.10020, 2024.

  12. [12]

    BERTScore: Evaluating Text Generation with BERT

    T. Zhang, V. Kishore, F. Wu, K. Q. Weinberger, and Y. Artzi. BERTScore: Evaluating text generation with BERT. arXiv preprint arXiv:1904.09675, 2019.

  13. [13]

    L. Zhu, X. Wang, and X. Wang. JudgeLM: Fine-tuned large language models are scalable judges. arXiv preprint arXiv:2310.17631, 2023.

  14. [14]

    D. M. Ziegler, N. Stiennon, J. Wu, T. B. Brown, A. Radford, D. Amodei, P. Christiano, and G. Irving. Fine-tuning language models from human preferences. arXiv preprint arXiv:1909.08593, 2019.

  15. [15]

    Given the task below, generate exactly N evaluation dimensions

    From the paper's Appendix A (implementation details), rubric generation prompt. The RUBRIC PROMPT template (abbreviated) instructs the LLM: 'You are an expert evaluator for LLM agent tasks. Given the task below, generate exactly N evaluation dimensions. Each dimension must be: (1) directly relevant to task success, (2) orthogonal to all other dimensions, (3) ac...'