Pith · machine review for the scientific record

arxiv: 2605.12022 · v1 · submitted 2026-05-12 · 💻 cs.CL

Recognition: no theorem link

SAGE: Scalable Automated Robustness Augmentation for LLM Knowledge Evaluation

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 04:51 UTC · model grok-4.3

classification 💻 cs.CL
keywords LLM evaluation · robustness augmentation · automated benchmark generation · reinforcement learning · variant verification · knowledge benchmarks · scalable evaluation · fine-tuned models

The pith

SAGE uses fine-tuned models to build large-scale robust LLM knowledge benchmarks at lower cost than human annotation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces SAGE as a framework to automatically generate and verify variants of knowledge questions that test the same facts in different forms. It trains a smaller verifier model called VariantQual on a limited set of human-labeled examples to judge quality, then uses that verifier as a reward signal to optimize a generator model called VariantGen first through supervised fine-tuning and then reinforcement learning. Experiments show this produces a large augmented benchmark for HellaSwag whose quality matches a human-annotated robust version but at far lower cost. The same fine-tuned models also improve performance on MMLU without any additional training specific to that benchmark.
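The generate-then-verify loop described above can be sketched in a few lines. This is a toy illustration, not the paper's implementation: the real VariantGen and VariantQual are fine-tuned LLMs, the verifier score serves as an RL reward rather than a post-hoc filter, and every function below is an invented stand-in.

```python
def variant_gen(question: str, n: int = 4) -> list[str]:
    """Stand-in generator: emit n surface rewrites of the same fact."""
    templates = [
        "True or false: {q}.",
        "Complete the claim: {q}.",
        "Restated: {q}.",
        "In other words, {q}.",
    ]
    return [t.format(q=question) for t in templates[:n]]

def variant_qual(original: str, variant: str) -> float:
    """Stand-in rubric verifier, scoring in [0, 1]. A real verifier
    judges answer preservation, fluency, and difficulty via a rubric."""
    keeps_content = original.lower() in variant.lower()
    changes_form = variant != original
    return 0.5 * keeps_content + 0.5 * changes_form

def augment(question: str, threshold: float = 0.9) -> list[str]:
    """Keep variants the verifier scores above the threshold. In SAGE
    this score instead serves as the RL reward for the generator."""
    return [v for v in variant_gen(question)
            if variant_qual(question, v) >= threshold]

variants = augment("Water boils at 100 degrees Celsius at sea level")
```

In the paper, the verifier score plays the role this filter plays here, but as a training signal: VariantGen is optimized with RL to produce high-scoring variants in the first place, which is what addresses the low-yield problem of plain generate-then-verify pipelines.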

Core claim

SAGE consists of VariantQual, a rubric-based verifier trained on human-labeled seed data, and VariantGen, a variant generator initialized with supervised fine-tuning and further optimized with reinforcement learning using VariantQual as the reward model. Experiments on HellaSwag show that SAGE constructs a large-scale robustness-augmented benchmark with quality comparable to the human-annotated HellaSwag-Pro at substantially lower cost, while the fine-tuned models further generalize to MMLU without benchmark-specific fine-tuning.

What carries the argument

VariantQual, the rubric-based verifier trained on human seed data that serves as a reward model to guide reinforcement learning of the VariantGen generator for producing high-quality question variants.

If this is right

  • A large-scale robustness-augmented benchmark can be built for HellaSwag with quality matching human-annotated versions.
  • This construction happens at substantially lower cost than manual human annotation.
  • The fine-tuned models generalize their robustness improvements to other benchmarks such as MMLU without benchmark-specific fine-tuning.
  • The pipeline offers a scalable automated route to more reliable knowledge evaluation tests for LLMs.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same verifier-guided generation process could be applied to augment benchmarks in domains other than knowledge evaluation.
  • Generalization to MMLU suggests the training teaches transferable handling of question variations rather than benchmark-specific tricks.
  • Ongoing application of the method could support dynamic updates to benchmarks as new model behaviors emerge.
  • Testing the full pipeline on additional knowledge benchmarks would confirm whether the cost and quality gains hold more broadly.

Load-bearing premise

The verifier trained on limited human seed data will keep giving accurate and unbiased quality judgments when used at large scale to create the full benchmark.
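One way to test that premise before scaling up is to hold out part of the human-labeled seed data and measure the verifier's precision and recall against it. A minimal sketch with invented toy labels (the paper's actual evaluation protocol is not reproduced here):

```python
def precision_recall(human_labels, verifier_accepts):
    """Parallel lists of bools: True = variant judged acceptable.
    Precision: of what the verifier accepts, how much humans accept.
    Recall: of what humans accept, how much the verifier recovers."""
    pairs = list(zip(human_labels, verifier_accepts))
    tp = sum(h and v for h, v in pairs)
    fp = sum((not h) and v for h, v in pairs)
    fn = sum(h and (not v) for h, v in pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Invented labels standing in for held-out human seed data.
human    = [True, True, False, True, False, True]
verifier = [True, True, True,  True, False, False]
p, r = precision_recall(human, verifier)  # p = 0.75, r = 0.75
```

Low precision on held-out data would mean the generated benchmark is contaminated with bad variants at scale; low recall would mean cost savings are overstated because usable variants are being discarded.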

What would settle it

If the automatically generated benchmark shows different patterns of model performance drops on variants than the human-annotated version, or if the fine-tuned models fail to generalize to MMLU, the central claims would not hold.

Figures

Figures reproduced from arXiv: 2605.12022 by Dayiheng Liu, Fuli Feng, Keqin Bao, Moxin Li, Rui Men, Wenjie Wang, Xiaoyuan Li, Yichang Zhang, Yuzhe Wang.

Figure 1. Overview of the SAGE framework. SAGE consists of three stages: SFT of VariantGen and [PITH_FULL_IMAGE:figures/full_fig_p002_1.png]
Figure 2. Normalized per-variant-type contribution to RLA across LLMs. [PITH_FULL_IMAGE:figures/full_fig_p008_2.png]
Figure 3. Case study of SAGE-generated variants from MMLU. The original question and 7 generated [PITH_FULL_IMAGE:figures/full_fig_p009_3.png]
Figure 4. Human verification accuracy of SAGE-generated variants. [PITH_FULL_IMAGE:figures/full_fig_p014_4.png]
Figure 5. VariantGen base prompt template. The same template is used for all variant types. [PITH_FULL_IMAGE:figures/full_fig_p016_5.png]
Figure 6. Perturbation instructions for VariantGen. Causal Inference is shown as an example. [PITH_FULL_IMAGE:figures/full_fig_p017_6.png]
Figure 7. VariantQual implicit rubric prompt template. [PITH_FULL_IMAGE:figures/full_fig_p018_7.png]
Figure 8. VariantQual implicit explanation prompt template. [PITH_FULL_IMAGE:figures/full_fig_p019_8.png]
Figure 9. VariantQual explicit rubric prompt template. [PITH_FULL_IMAGE:figures/full_fig_p020_9.png]
original abstract

Large Language Models (LLMs) achieve strong performance on standard knowledge evaluation benchmarks, yet recent work shows that their knowledge capabilities remain brittle under question variants that test the same knowledge in different forms. Robustness augmentation of existing knowledge evaluation benchmarks is therefore necessary, but current LLM-assisted generate-then-verify pipelines are costly and difficult to scale due to low-yield variant generation and unreliable variant verification. We propose SAGE (Scalable Automated Generation of Robustness BEnchmarks), a framework for scalable robustness augmentation of knowledge evaluation benchmarks using fine-tuned smaller models. SAGE consists of VariantQual, a rubric-based verifier trained on human-labeled seed data, and VariantGen, a variant generator initialized with supervised fine-tuning and further optimized with reinforcement learning using VariantQual as the reward model. Experiments on HellaSwag show that SAGE constructs a large-scale robustness-augmented benchmark with quality comparable to the human-annotated HellaSwag-Pro at substantially lower cost, while the fine-tuned models further generalize to MMLU without benchmark-specific fine-tuning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces SAGE, a framework for scalable automated robustness augmentation of LLM knowledge evaluation benchmarks. SAGE consists of VariantQual, a rubric-based verifier trained on limited human-labeled seed data, and VariantGen, a variant generator initialized via supervised fine-tuning and further optimized via reinforcement learning using VariantQual as the reward model. Experiments on HellaSwag claim that SAGE produces a large-scale robustness-augmented benchmark whose quality is comparable to the human-annotated HellaSwag-Pro at substantially lower cost, while the fine-tuned models generalize to MMLU without benchmark-specific fine-tuning.

Significance. If the quality-comparability claim is substantiated, the work would be significant for enabling cost-effective construction of large robust evaluation sets that address brittleness in LLM knowledge assessment. The approach of using a learned verifier as an RL reward for generation is a practical scaling strategy, though its validity hinges on the verifier's reliability.

major comments (2)
  1. [Experiments on HellaSwag] The headline claim that the SAGE-augmented benchmark has quality comparable to human-annotated HellaSwag-Pro is load-bearing and rests on VariantQual providing accurate, unbiased judgments at scale. The manuscript must report concrete human evaluation results on the final generated set (e.g., agreement rates with human annotators, precision/recall of VariantQual on held-out seed data, and side-by-side quality ratings versus HellaSwag-Pro), including statistical details and baselines; without these the central claim cannot be assessed.
  2. [VariantGen optimization] Using VariantQual as the RL reward for VariantGen creates a risk of reward hacking, where generated variants achieve high verifier scores yet fail human quality standards. The paper should include targeted analysis (e.g., human judgments on a sample of high-reward variants, ablation of the RL stage, or detection of systematic blind spots in VariantQual) to rule out this failure mode; otherwise the cost-saving and generalization claims are undermined.
minor comments (2)
  1. [Abstract] The abstract states experimental outcomes but omits all quantitative metrics, cost figures, and quality-comparison details; adding these would strengthen the summary.
  2. Clarify the exact rubric used by VariantQual and the size/composition of the human-labeled seed data to allow reproducibility assessment.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

Thank you for the detailed and constructive review of our manuscript. We appreciate the referee's emphasis on strengthening the empirical validation of our claims. We address each major comment below and have updated the manuscript accordingly.

point-by-point responses
  1. Referee: [Experiments on HellaSwag] The headline claim that the SAGE-augmented benchmark has quality comparable to human-annotated HellaSwag-Pro is load-bearing and rests on VariantQual providing accurate, unbiased judgments at scale. The manuscript must report concrete human evaluation results on the final generated set (e.g., agreement rates with human annotators, precision/recall of VariantQual on held-out seed data, and side-by-side quality ratings versus HellaSwag-Pro), including statistical details and baselines; without these the central claim cannot be assessed.

    Authors: We agree that the central claim requires concrete human evaluation results on the final generated set to be properly assessed. In the revised manuscript, we have added a new human evaluation subsection reporting agreement rates with human annotators, precision and recall of VariantQual on held-out seed data, side-by-side quality ratings versus HellaSwag-Pro, including statistical details and baselines. These results support the quality comparability of the SAGE-augmented benchmark to HellaSwag-Pro. revision: yes

  2. Referee: [VariantGen optimization] Using VariantQual as the RL reward for VariantGen creates a risk of reward hacking, where generated variants achieve high verifier scores yet fail human quality standards. The paper should include targeted analysis (e.g., human judgments on a sample of high-reward variants, ablation of the RL stage, or detection of systematic blind spots in VariantQual) to rule out this failure mode; otherwise the cost-saving and generalization claims are undermined.

    Authors: We recognize the potential issue of reward hacking in the RL optimization of VariantGen. To address this, the revised manuscript now includes targeted analyses: human judgments on samples of high-reward variants, an ablation study of the RL stage, and an examination for systematic blind spots in VariantQual. These additions demonstrate that the optimization does not lead to the described failure mode and bolster the cost-saving and generalization claims. revision: yes
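One concrete form the reward-hacking analysis could take is measuring human acceptance among the highest-reward variants, where hacked outputs would concentrate. The scores, verdicts, and helper below are invented for illustration:

```python
def top_slice_agreement(rewards, human_ok, k=3):
    """rewards: verifier scores per variant; human_ok: parallel human
    verdicts. Returns the human acceptance rate among the k variants
    the verifier rewarded most highly."""
    ranked = sorted(zip(rewards, human_ok), key=lambda t: -t[0])
    top = ranked[:k]
    return sum(ok for _, ok in top) / len(top)

# Invented data; reward hacking would show up as a human acceptance
# rate on the top slice well below the overall acceptance rate.
rewards  = [0.99, 0.97, 0.95, 0.60, 0.40]
human_ok = [True, False, True, True, True]
rate = top_slice_agreement(rewards, human_ok)  # 2 of top-3 pass
```

A top-slice rate near the overall human acceptance rate is consistent with the verifier rewarding genuine quality; a sharp drop in the top slice is the signature of a generator exploiting verifier blind spots.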

Circularity Check

0 steps flagged

No significant circularity; derivation grounded in external human data

full rationale

The paper trains VariantQual on external human-labeled seed data and validates the generated benchmark's quality by direct comparison to the independently human-annotated HellaSwag-Pro. RL optimization of VariantGen uses VariantQual as reward, but the final claims rest on external benchmarks (HellaSwag-Pro, MMLU) rather than self-referential definitions or fitted parameters renamed as predictions. No self-citations, uniqueness theorems, or ansatzes are invoked as load-bearing steps. The chain is self-contained against external human judgments.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

The central claim rests on the effectiveness of two newly introduced model components trained from limited human seed data; the approach assumes the learned verifier generalizes reliably as a reward signal without introducing unmeasured biases.

axioms (1)
  • domain assumption Human-labeled seed data provides sufficient and unbiased ground truth for training a scalable variant quality verifier
    Invoked when training VariantQual on seed examples to serve as reward model.
invented entities (2)
  • VariantQual no independent evidence
    purpose: Rubric-based verifier model that judges quality of generated question variants
    New component introduced to replace unreliable LLM verification; no independent evidence provided beyond training on seed data.
  • VariantGen no independent evidence
    purpose: Generator model that produces robustness variants, optimized via RL using the verifier
    New component introduced to scale variant generation; effectiveness claimed via experiments but not independently verified.

pith-pipeline@v0.9.0 · 5504 in / 1423 out tokens · 44571 ms · 2026-05-13T04:51:36.259923+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

46 extracted references · 46 canonical work pages · 8 internal anchors
