Recognition: unknown
Bootstrapping Post-training Signals for Open-ended Tasks via Rubric-based Self-play on Pre-training Text
Pith reviewed 2026-05-10 01:45 UTC · model grok-4.3
The pith
Rubric-based self-play on pretraining text creates training signals that improve LLM performance on open-ended tasks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper claims that synthesizing evaluation rubrics together with each input-output pair, while anchoring generation in a content-rich pretraining corpus, produces reliable training signals for post-training on open-ended tasks. On the Qwen-2.5-7B model, this yields measurable gains for both the base pretrained checkpoint and its instruction-tuned version across long-form healthcare QA, creative writing, and instruction following.
What carries the argument
The POP framework, in which the same LLM generates input-output pairs plus rubrics from pretraining text and then uses the rubrics to score and reinforce its own outputs.
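A minimal sketch of that loop may help fix ideas. All names here (sample_passage, score_with_rubric, rl_update) and the prompt wording are illustrative assumptions, not the paper's actual interfaces:

```python
# Hypothetical sketch of one rubric-based self-play round, assuming `llm` is a
# text-in/text-out callable and `rl_update` applies a policy-gradient step.
import random

def sample_passage(corpus: list[str]) -> str:
    """Ground the round in a snippet of pretraining text."""
    return random.choice(corpus)

def score_with_rubric(llm, question: str, answer: str, rubric: str) -> float:
    """Ask the same model to grade the answer against the synthesized rubric."""
    verdict = llm(
        "Score the answer against the rubric. Reply with a single number in [0, 1].\n"
        f"Question: {question}\nAnswer: {answer}\nRubric: {rubric}"
    )
    try:
        return min(max(float(verdict.strip()), 0.0), 1.0)
    except ValueError:
        return 0.0

def pop_self_play_round(llm, corpus: list[str], rl_update) -> None:
    passage = sample_passage(corpus)

    # Proposer role: synthesize a task input and a rubric, both anchored in the passage.
    question = llm(f"Pose one open-ended question grounded in this text:\n{passage}")
    rubric = llm(
        "Write evaluation criteria for the question below, grounded in the same text.\n"
        f"Question: {question}\nText: {passage}"
    )

    # Solver role: the same model answers its own question.
    answer = llm(question)

    # The rubric supplies the reward that drives the RL update.
    reward = score_with_rubric(llm, question, answer, rubric)
    rl_update(question, answer, reward)
```

In this sketch, anchoring both the question and the rubric in the sampled passage is what ties the reward to external text rather than purely to the model's own preferences.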
If this is right
- Performance rises on open-ended tasks including healthcare QA, creative writing, and instruction following.
- Both the raw pretrained model and the already-tuned model benefit from the same self-play procedure.
- Grounding generation in pretraining text reduces reward hacking and prevents mode collapse.
- Self-play becomes usable for realistic tasks that lack easy objective verification.
Where Pith is reading between the lines
- The same rubric-generation step could be applied to other generative models that possess large pretraining corpora.
- Combining rubric self-play with existing preference optimization techniques might further stabilize training.
- Diversity in outputs could be preserved longer than in standard RL if rubric criteria explicitly reward varied valid responses.
Load-bearing premise
LLM-synthesized rubrics supply consistent, non-hackable evaluation criteria for open-ended outputs when generation is anchored in pretraining text.
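Read concretely, the premise is that each rubric decomposes into criteria and that per-criterion judge scores aggregate into a stable reward. A minimal sketch of that aggregation, assuming a simple criterion record and equal weighting (the paper's exact rubric schema is not reproduced here):

```python
# Hypothetical rubric schema and reward aggregation; the field names and the
# uniform weighting are assumptions, not the paper's exact format.
from dataclasses import dataclass

@dataclass
class Criterion:
    name: str
    description: str
    gold: str = "Not applicable"  # concise gold-standard answer, when one exists

def rubric_reward(judge, question: str, answer: str, criteria: list[Criterion]) -> float:
    """Average per-criterion judge scores (each in [0, 1]) into a single reward."""
    if not criteria:
        return 0.0
    total = 0.0
    for c in criteria:
        prompt = (
            f"Question: {question}\nAnswer: {answer}\n"
            f"Criterion: {c.name}. {c.description}\nGold: {c.gold}\n"
            "Score the answer on this criterion only. Reply with a number in [0, 1]."
        )
        try:
            total += min(max(float(judge(prompt).strip()), 0.0), 1.0)
        except ValueError:
            pass  # treat unparseable verdicts as a score of 0
    return total / len(criteria)
```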
What would settle it
No improvement, or clear degradation, in task performance after applying the method compared with the base or instruction-tuned model would falsify the central claim.
Original abstract
Self-play has recently emerged as a promising paradigm for post-training Large Language Models (LLMs). In self-play, the target LLM creates the task input (e.g., a question), which it then addresses itself by producing a task output (e.g., an answer). A reward model evaluates the output, and the rewards are used to train the LLM, typically via Reinforcement Learning (RL). A key benefit of self-play for post-training LLMs is its minimal supervision costs: self-play avoids the need for high-quality input-output pairs traditionally constructed by humans or expensive proprietary models. Existing work, however, explores self-play only for verifiable tasks, such as math and coding, for which objective ground truth is available and easily checkable. In this paper, we seek to extend self-play to more realistic open-ended tasks. We propose POP, a self-play framework that uses the same LLM to synthesize evaluation rubrics along with each input-output pair. The rubric is used to evaluate outputs and train the model. Crucially, we ground the framework on a content-rich pretraining corpus to (1) enable an exploitable generation-verification gap and reduce reward hacking, and (2) prevent mode collapse. On Qwen-2.5-7B, POP increases performance of both the pretrained base model and instruction-tuned model on multiple tasks ranging from long-form healthcare QA to creative writing and instruction following.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes POP, a self-play framework for post-training LLMs on open-ended tasks. The LLM generates input-output pairs grounded in a pretraining corpus along with synthesized evaluation rubrics; the rubrics supply rewards for RL training. The pretraining grounding is intended to create a generation-verification gap that reduces reward hacking and mode collapse. The authors claim that POP improves performance of both the base and instruction-tuned Qwen-2.5-7B on tasks including long-form healthcare QA, creative writing, and instruction following.
Significance. If the empirical gains are robust and the rubric signals prove reliable, the work would be significant for extending self-play beyond verifiable domains with minimal additional supervision. The pretraining-text grounding offers a concrete mechanism to address reward hacking and collapse, which are persistent obstacles in RL post-training of LLMs.
major comments (2)
- Abstract: The central claim that POP increases performance on Qwen-2.5-7B across multiple open-ended tasks is asserted without any quantitative metrics, ablation results, standard deviations, or baseline comparisons. No details are supplied on how the generation-verification gap is measured or on rubric quality, leaving the empirical contribution unsupported by visible evidence.
- Framework and evaluation sections: The claim that LLM-synthesized rubrics plus pretraining grounding produce a usable, non-hackable generation-verification gap is load-bearing for extending self-play to subjective tasks. No independent checks (human rubric validation, adversarial rubric attacks, or comparison to fixed external rubrics) are reported to confirm that the rubric scoring criteria are not implicitly aligned with the model's own distribution, especially for creative writing and long-form QA.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback on our manuscript. The comments highlight important aspects of presentation and empirical support that we will address in the revision. We respond point-by-point to the major comments below.
Point-by-point responses
-
Referee: Abstract: The central claim that POP increases performance on Qwen-2.5-7B across multiple open-ended tasks is asserted without any quantitative metrics, ablation results, standard deviations, or baseline comparisons. No details are supplied on how the generation-verification gap is measured or on rubric quality, leaving the empirical contribution unsupported by visible evidence.
Authors: We agree that the abstract, as a concise summary, does not include specific numerical results or details on the gap and rubric assessment. The full manuscript presents these quantitative evaluations, ablations, baseline comparisons, and standard deviations from repeated runs in the Experiments section, along with the operationalization of the generation-verification gap through pretraining-text grounding in the Framework section. To make the empirical contribution more immediately apparent, we will revise the abstract to incorporate key performance highlights, baseline comparisons, and a brief description of how the gap is created and how rubric quality is assessed via consistency measures. revision: yes
-
Referee: Framework and evaluation sections: The claim that LLM-synthesized rubrics plus pretraining grounding produce a usable, non-hackable generation-verification gap is load-bearing for extending self-play to subjective tasks. No independent checks (human rubric validation, adversarial rubric attacks, or comparison to fixed external rubrics) are reported to confirm that the rubric scoring criteria are not implicitly aligned with the model's own distribution, especially for creative writing and long-form QA.
Authors: The pretraining grounding is intended to establish the generation-verification gap by anchoring both the synthesized inputs/outputs and rubrics in content from the pretraining corpus, which the model cannot arbitrarily alter without deviating from the grounded distribution; this mechanism is detailed in Section 3 and supported by the observed reductions in mode collapse in our experiments. We acknowledge that the manuscript does not report independent checks such as human rubric validation or adversarial attacks. We will revise the paper to explicitly discuss this as a limitation, provide additional theoretical justification for why the grounding reduces implicit alignment risks, and include any feasible preliminary consistency analyses in the revision. revision: partial
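The "preliminary consistency analyses" mentioned above are not specified; one plausible form is to re-grade the same output several times and report the spread of the rubric scores. A hedged sketch under that assumption:

```python
# Hypothetical consistency probe for rubric-based scoring: repeat the grading of a
# fixed (question, answer) pair and report mean and spread. The specific statistic
# is an assumption; the rebuttal only mentions consistency analyses in general.
import statistics
from typing import Callable

def scoring_consistency(
    score_fn: Callable[[str, str], float],  # e.g., a rubric-based judge wrapped as (q, a) -> score
    question: str,
    answer: str,
    n_trials: int = 5,
) -> tuple[float, float]:
    """Return (mean, population std dev) of repeated rubric scores for one answer."""
    scores = [score_fn(question, answer) for _ in range(n_trials)]
    return statistics.mean(scores), statistics.pstdev(scores)
```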
Circularity Check
No significant circularity; empirical framework with external grounding
Full rationale
The paper introduces POP as a new self-play method that synthesizes rubrics from the LLM and grounds generation on an external pretraining corpus. Performance gains are reported via experiments on Qwen-2.5-7B across tasks, not via any mathematical derivation or parameter fitting that reduces to the inputs by construction. No equations, self-definitional loops, or load-bearing self-citations appear in the provided text; the generation-verification gap is posited as a design choice supported by the corpus rather than assumed tautologically. This is a standard empirical proposal with no reduction of claims to their own definitions.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: LLM-generated rubrics can serve as reliable proxies for evaluating open-ended outputs without introducing systematic bias or reward hacking.
Reference graph
Works this paper leans on
- [1] Rahul K. Arora, Jason Wei, Rebecca Soskin Hicks, Preston Bowman, Joaquin Quiñonero-Candela, Foivos Tsimpourlas, Michael Sharman, Meghan Shah, Andrea Vallone, Alex Beutel, Johannes Heidecke, and Karan Singhal. HealthBench: Evaluating large language models towards improved human health. arXiv preprint arXiv:2505.08775, 2025.
- [3]
- [4] Pengyu Cheng, Tianhao Hu, Han Xu, Zhisong Zhang, Zheng Yuan, Yong Dai, Lei Han, Nan Du, and Xiaolong Li. Self-playing adversarial language game enhances LLM reasoning. arXiv preprint arXiv:2404.10642, 2024.
- [5] Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Yunxuan Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, Albert Webson, Shixiang Shane Gu, Zhuyun Dai, Mirac Suzgun, Xinyun Chen, Aakanksha Chowdhery, Alex Castro-Ros, Marie Pellat, Kevin Robinson, Dasha Valter, Sharan Narang, Gaurav Mishra, Adams Yu, Vincent Zhao, Yanping H... Scaling instruction-finetuned language models. arXiv, 2022.
- [6] Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021.
- [7] DeepSeek-AI. DeepSeek-V3.2: Pushing the frontier of open large language models. arXiv preprint arXiv:2512.02556, 2025.
- [8] Jacob Dineen, Aswin Rrv, Qin Liu, Zhikun Xu, Xiao Ye, Ming Shen, Zhaonan Li, Shijie Lu, Chitta Baral, Muhao Chen, and Ben Zhou. QA-LIGN: Aligning LLMs through constitutionally decomposed QA. In Findings of the Association for Computational Linguistics: EMNLP 2025, 2025.
- [9] Aaron Gokaslan, Vanya Cohen, Ellie Pavlick, and Stefanie Tellex. OpenWebText corpus. http://Skylion007.github.io/OpenWebTextCorpus, 2019.
- [10] Anisha Gunjal, Anthony Wang, Elaine Lau, Vaskar Nath, Yunzhong He, Bing Liu, and Sean Hendryx. Rubrics as rewards: Reinforcement learning beyond verifiable domains. arXiv preprint arXiv:2507.17746, 2025.
- [11] Nathan Habib, Clémentine Fourrier, Hynek Kydlíček, Thomas Wolf, and Lewis Tunstall. LightEval: A lightweight framework for LLM evaluation, 2023. URL https://github.com/huggingface/lighteval.
- [12] Yun He, Wenzhe Li, Hejia Zhang, Songlin Li, Karishma Mandyam, Sopan Khosla, Yuanhao Xiong, Nanshu Wang, Xiaoliang Peng, Beibin Li, Shengjie Bi, Shishir G. Patil, Qi Qi, Shengyu Feng, Julian Katz-Samuels, Richard Yuanzhe Pang, Sujan Gonugondla, Hunter Lang, Yue Yu, Yundi Qian, Maryam Fazel-Zarandi, Licheng Yu, Amine Benhalloum, Hany Awadalla, and Manaal Fa...
- [13] Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the MATH dataset. In Advances in Neural Information Processing Systems, 2021.
- [14] Chengsong Huang, Wenhao Yu, Xiaoyang Wang, Hongming Zhang, Zongxia Li, Ruosen Li, Jiaxin Huang, Haitao Mi, and Dong Yu. R-Zero: Self-evolving reasoning LLM from zero data. arXiv preprint arXiv:2508.05004, 2025.
- [15] Chengyu Huang and Tanya Goyal. DCRM: A heuristic to measure response pair quality in preference optimization. In Findings of the Association for Computational Linguistics: EMNLP, 2025.
- [16] Jiaxin Huang, Shixiang Gu, Le Hou, Yuexin Wu, Xuezhi Wang, Hongkun Yu, and Jiawei Han. Large language models can self-improve. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, 2023.
- [17] Zenan Huang, Yihong Zhuang, Guoshan Lu, Zeyu Qin, Haokai Xu, Tianyu Zhao, Ru Peng, Jiaqi Hu, Zhanming Shen, Xiaomeng Hu, Xijun Gu, Peiyi Tu, Jiaxin Liu, Wenyu Chen, Yuzhuo Fu, Zhiting Fan, Yanmei Gu, Yuanyuan Wang, Zhengkai Yang, Jianguo Li, and Junbo Zhao. Reinforcement learning with rubric anchors. arXiv preprint arXiv:2508.12790, 2025.
- [18] Mandar Joshi, Eunsol Choi, Daniel Weld, and Luke Zettlemoyer. TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension. arXiv preprint arXiv:1705.03551, 2017.
- [19] Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Jacob Devlin, Kenton Lee, Kristina Toutanova, Llion Jones, Matthew Kelcey, Ming-Wei Chang, Andrew M. Dai, Jakob Uszkoreit, Quoc Le, and Slav Petrov. Natural Questions: A benchmark for question answering research. Transact..., 2019.
- [20] Stephanie Lin, Jacob Hilton, and Owain Evans. TruthfulQA: Measuring how models mimic human falsehoods. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2022.
- [21] Bo Liu, Chuanyang Jin, Seungone Kim, Weizhe Yuan, Wenting Zhao, Ilia Kulikov, Xian Li, Sainbayar Sukhbaatar, Jack Lanchantin, and Jason Weston. SPICE: Self-play in corpus environments improves reasoning. arXiv preprint arXiv:2510.24684, 2025.
- [22] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In Proceedings of International Conference on Learning Representations, 2019.
- [23] Thang Luong and Edward Lockhart. Advanced version of Gemini with Deep Think officially achieves gold-medal standard at the International Mathematical Olympiad, 2025.
- [24] Svetlana Maslenkova, Clement Christophe, Marco AF Pimentel, Tathagata Raha, Muhammad Umar Salman, Ahmed Al Mahrooqi, Avani Gupta, Shadab Khan, Ronnie Rajan, and Praveenkumar Kanithi. Building trust in clinical LLMs: Bias analysis and dataset transparency. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, 2025.
- [25] Samuel J Paech. EQ-Bench creative writing benchmark v3. https://github.com/EQ-bench/creative-writing-bench, 2025.
- [26] Ankit Pal, Logesh Kumar Umapathi, and Malaikannan Sankarasubbu. MedMCQA: A large-scale multi-subject multi-choice dataset for medical domain question answering. In Gerardo Flores, George H Chen, Tom Pollard, Joyce C Ho, and Tristan Naumann, editors, Proceedings of the Conference on Health, Inference, and Learning, volume 174 of Proceedings of Machine Learni..., 2022.
- [27] Aske Plaat, Max van Duijn, Niki van Stein, Mike Preuss, Peter van der Putten, and Kees Joost Batenburg. Agentic large language models: A survey. arXiv preprint arXiv:2503.23037, 2025.
- [28] Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D. Manning, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. arXiv preprint arXiv:2305.18290, 2023.
- [29] David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R. Bowman. GPQA: A graduate-level Google-proof Q&A benchmark. In First Conference on Language Modeling, 2024. URL https://openreview.net/forum?id=Ti67584b98.
- [30] Yi Ren and Danica J. Sutherland. Learning dynamics of LLM finetuning. In Proceedings of International Conference on Learning Representations, 2025.
- [31] Databricks AI Research. KARL: Knowledge agents via reinforcement learning. arXiv preprint arXiv:2603.05218, 2026.
- [32] MohammadHossein Rezaei, Robert Vacareanu, Zihao Wang, Clinton Wang, Bing Liu, Yunzhong He, and Afra Feyza Akyürek. Online rubrics elicitation from pairwise comparisons. arXiv preprint arXiv:2510.07284, 2025.
- [33] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
- [34] Rulin Shao, Akari Asai, Shannon Zejiang Shen, Hamish Ivison, Varsha Kishore, Jingming Zhuo, Xinran Zhao, Molly Park, Samuel G. Finlayson, David Sontag, Tyler Murray, Sewon Min, Pradeep Dasigi, Luca Soldaini, Faeze Brahman, Wen tau Yih, Tongshuang Wu, Luke Zettlemoyer, Yoon Kim, Hannaneh Hajishirzi, and Pang Wei Koh. Dr Tulu: Reinforcement learning with ev...
- [35] Harman Singh, Xiuyu Li, Kusha Sareen, Monishwaran Maheswaran, Sijun Tan, Xiaoxia Wu, Junxiong Wang, Alpay Ariyak, Qingyang Wu, Samir Khaki, Rishabh Tiwari, Long Lian, Yucheng Lu, Boyi Li, Alane Suhr, Ben Athiwaratkun, and Kurt Keutzer. V1: Unifying generation and self-verification for parallel reasoners. arXiv preprint arXiv:2603.04304, 2026.
- [36] Skelebor. Book titles and abstracts. https://huggingface.co/datasets/Skelebor/book_titles_and_descriptions, 2022.
- [37] Yuda Song, Hanlin Zhang, Carson Eisenach, Sham Kakade, Dean Foster, and Udaya Ghai. Mind the gap: Examining the self-improvement capabilities of large language models. In Proceedings of International Conference on Learning Representations, 2025.
- [38] Yunhao Tang, Daniel Zhaohan Guo, Zeyu Zheng, Daniele Calandriello, Yuan Cao, Eugene Tarassov, Rémi Munos, Bernardo Ávila Pires, Michal Valko, Will Dabney, and Yong Cheng. Understanding the performance gap between online and offline alignment algorithms. arXiv preprint arXiv:2405.08448, 2024.
- [39] OpenAI Team. GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
- [40] Qwen Team. Qwen2.5 technical report. arXiv preprint arXiv:2412.15115, 2024.
- [41] Pablo Villalobos, Anson Ho, Jaime Sevilla, Tamay Besiroglu, Lennart Heim, and Marius Hobbhahn. Will we run out of data? Limits of LLM scaling based on human-generated data. In Proceedings of the 41st International Conference on Machine Learning, 2024.
- [42] Vijay Viswanathan, Yanchao Sun, Shuang Ma, Xiang Kong, Meng Cao, Graham Neubig, and Tongshuang Wu. Checklists are better than reward models for aligning language models. In Advances in Neural Information Processing Systems, 2025.
- [43] Yubo Wang, Xueguang Ma, Ge Zhang, Yuansheng Ni, Abhranil Chandra, Shiguang Guo, Weiming Ren, Aaran Arulraj, Xuan He, Ziyan Jiang, Tianle Li, Max Ku, Kai Wang, Alex Zhuang, Rongqi Fan, Xiang Yue, and Wenhu Chen. MMLU-Pro: A more robust and challenging multi-task language understanding benchmark. In Advances in Neural Information Processing Systems, Datasets..., 2024.
- [44] Weizhe Yuan, Richard Yuanzhe Pang, Kyunghyun Cho, Xian Li, Sainbayar Sukhbaatar, Jing Xu, and Jason Weston. Self-rewarding language models. In Proceedings of the 41st International Conference on Machine Learning, 2024.
- [45] Zhengxin Zhang, Chengyu Huang, Aochong Oliver Li, and Claire Cardie. Better LLM reasoning via dual-play. arXiv preprint arXiv:2511.11881, 2025.
- [46] Andrew Zhao, Yiran Wu, Yang Yue, Tong Wu, Quentin Xu, Yang Yue, Matthieu Lin, Shenzhi Wang, Qingyun Wu, Zilong Zheng, and Gao Huang. Absolute Zero: Reinforced self-play reasoning with zero data. arXiv preprint arXiv:2505.03335, 2025.
- [47] Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. In Advances in Neural Information Processing Systems, Datasets and Benchmarks Track, 2023.
- [48] Jeffrey Zhou, Tianjian Lu, Swaroop Mishra, Siddhartha Brahma, Sujoy Basu, Yi Luan, Denny Zhou, and Le Hou. Instruction-following evaluation for large language models. arXiv preprint arXiv:2311.07911, 2023.
- [49] Yang Zhou, Sunzhu Li, Shunyu Liu, Wenkai Fang, Kongcheng Zhang, Jiale Zhao, Jingwen Yang, Yihe Zhou, Jianwei Lv, Tongya Zheng, Hengtong Lu, Wei Chen, Yan Xie, and Mingli Song. Breaking the exploration bottleneck: Rubric-scaffolded reinforcement learning for general LLM reasoning. arXiv preprint arXiv:2508.16949, 2025.
- [50] Yifei Zhou, Sergey Levine, Jason Weston, Xian Li, and Sainbayar Sukhbaatar. Self-challenging language model agents. arXiv preprint arXiv:2506.01716, 2025.