DR Tulu: Reinforcement Learning with Evolving Rubrics for Deep Research

Rulin Shao, Akari Asai, Shannon Zejiang Shen, Hamish Ivison, Varsha Kishore, Jingming Zhuo, Xinran Zhao, Molly Park, Samuel G. Finlayson, David Sontag, Tyler Murray, Sewon Min, Pradeep Dasigi, Luca Soldaini, Faeze Brahman, Wen-tau Yih, Tongshuang Wu, Luke Zettlemoyer, Yoon Kim, Hannaneh Hajishirzi, Pang Wei Koh

arxiv: 2511.19399 · v3 · pith:3JTHBD5Mnew · submitted 2025-11-24 · 💻 cs.CL · cs.AI · cs.LG

Pith reviewed 2026-05-21 18:08 UTC · model grok-4.3

keywords reinforcement learning · deep research agents · evolving rubrics · long-form research · open models · RLER · multi-step reasoning · fact checking

The pith

Reinforcement learning with co-evolving rubrics trains an open 8B model for long-form deep research that matches proprietary agents.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows how to move reinforcement learning beyond short verifiable questions to realistic open-ended research by letting evaluation rubrics grow and update alongside the model during training. Rubrics incorporate fresh search results and contrasts between model outputs so they stay relevant and give sharper feedback on factual accuracy and quality. This produces DR Tulu-8B, the first fully open model trained end-to-end for multi-step research, which beats other open agents and equals or exceeds closed systems on science, healthcare, and general benchmarks. Readers would care because the method delivers strong performance in a model that is far smaller and cheaper to query than current proprietary alternatives.

Core claim

The central claim is that rubrics constructed from search results and contrasting model responses can be maintained so that they co-evolve with the policy model, incorporating newly explored information to enable better fact checking and more discriminative on-policy feedback. Using this Reinforcement Learning with Evolving Rubrics (RLER) approach, the authors train Deep Research Tulu (DR Tulu-8B). Across four long-form benchmarks it outperforms existing open deep research agents by 15.6% on average (measured against Tongyi DR) and matches or exceeds proprietary agents, with a 0.7% average margin over OpenAI DR, while remaining significantly smaller and 1000x cheaper per query than OpenAI DR.

What carries the argument

Reinforcement Learning with Evolving Rubrics (RLER), a training loop in which rubrics are constructed and updated during policy training to absorb new search information and response contrasts for ongoing feedback.
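
To make the loop concrete, here is a minimal, illustrative Python sketch; it is not the authors' implementation. The names `Rubric`, `judge`, `evolve_rubrics`, and `rler_step` are hypothetical, the keyword-matching judge is a stand-in for what would presumably be a model-based judge, and the policy update itself is left out. It only shows the shape of the idea: sample on-policy responses, propose candidate rubrics from them, keep the candidates that actually separate the responses, and score the responses with the updated rubric set.

```python
from dataclasses import dataclass
from statistics import pstdev
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class Rubric:
    keyword: str   # toy criterion: the response should mention this keyword
    weight: float  # positive for desirable traits, negative for flaws


def judge(rubric: Rubric, response: str) -> float:
    """Toy judge (assumption): 1.0 if the keyword appears, else 0.0."""
    return 1.0 if rubric.keyword.lower() in response.lower() else 0.0


def reward(rubrics: List[Rubric], response: str) -> float:
    """Weighted sum of rubric judgements for one response."""
    return sum((r.weight * judge(r, response) for r in rubrics), 0.0)


def evolve_rubrics(rubrics: List[Rubric], candidates: List[Rubric],
                   responses: List[str]) -> List[Rubric]:
    """Add candidates that separate the current responses; skip duplicates."""
    kept = list(rubrics)
    for cand in candidates:
        scores = [judge(cand, resp) for resp in responses]
        # A rubric met (or missed) by every response carries no signal.
        discriminative = len(scores) > 1 and pstdev(scores) > 0.0
        if discriminative and cand not in kept:
            kept.append(cand)
    return kept


def rler_step(
    question: str,
    rubrics: List[Rubric],
    sample: Callable[[str], List[str]],
    propose: Callable[[str, List[str], List[Rubric]], List[Rubric]],
) -> Tuple[List[Rubric], List[float]]:
    responses = sample(question)                        # on-policy rollouts
    candidates = propose(question, responses, rubrics)  # from search + contrasts
    rubrics = evolve_rubrics(rubrics, candidates, responses)
    rewards = [reward(rubrics, resp) for resp in responses]
    return rubrics, rewards  # rewards would feed the policy update (omitted)


# Toy stand-ins for the policy and the rubric generator (both hypothetical).
def sample(question: str) -> List[str]:
    return ["RAG was proposed by Lewis et al.", "RAG is a retrieval method."]


def propose(question, responses, rubrics) -> List[Rubric]:
    return [Rubric("Lewis", 1.0), Rubric("RAG", 1.0)]


rubrics, rewards = rler_step("Who proposed RAG?", [], sample, propose)
```

In this toy run the candidate that both responses satisfy is dropped because it cannot tell them apart, which mirrors the requirement that rubrics stay discriminative.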

If this is right

Open models can be trained directly on realistic multi-step research tasks instead of relying on short-form proxies.
Performance on science, healthcare, and general long-form benchmarks improves over prior open agents, by 15.6% on average relative to Tongyi DR.
Comparable accuracy to closed deep-research systems becomes achievable at orders-of-magnitude lower inference cost.
Smaller models become viable for sustained open-ended research when supervision adapts throughout training.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same co-evolution idea could be tested on other open-ended domains such as code synthesis or multi-document synthesis where ground truth is hard to verify in advance.
If rubric quality scales with model capability, larger open models trained this way might widen the gap over fixed-rubric baselines.
Public release of both the trained model and the evolving-rubric code would let independent groups measure how rubric drift affects long-horizon factuality.

Load-bearing premise

Rubrics built from search results and contrasting model responses can keep evolving with the policy without introducing systematic errors or losing their power to distinguish good answers.

What would settle it

A controlled training run that applies RLER versus standard verifiable-reward RL on the same long-form benchmarks and finds no gain or a clear drop in final performance would indicate the evolving-rubric mechanism is not delivering the claimed benefit.

Original abstract

Deep research agents perform multi-step research to produce long-form, well-attributed answers. However, most open deep research agents are trained on easily verifiable short-form QA tasks via reinforcement learning with verifiable rewards, which does not extend to realistic long-form tasks. We address this with Reinforcement Learning with Evolving Rubrics (RLER), where rubrics are constructed and maintained to co-evolve with the policy model during training. This allows the rubrics to incorporate newly explored information from search and contrasting model responses, enabling better fact checking and more discriminative on-policy feedback. Using RLER, we develop Deep Research Tulu (DR Tulu-8B), the first fully open model that is directly trained for open-ended, long-form deep research. Across four long-form deep research benchmarks in science, healthcare, and general domains, DR Tulu substantially outperforms existing open deep research agents (by 15.6% over Tongyi DR on average) and matches or exceeds proprietary deep research agents (by 0.7% over OpenAI DR on average), while being significantly smaller and cheaper per query (1000x cheaper than OpenAI DR per query).

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces Reinforcement Learning with Evolving Rubrics (RLER) to enable training of deep research agents on realistic long-form, open-ended tasks. Rubrics are dynamically constructed and updated during training to incorporate new information from search results and contrasting model responses, providing on-policy feedback for fact-checking. This yields DR Tulu-8B, presented as the first fully open model directly trained for such tasks. On four long-form benchmarks spanning science, healthcare, and general domains, it reports average gains of 15.6% over open agents like Tongyi DR and parity or slight improvement (0.7%) over proprietary systems like OpenAI DR, while being smaller and substantially cheaper per query.

Significance. If the results hold under scrutiny, the work offers a practical approach to scaling reinforcement learning beyond short-form verifiable rewards to complex research agents. The open release of an 8B model achieving competitive performance on long-form tasks would be a useful contribution to the field, particularly for enabling further research on agent training methods that adapt rewards dynamically.

major comments (3)

[RLER method description] In the RLER method description, the process for constructing and maintaining rubrics from search results and contrasting responses provides no details on validation, filtering, or oversight steps to ensure accuracy of added content. This is load-bearing for the central claim of improved fact-checking and the headline performance gains, as unfiltered search data in open domains can introduce partial, outdated, or contradictory information that propagates into the reward signal.
[Evaluation and results] The evaluation reports average percentage gains across the four benchmarks without error bars, statistical significance tests, or details on rubric construction per benchmark or run. This undermines assessment of whether the 15.6% and 0.7% improvements are robust, especially for open-ended long-form outputs where variance is expected.
[Rubric co-evolution process] The description of rubric co-evolution does not address how update frequency and selection criteria (free parameters in the approach) are set or validated to preserve discriminative power. Without ablations or analysis showing that co-evolution does not reduce reliability over training, the assumption that rubrics remain effective for on-policy feedback requires stronger empirical grounding.

minor comments (2)

[Abstract] The abstract could more precisely specify the exact metrics underlying the reported averages and name the four benchmarks explicitly for immediate clarity.
[Method] Notation for rubric components and update rules could be formalized with equations or pseudocode to improve reproducibility.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We are grateful to the referee for their constructive feedback on our manuscript. We address each of the major comments in detail below and outline the revisions we plan to make to improve the clarity and rigor of the presentation.

Referee: In the RLER method description, the process for constructing and maintaining rubrics from search results and contrasting responses provides no details on validation, filtering, or oversight steps to ensure accuracy of added content. This is load-bearing for the central claim of improved fact-checking and the headline performance gains, as unfiltered search data in open domains can introduce partial, outdated, or contradictory information that propagates into the reward signal.

Authors: We thank the referee for this observation. The current description in Section 3 focuses on the high-level process of rubric evolution but omits specifics on validation and filtering. We will revise the manuscript to include a new paragraph detailing the oversight steps: specifically, we employ automated consistency checks by comparing added information against the top-k search results and discard entries that conflict with the majority of retrieved sources. This addition will better substantiate the reliability of the reward signal. revision: yes
Referee: The evaluation reports average percentage gains across the four benchmarks without error bars, statistical significance tests, or details on rubric construction per benchmark or run. This undermines assessment of whether the 15.6% and 0.7% improvements are robust, especially for open-ended long-form outputs where variance is expected.

Authors: We agree that the evaluation section would benefit from additional statistical analysis. In the revised version, we will report error bars based on multiple independent training runs and include results from statistical significance tests (e.g., Wilcoxon signed-rank test) comparing DR Tulu against baselines. We will also provide per-benchmark details on rubric construction parameters used during evaluation. revision: yes
Referee: The description of rubric co-evolution does not address how update frequency and selection criteria (free parameters in the approach) are set or validated to preserve discriminative power. Without ablations or analysis showing that co-evolution does not reduce reliability over training, the assumption that rubrics remain effective for on-policy feedback requires stronger empirical grounding.

Authors: We appreciate the call for more empirical validation of the co-evolution hyperparameters. We will add an ablation study in the supplementary material analyzing the impact of different update frequencies (e.g., every 100 vs. 500 steps) and selection criteria on rubric quality and downstream performance. The results will provide empirical grounding for the effectiveness of the evolving rubrics. revision: yes
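
The consistency check described in the first response is not specified further here, so the following Python sketch is only one possible reading of it. It assumes a hypothetical upstream step has already labelled each of the top-k retrieved sources as `support`, `conflict`, or `neutral` with respect to a candidate rubric entry; the function names and labels are illustrative and are not taken from the paper.

```python
from typing import Dict, List


def conflicts_with_majority(verdicts: List[str]) -> bool:
    """True if more than half of the retrieved sources contradict the entry."""
    conflicts = sum(1 for v in verdicts if v == "conflict")
    return conflicts * 2 > len(verdicts)


def filter_entries(candidates: Dict[str, List[str]]) -> List[str]:
    """Keep candidate entries that do not conflict with most sources."""
    return [entry for entry, verdicts in candidates.items()
            if not conflicts_with_majority(verdicts)]


# Hypothetical per-source labels for two candidate rubric entries.
candidates = {
    "Entry A": ["support", "support", "neutral"],
    "Entry B": ["conflict", "conflict", "support"],
}
kept = filter_entries(candidates)
```

Under this rule an entry with no retrieved sources is kept; a stricter variant might instead require at least one supporting source.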

Circularity Check

0 steps flagged

No significant circularity; results measured on external benchmarks

The paper introduces RLER as a training procedure in which rubrics co-evolve with the policy by incorporating search results and contrasting responses. Performance claims for DR Tulu-8B are then evaluated on four independent long-form benchmarks in science, healthcare, and general domains, with reported gains (15.6% over Tongyi DR, 0.7% over OpenAI DR) expressed relative to those external metrics. No equation or result reduces by construction to a fitted parameter or self-defined quantity inside the training loop, and no self-citation chain is invoked to justify uniqueness or forbid alternatives. The derivation is therefore self-contained against external, falsifiable benchmarks.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The method rests on the premise that dynamic rubric updates from external search and model contrasts yield reliable, improving feedback signals; no independent verification of this premise is provided in the abstract.

free parameters (1)

rubric update frequency and selection criteria
Parameters controlling how often and which new information from search and contrasts are folded into rubrics are not specified but required for co-evolution.

axioms (1)

domain assumption: New information from search and contrasting responses can be integrated into rubrics to improve fact-checking without introducing bias or noise.
Invoked in the description of how rubrics enable better on-policy feedback during training.

invented entities (1)

Evolving Rubrics (no independent evidence)
purpose: Provide co-evolving, discriminative feedback for RL on long-form research tasks.
New construct introduced to address limitations of fixed verifiable rewards on short-form QA.

pith-pipeline@v0.9.0 · 5827 in / 1429 out tokens · 74304 ms · 2026-05-21T18:08:10.403133+00:00 · methodology

Forward citations

Cited by 19 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Co-ReAct: Rubrics as Step-Level Collaborators for ReAct Agents
cs.AI 2026-05 unverdicted novelty 7.0

Co-ReAct adds step-level rubric guidance to ReAct agents via a GRPO-trained generator using list-wise ranking rewards, yielding consistent gains on DeepResearchBench and SQA-CS-V2.

Deep Reasoning in General Purpose Agents via Structured Meta-Cognition
cs.CL 2026-05 unverdicted novelty 7.0

DOLORES, an agent using a formal language for meta-reasoning to construct adaptive scaffolds on the fly, outperforms prior scaffolding methods by 24.8% on average across four hard benchmarks and multiple model sizes.

LLM Agents Already Know When to Call Tools -- Even Without Reasoning
cs.CL 2026-05 conditional novelty 7.0

LLMs encode tool necessity in pre-generation hidden states at AUROC 0.89-0.96, enabling Probe&Prefill to reduce tool calls 48% with 1.7% accuracy loss, outperforming prompt and reasoning baselines.

LLM Agents Already Know When to Call Tools -- Even Without Reasoning
cs.CL 2026-05 accept novelty 7.0

LLM agents encode tool necessity in pre-generation hidden states with high linear decodability (AUROC 0.89-0.96); Probe&Prefill uses this to reduce tool calls 48% with 1.7% accuracy loss.

Rubric-based On-policy Distillation
cs.LG 2026-05 unverdicted novelty 7.0

Rubric-based on-policy distillation allows training student models using only teacher responses by generating scoring rubrics from contrasts and using them for on-policy optimization, achieving superior performance an...

Alternating Reinforcement Learning with Contextual Rubric Rewards: Beyond the Scalarization Strategy
cs.LG 2026-03 unverdicted novelty 7.0

ARL-RR alternates optimization over rubric meta-classes with dynamic selection to avoid fixed scalarization, outperforming baselines on HealthBench.

Not Every Rubric Teaches Equally: Policy-Aware Rubric Rewards for RLVR
cs.AI 2026-05 unverdicted novelty 6.0

POW3R adapts rubric criterion weights via rollout contrast in RLVR to improve mean reward, strict completion rates, and training speed over static rubric aggregation on multimodal and text tasks.

Reward Hacking in Rubric-Based Reinforcement Learning
cs.AI 2026-05 unverdicted novelty 6.0

Rubric-based RL verifiers can be gamed via partial criterion satisfaction and implicit-to-explicit tricks, yielding proxy gains that do not improve quality under rubric-free judges; stronger verifiers reduce but do no...

DeltaRubric: Generative Multimodal Reward Modeling via Joint Planning and Verification
cs.CL 2026-05 unverdicted novelty 6.0

DeltaRubric decomposes multimodal preference evaluation into self-generated planning and verification steps within a single model, producing large accuracy improvements on VL-RewardBench via multi-role reinforcement learning.

MLS-Bench: A Holistic and Rigorous Assessment of AI Systems on Building Better AI
cs.LG 2026-05 unverdicted novelty 6.0

MLS-Bench shows that current AI agents fall short of reliably inventing generalizable ML methods, with engineering tuning easier than genuine invention.

Bootstrapping Post-training Signals for Open-ended Tasks via Rubric-based Self-play on Pre-training Text
cs.CL 2026-04 unverdicted novelty 6.0

POP bootstraps post-training signals for open-ended LLM tasks by synthesizing rubrics during self-play on pretraining corpus, yielding performance gains on Qwen-2.5-7B across healthcare QA, creative writing, and instr...

Train Separately, Merge Together: Modular Post-Training with Mixture-of-Experts
cs.LG 2026-04 unverdicted novelty 6.0

BAR trains independent domain experts via separate mid-training, SFT, and RL pipelines then composes them with a MoE router to match monolithic retraining performance at lower cost and without catastrophic forgetting.

Agent-World: Scaling Real-World Environment Synthesis for Evolving General Agent Intelligence
cs.AI 2026-04 unverdicted novelty 6.0

Agent-World autonomously synthesizes verifiable real-world tasks and uses continuous self-evolution to train 8B and 14B agents that outperform proprietary models on 23 benchmarks.

Olmo Hybrid: From Theory to Practice and Back
cs.LG 2026-04 conditional novelty 6.0

A 7B hybrid attention-recurrent model outperforms its pure-transformer counterpart on pretraining metrics and scales more efficiently, supported by a proof that hybrids are strictly more expressive than either transfo...

Self-Optimizing Multi-Agent Systems for Deep Research
cs.IR 2026-04 unverdicted novelty 6.0

Multi-agent deep research systems self-optimize prompts through self-play to match or outperform expert-crafted versions.

Differentiable Evolutionary Reinforcement Learning
cs.AI 2025-12 unverdicted novelty 6.0

DERL is a differentiable bi-level method that evolves optimal reward structures for RL policies by composing atomic primitives and using meta-gradients from validation performance.

GRC: Unifying Reasoning-Driven Generation, Retrieval and Compression
cs.CL 2026-05 unverdicted novelty 5.0

GRC unifies generation, retrieval, and compression in LLMs via meta latent tokens for single-pass execution with modular flexibility.

GRC: Unifying Reasoning-Driven Generation, Retrieval and Compression
cs.CL 2026-05 unverdicted novelty 5.0

GRC unifies generation, retrieval, and compression in LLMs via meta latent tokens for single-pass inference with modular flexibility.

Reward Hacking in the Era of Large Models: Mechanisms, Emergent Misalignment, Challenges
cs.LG 2026-04 unverdicted novelty 5.0

The paper introduces the Proxy Compression Hypothesis as a unifying framework explaining reward hacking in RLHF as an emergent result of compressing high-dimensional human objectives into proxy reward signals under op...

Reference graph

Works this paper leans on

44 extracted references · 44 canonical work pages · cited by 17 Pith papers · 1 internal anchor

[1]

Compute as Teacher: Turning Inference Compute Into Reference-Free Supervision

International Committee on Computational Linguistics. doi: 10.18653/v1/2020.coling-main.580. https://aclanthology.org/2020.coling-main.580/. Dulhan Jayalath, Shashwat Goel, Thomas Foster, Parag Jain, Suchin Gururangan, Cheng Zhang, Anirudh Goyal, and Alan Schelten. Compute as teacher: Turning inference compute into reference-free supervision. arXiv preprin...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.18653/v1/2020.coling-main.580 2020
[2]

**Positive Rubrics**: Excellence indicators distinguishing superior responses

work page
[3]

Provides clear explanations

**Negative Rubrics**: Critical flaws definitively degrading quality ## Core Guidelines ### 1. Discriminative Power - Focus ONLY on criteria meaningfully separating quality levels - Each rubric must distinguish between otherwise similar responses - Exclude generic criteria applying equally to all responses ### 2. Novelty & Non-Redundancy With existing/grou...

work page
[4]

Group responses by quality level

work page
[5]

Find factors separating higher/lower clusters

work page
[6]

Check if factors covered by existing rubrics

work page
[7]

question

Select criteria with highest discriminative value. Figure 11: System prompt for generating evolving rubrics. Note that this is the first half of the prompt and the second half is in Figure 12. Evolving Rubric Generation Prompt (Part 2) ## Output Format ```json { "question": "<original question verbatim>", "positive_rubrics": [ {"description": "<detailed excellence descript...

work page
[8]

**Question**: Original question being answered

work page
[9]

**Responses**: Multiple model responses (Response 1, Response 2, etc.)

work page
[10]

Figure 12: Continuation of the system prompt for generating evolving rubrics

**Existing Rubrics** (optional): Previously generated/ground truth rubrics ## Critical Reminders - Each rubric must distinguish between actual provided responses - Exclude rubrics applying equally to all responses - Prefer empty lists over redundancy when existing rubrics are comprehensive - Focus on observable, objective, actionable criteria - Quality ov...

work page
[11]

Answer format. Whether y encloses a final answer between <answer></answer> tags, producing a binary indicator $a(y) \in \{0,1\}$

work page
[12]

Citation format. Whether y contains at least one citation enclosed in <cite></cite> tags, producing $c(y) \in \{0,1\}$

work page
[13]

The response should mention the first paper that proposed RAG

Query format. Whether y includes at least one valid search query enclosed in <query></query> tags (or parser-specific equivalents), producing $q(y) \in \{0,1\}$. We then define a weighted format reward as $r_{\mathrm{fmt}}(x, y) = 0.5\,a(y) + 0.3\,c(y) + 0.2\,q(y)$, with $r_{\mathrm{fmt}}(x, y) \in [0,1]$. This reward acts as a low-cost signal that steers the model toward producing well-formed outputs align...

work page 2025
[14]

Read the question and the list of criteria carefully

work page
[15]

**Distinguishing referential vs

For each criterion, decide first whether it *makes a factual claim* — that is, whether it asserts something that can be verified as true or false in the real world. **Distinguishing referential vs. assertive phrasing:** - Referential (→NA): Criteria that only ask to *mention*, *explain*, *describe*, *discuss*, or *include information about* something, wit...

work page
[16]

If the criterion is about writing style, tone, clarity, structure, or formatting, or if it only requires mentioning or explaining topics without specifying factual assertions, return 'NA'

work page
[17]

- If evidence confirms it → factual and correct

For each factual claim, check whether it can be verified using reliable evidence or reasoning. - If evidence confirms it → factual and correct. - If reliable evidence contradicts it → factual but incorrect. - If no verifiable evidence is found (e.g., no data, no known sources) → factual but *unverified*

work page
[18]

- Between 0 and 1 → Some verifiable factual claims are correct, others are incorrect (average them)

Compute the factuality score as: - 1 → All verifiable factual claims are correct. - Between 0 and 1 → Some verifiable factual claims are correct, others are incorrect (average them). - 0 → All verifiable factual claims are incorrect. - 'NA' → None of the criteria makes any factual claims

work page
[19]

Do *not* lower the score for claims that are unverified (i.e., lacking evidence) unless there is evidence showing they are *false*

work page
[20]

factual_score

Also count how many criteria are assertive but unverified. Output Format: Return your result strictly in JSON format as follows: {{"factual_score": <float_or_"NA">, "explanation": "<short explanation>", "num_non_na_criteria": <number>, "num_na_criteria": <number>, "num_unverified_assertive_criteria": <number>}} Now evaluate the following: Question: {quest...

work page
[21]

Requires external knowledge (factual/domain content from web/docs/papers/data)

work page
[22]

Requires complex planning (multi-source search, comparisons, aggregation, synthesis)

work page
[23]

Often expects long-form responses

work page
[24]

Cannot be answered well from parametric knowledge alone (up-to-date or niche)

work page
[25]

modern PCs

Is evaluable by a single answer or clear rubrics (metrics, dates, versions, counts). Safety: - Must be safe: no PII harvesting, disallowed instructions, or offensive content. Scoring (integers only): 1 = Trivial/chit-chat; no retrieval; not evaluable. 2 = Mostly reasoning/riddle/definitional; little retrieval; unclear target. 3 = Some retrieval and synthe...

work page
[26]

- Ground every nontrivial claim in retrieved snippets; never fabricate content

Operating Principles, Process & Guidelines 1.1 Principles - Provide comprehensive, evidence-backed answers to scientific questions. - Ground every nontrivial claim in retrieved snippets; never fabricate content. Cite using <cite id="...">...</cite> drawn only from returned snippets. - Prefer authoritative sources (peer-reviewed papers, reputable benchmark...

work page
[27]

**Initial plan** — Begin with a`<think>`that decomposes the question, lists assumptions, outlines a concrete search plan (start broad -> ablations/benchmarks -> domain-specific; include venues/years), and defines the first query

work page
[28]

- Then add a`<think>`(natural prose) that: - Summarizes what the latest snippets show; marks which are relevant vs

**Query -> Snippets -> Think** — For each iteration: - Run a`<call_tool>`and read the returned`<snippet>`results. - Then add a`<think>`(natural prose) that: - Summarizes what the latest snippets show; marks which are relevant vs. irrelevant **and why**. - Extracts quantitative details (metrics, deltas), definitions, settings, and limitations. - States wha...

work page
[29]

ingredients

**Sufficiency check** — When evidence is adequate for a precise answer (including trade-offs), synthesize a single`<answer>`with section headers and inline citations. Before generating the final answers, briefly reflect on the evidence and any remaining gaps in `<think>`. Carefully think about the structure of the responses, write it down inside <think>, ...

work page 2048
[30]

google_search

google_search - Purpose: general web search. - Input via: <call_tool name="google_search">your query</call_tool> - Output: web search snippets (see SEARCH RESULTS). - Optional parameters - gl: geolocation - hl: host language

work page
[31]

browse_webpage

browse_webpage - Purpose: open a specific URL (typically one returned by google_search) and extract readable page text as snippets. - Input via: <call_tool name="browse_webpage">https://example.com/article</call_tool> - Output: webpage (see SEARCH RESULTS)

work page
[32]

snippet_search

snippet_search - Purpose: focused snippet retrieval from scientific papers - Input via: <call_tool name="snippet_search">your query</call_tool> - Output: snippets from existing papers (see SEARCH RESULTS). - Examples: <call_tool name="snippet_search" limit="8" year="2021-2025" fieldsOfStudy="Computer Science, Medicine">large language model retrieval evalu...

work page 2021
[33]

Start with surveys/reviews linking AI in education to learning theories (2020–2025), ideally in AIED, BJET, Computers & Education, Review of Educational Research, LAK, EDM venues

work page 2020
[34]

Retrieve authoritative sources on ITS and adaptive learning (classic + recent advances like knowledge tracing, deep KT, mastery learning, LLM tutoring)

work page
[35]

Gather learning analytics frameworks and applications (early warning systems, dashboards, ethics)

work page
[36]

snippet_search

Collect NLP-in-education applications (automated scoring, feedback, dialogue tutoring, LLMs) and evidence on teacher efficiency and student engagement. First query will target a broad review tying theories and AI models together. </think> <call_tool name="snippet_search" year="2020-2025" limit="10" fieldsOfStudy="Education,Computer Science">AI in educatio...

work page 2020
[37]

narrow" vs

Search for authoritative definitions of "narrow" vs "broad" stakeholder engagement in CSR

work page
[38]

narrow scope

Find sources that explicitly define "narrow scope" CSR (e.g., compliance-only, triple bottom line metrics) vs "broad scope" (stakeholder dialogue, materiality, governance, long-term value)

work page
[39]

Retrieve evidence on who the reports address (shareholders vs a wider stakeholder community), including practitioner frameworks (e.g., AA1000AS stakeholder engagement principle)

work page
[40]

narrow and broad approaches to CSR reporting definition stakeholders scope

Cross-check with standards like GRI (comprehensive, wide stakeholder audience) and SASB (focused, investor-material), to map scope vs audience. First query: "narrow and broad approaches to CSR reporting definition stakeholders scope". </think> <call_tool name="google_search" num="5" gl="us" hl="en">narrow and broad approaches to CSR reporting definition s...

work page doi:10.1186/s40991-024-00094-y
[41]

Find authoritative definitions and contrasts (preferably from academic papers and reputable frameworks)

work page
[42]

Use the first Google search results to identify key sources, then browse pages for precise language on scope and audience

work page
[43]

Corroborate with recognized reporting frameworks (AA1000AS stakeholder engagement principle; GRI vs SASB scope differences)

work page
[44]

AA1000AS stakeholder engagement principle 1 materiality definition

Ensure at least four searches: initial Google search done; next, targeted searches for AA1000AS, GRI/SASB comparison, and a paper explicitly discussing narrow vs broad CSR reporting scope. Reflection on first search: - PMC article (S_d7607bb2-0) mentions that approach choice depends on intended scope and target audience—relevant framing. - Sustainability ...

work page arXiv 2011
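
Entries [11] to [13] above describe a weighted format reward over three binary checks. A small Python sketch of that formula follows. The weights 0.5, 0.3, and 0.2 come from the extracted text, while the regular expressions, the treatment of any non-empty query as "valid", and the function name `format_reward` are assumptions made for illustration.

```python
import re

# Weights as given in the extracted formula: 0.5*a(y) + 0.3*c(y) + 0.2*q(y)
W_ANSWER, W_CITE, W_QUERY = 0.5, 0.3, 0.2

ANSWER_RE = re.compile(r"<answer>.*?</answer>", re.DOTALL)
# Allow attributes such as id="..." on the opening cite tag (assumption).
CITE_RE = re.compile(r"<cite\b[^>]*>.*?</cite>", re.DOTALL)
QUERY_RE = re.compile(r"<query>(.*?)</query>", re.DOTALL)


def format_reward(y: str) -> float:
    """Return the weighted format reward for one response string y."""
    a = 1 if ANSWER_RE.search(y) else 0   # final answer is tagged
    c = 1 if CITE_RE.search(y) else 0     # at least one citation is tagged
    # Assumption: a query counts as "valid" if its text is non-empty.
    q = 1 if any(text.strip() for text in QUERY_RE.findall(y)) else 0
    return W_ANSWER * a + W_CITE * c + W_QUERY * q


# A response that satisfies all three checks.
example = (
    "<query>first paper proposing RAG</query>"
    '<answer>See <cite id="S1">the cited snippet</cite>.</answer>'
)
score = format_reward(example)
```

Because each indicator is 0 or 1 and the weights sum to one, the reward stays within [0, 1], and a response with only a tagged answer receives 0.5.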

[1] [1]

Compute as Teacher: Turning Inference Compute Into Reference-Free Supervision

International Committee on Computational Linguistics. doi: 10.18653/v1/2020.coling-main.580.https:// aclanthology.org/2020.coling-main.580/. Dulhan Jayalath, Shashwat Goel, Thomas Foster, Parag Jain, Suchin Gururangan, Cheng Zhang, Anirudh Goyal, and Alan Schelten. Compute as teacher: Turning inference compute into reference-free supervision.arXiv preprin...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.18653/v1/2020.coling-main.580.https:// 2020

[2] [2]

**Positive Rubrics**: Excellence indicators distinguishing superior responses

work page

[3] [3]

Provides clear explanations

**Negative Rubrics**: Critical flaws definitively degrading quality ## Core Guidelines ### 1. Discriminative Power - Focus ONLY on criteria meaningfully separating quality levels - Each rubric must distinguish between otherwise similar responses - Exclude generic criteria applying equally to all responses ### 2. Novelty & Non-Redundancy With existing/grou...

work page

[4] [4]

Group responses by quality level

work page

[5] [5]

Find factors separating higher/lower clusters

work page

[6] [6]

Check if factors covered by existing rubrics

work page

[7] [7]

question

Select criteria with highest discriminative value Figure 11Systempromptforgeneratingevolvingrubrics.Notethatthisisthefirst-halfofthepromptandthesecond-half is in Figure 12 27 Evolving Rubric Generation Prompt (Part 2) ## Output Format ```json { "question": "<original question verbatim>", "positive_rubrics": [ {"description": "<detailed excellence descript...

[8]

**Question**: Original question being answered

[9]

**Responses**: Multiple model responses (Response 1, Response 2, etc.)

[10]

Figure 12: Continuation of the system prompt for generating evolving rubrics.

**Existing Rubrics** (optional): Previously generated/ground truth rubrics

## Critical Reminders
- Each rubric must distinguish between actual provided responses
- Exclude rubrics applying equally to all responses
- Prefer empty lists over redundancy when existing rubrics are comprehensive
- Focus on observable, objective, actionable criteria
- Quality ov...

[11]

Answer format. Whether $y$ encloses a final answer between `<answer></answer>` tags, producing a binary indicator $a(y) \in \{0,1\}$.

[12]

Citation format. Whether $y$ contains at least one citation enclosed in `<cite></cite>` tags, producing $c(y) \in \{0,1\}$.

[13]

The response should mention the first paper that proposed RAG

Query format. Whether $y$ includes at least one valid search query enclosed in `<query></query>` tags (or parser-specific equivalents), producing $q(y) \in \{0,1\}$. We then define a weighted format reward as

$$r_{\text{fmt}}(x, y) = 0.5\,a(y) + 0.3\,c(y) + 0.2\,q(y), \qquad r_{\text{fmt}}(x, y) \in [0,1].$$

This reward acts as a low-cost signal that steers the model toward producing well-formed outputs align...
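The weighted format reward can be illustrated with a minimal Python sketch. Only the weights 0.5, 0.3, and 0.2 and the three tag names come from the text; the helper names `has_tag` and `format_reward`, the regex-based detection, and the requirement that a tag be non-empty are assumptions made for illustration, not the authors' implementation.

```python
import re

# Weights as stated in the text: answer 0.5, citation 0.3, query 0.2.
W_ANSWER, W_CITE, W_QUERY = 0.5, 0.3, 0.2


def has_tag(text: str, tag: str) -> int:
    """Return 1 if text contains a non-empty <tag ...>...</tag> span, else 0.

    Treating an empty span as missing is an assumption of this sketch.
    """
    pattern = rf"<{tag}\b[^>]*>\s*\S.*?</{tag}>"
    return int(re.search(pattern, text, flags=re.DOTALL) is not None)


def format_reward(y: str) -> float:
    """Compute 0.5*a(y) + 0.3*c(y) + 0.2*q(y) for a model response y."""
    a = has_tag(y, "answer")  # a(y): final answer present
    c = has_tag(y, "cite")    # c(y): at least one citation present
    q = has_tag(y, "query")   # q(y): at least one search query present
    return W_ANSWER * a + W_CITE * c + W_QUERY * q
```

A response with an answer but no citation or query would score 0.5 under this sketch, so the answer tag dominates the signal while citations and queries add smaller increments.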

[14]

Read the question and the list of criteria carefully

[15]

For each criterion, decide first whether it *makes a factual claim*, that is, whether it asserts something that can be verified as true or false in the real world.

**Distinguishing referential vs. assertive phrasing:**
- Referential (→ NA): Criteria that only ask to *mention*, *explain*, *describe*, *discuss*, or *include information about* something, wit...

[16]

If the criterion is about writing style, tone, clarity, structure, or formatting, or if it only requires mentioning or explaining topics without specifying factual assertions, return 'NA'

[17]

For each factual claim, check whether it can be verified using reliable evidence or reasoning.
- If evidence confirms it → factual and correct.
- If reliable evidence contradicts it → factual but incorrect.
- If no verifiable evidence is found (e.g., no data, no known sources) → factual but *unverified*.

[18]

Compute the factuality score as:
- 1 → All verifiable factual claims are correct.
- Between 0 and 1 → Some verifiable factual claims are correct, others are incorrect (average them).
- 0 → All verifiable factual claims are incorrect.
- 'NA' → None of the criteria makes any factual claims.
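The scoring rule above amounts to averaging over verifiable claims only. The following sketch shows that arithmetic, assuming each criterion has already been labeled as `correct`, `incorrect`, `unverified`, or `NA`. The function name `factuality_score` and the label strings are hypothetical, and returning 1.0 when every factual claim is unverified is an assumption of this sketch, since the prompt does not state that case explicitly.

```python
from typing import List, Union


def factuality_score(verdicts: List[str]) -> Union[float, str]:
    """Aggregate per-criterion verdicts into a factuality score.

    verdicts holds one label per criterion: 'correct', 'incorrect',
    'unverified', or 'NA' (the criterion makes no factual claim).
    """
    factual = [v for v in verdicts if v != "NA"]
    if not factual:
        # None of the criteria makes a factual claim.
        return "NA"
    correct = factual.count("correct")
    incorrect = factual.count("incorrect")
    if correct + incorrect == 0:
        # Assumption: unverified claims never lower the score, so a set of
        # purely unverified claims is treated as unpenalized.
        return 1.0
    # Unverified claims are excluded from the average.
    return correct / (correct + incorrect)
```

For example, one correct, one incorrect, and one unverified criterion average to 0.5 here, because the unverified criterion is left out of the denominator.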

[19]

Do *not* lower the score for claims that are unverified (i.e., lacking evidence) unless there is evidence showing they are *false*

[20]

Also count how many criteria are assertive but unverified. Output Format: Return your result strictly in JSON format as follows: {{"factual_score": <float_or_"NA">, "explanation": "<short explanation>", "num_non_na_criteria": <number>, "num_na_criteria": <number>, "num_unverified_assertive_criteria": <number>}} Now evaluate the following: Question: {quest...

[21]

Requires external knowledge (factual/domain content from web/docs/papers/data)

[22]

Requires complex planning (multi-source search, comparisons, aggregation, synthesis)

[23]

Often expects long-form responses

[24]

Cannot be answered well from parametric knowledge alone (up-to-date or niche)

[25]

Is evaluable by a single answer or clear rubrics (metrics, dates, versions, counts).

Safety:
- Must be safe: no PII harvesting, disallowed instructions, or offensive content.

Scoring (integers only):
- 1 = Trivial/chit-chat; no retrieval; not evaluable.
- 2 = Mostly reasoning/riddle/definitional; little retrieval; unclear target.
- 3 = Some retrieval and synthe...

[26]

Operating Principles, Process & Guidelines

1.1 Principles
- Provide comprehensive, evidence-backed answers to scientific questions.
- Ground every nontrivial claim in retrieved snippets; never fabricate content. Cite using `<cite id="...">...</cite>` drawn only from returned snippets.
- Prefer authoritative sources (peer-reviewed papers, reputable benchmark...

[27]

**Initial plan**: Begin with a `<think>` that decomposes the question, lists assumptions, outlines a concrete search plan (start broad -> ablations/benchmarks -> domain-specific; include venues/years), and defines the first query

[28]

**Query -> Snippets -> Think**: For each iteration:
- Run a `<call_tool>` and read the returned `<snippet>` results.
- Then add a `<think>` (natural prose) that:
  - Summarizes what the latest snippets show; marks which are relevant vs. irrelevant **and why**.
  - Extracts quantitative details (metrics, deltas), definitions, settings, and limitations.
  - States wha...

[29]

**Sufficiency check**: When evidence is adequate for a precise answer (including trade-offs), synthesize a single `<answer>` with section headers and inline citations. Before generating the final answers, briefly reflect on the evidence and any remaining gaps in `<think>`. Carefully think about the structure of the responses, write it down inside `<think>`, ...

[30]

google_search
- Purpose: general web search.
- Input via: `<call_tool name="google_search">your query</call_tool>`
- Output: web search snippets (see SEARCH RESULTS).
- Optional parameters
  - gl: geolocation
  - hl: host language

[31]

browse_webpage
- Purpose: open a specific URL (typically one returned by google_search) and extract readable page text as snippets.
- Input via: `<call_tool name="browse_webpage">https://example.com/article</call_tool>`
- Output: webpage (see SEARCH RESULTS)

[32]

snippet_search
- Purpose: focused snippet retrieval from scientific papers
- Input via: `<call_tool name="snippet_search">your query</call_tool>`
- Output: snippets from existing papers (see SEARCH RESULTS).
- Examples: <call_tool name="snippet_search" limit="8" year="2021-2025" fieldsOfStudy="Computer Science, Medicine">large language model retrieval evalu...
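To make the tool-call syntax described above concrete, here is a small Python sketch that extracts `<call_tool>` invocations from a model response. The tag name and attributes such as `name`, `limit`, and `year` appear in the prompt; the function `parse_tool_calls` and the regex-based parsing are assumptions for illustration and are not taken from the DR Tulu codebase.

```python
import re
from typing import Dict, List, Tuple

# Matches <call_tool attr="...">query</call_tool>; DOTALL allows multi-line queries.
CALL_RE = re.compile(r"<call_tool\s+([^>]*)>(.*?)</call_tool>", re.DOTALL)
# Matches double-quoted attributes such as name="snippet_search".
ATTR_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_tool_calls(text: str) -> List[Tuple[str, Dict[str, str], str]]:
    """Return (tool_name, optional_parameters, query) for each call in text."""
    calls = []
    for attr_str, body in CALL_RE.findall(text):
        attrs = dict(ATTR_RE.findall(attr_str))
        # The tool name is carried in the name attribute; the rest are parameters.
        name = attrs.pop("name", "")
        calls.append((name, attrs, body.strip()))
    return calls
```

Applied to a call like `<call_tool name="google_search" gl="us" hl="en">some query</call_tool>`, this would yield the tool name, a dictionary holding `gl` and `hl`, and the query string, which a dispatcher could then route to the matching search backend.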

[33]

Start with surveys/reviews linking AI in education to learning theories (2020–2025), ideally in AIED, BJET, Computers & Education, Review of Educational Research, LAK, EDM venues

[34]

Retrieve authoritative sources on ITS and adaptive learning (classic + recent advances like knowledge tracing, deep KT, mastery learning, LLM tutoring)

[35]

Gather learning analytics frameworks and applications (early warning systems, dashboards, ethics)

[36]

Collect NLP-in-education applications (automated scoring, feedback, dialogue tutoring, LLMs) and evidence on teacher efficiency and student engagement. First query will target a broad review tying theories and AI models together. </think> <call_tool name="snippet_search" year="2020-2025" limit="10" fieldsOfStudy="Education,Computer Science">AI in educatio...

[37]

Search for authoritative definitions of "narrow" vs "broad" stakeholder engagement in CSR

[38]

Find sources that explicitly define "narrow scope" CSR (e.g., compliance-only, triple bottom line metrics) vs "broad scope" (stakeholder dialogue, materiality, governance, long-term value)

[39]

Retrieve evidence on who the reports address (shareholders vs a wider stakeholder community), including practitioner frameworks (e.g., AA1000AS stakeholder engagement principle)

[40]

Cross-check with standards like GRI (comprehensive, wide stakeholder audience) and SASB (focused, investor-material), to map scope vs audience. First query: "narrow and broad approaches to CSR reporting definition stakeholders scope". </think> <call_tool name="google_search" num="5" gl="us" hl="en">narrow and broad approaches to CSR reporting definition s...

[41]

Find authoritative definitions and contrasts (preferably from academic papers and reputable frameworks)

[42]

Use the first Google search results to identify key sources, then browse pages for precise language on scope and audience

[43]

Corroborate with recognized reporting frameworks (AA1000AS stakeholder engagement principle; GRI vs SASB scope differences)

[44]

AA1000AS stakeholder engagement principle 1 materiality definition
