
arxiv: 2605.29886 · v1 · pith:3RBPGPZBnew · submitted 2026-05-28 · 💻 cs.CL · cs.AI

CRITIC-R1: Learning Structured Critics for Retrieval-Augmented Generation

Pith reviewed 2026-06-29 07:46 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords retrieval-augmented generation · structured critic · reinforcement learning · error diagnosis · question answering · GRPO · hallucination reduction

The pith

CRITIC-R1 trains a structured critic via reinforcement learning to diagnose RAG errors along explicit dimensions rather than giving coarse feedback.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to reduce hallucinations and reasoning errors in retrieval-augmented generation by replacing coarse external critics with a model that explicitly diagnoses errors. It breaks RAG failures into four diagnostic dimensions and trains the critic with GRPO reinforcement learning guided by two alignment rewards. The Conservative Judgement Alignment reward curbs over-aggressive corrections while Diagnostic Quality Alignment improves fine-grained analysis through gated signals. Process-level labels come from external LLM teachers. Experiments on five QA benchmarks show the resulting critic lifts answer quality over strong RAG baselines.

Core claim

CRITIC-R1 formulates RAG critique as an explicit error-diagnosis task and solves it by training a critic model with GRPO-based reinforcement learning on process-level supervision from LLM teachers. The model outputs verdicts, error locations, reasoning analyses, and fix generations. Training uses Conservative Judgement Alignment to produce calibrated high-level judgments and Diagnostic Quality Alignment with gated rewards to refine the diagnostic dimensions.
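The four output dimensions named above can be pictured as a small record type. The following is a minimal sketch, not the paper's actual schema: the field names, the JSON encoding, and the `correct` / `incorrect` label set are all assumptions made for illustration.

```python
import json
from dataclasses import dataclass
from typing import Optional

# Hypothetical label set; the paper's actual verdict vocabulary is not given here.
VERDICTS = {"correct", "incorrect"}


@dataclass(frozen=True)
class Critique:
    """One structured critique with the four dimensions named in the summary."""
    verdict: str                   # high-level judgement of the RAG answer
    error_location: Optional[str]  # where the error occurs, if any
    reasoning_analysis: str        # explanation of what went wrong
    fix: Optional[str]             # suggested correction, if any


def parse_critique(raw: str) -> Critique:
    """Parse a JSON-encoded critique (assumed format) and validate the verdict."""
    data = json.loads(raw)
    verdict = data["verdict"]
    if verdict not in VERDICTS:
        raise ValueError(f"unknown verdict: {verdict!r}")
    return Critique(
        verdict=verdict,
        error_location=data.get("error_location"),
        reasoning_analysis=data.get("reasoning_analysis", ""),
        fix=data.get("fix"),
    )
```

Keeping the verdict as a validated field separate from the free-text dimensions makes it straightforward to score the high-level judgement and the fine-grained diagnosis independently.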

What carries the argument

Structured error-diagnosis critic with four output dimensions trained by GRPO RL under Conservative Judgement Alignment and Diagnostic Quality Alignment rewards.
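To make the reward machinery concrete, here is a rough sketch of how a conservative verdict reward, a gated diagnostic reward, and GRPO's group-relative advantage could fit together. Only the group normalization (reward minus group mean, divided by group standard deviation) reflects GRPO as commonly described. The reward values, the asymmetric penalty, the gate condition (verdict must match the teacher), and all function names are assumptions; the paper's actual CJA and DQA definitions are not reproduced on this page.

```python
from statistics import mean, pstdev
from typing import List


def cja_reward(pred_verdict: str, teacher_verdict: str) -> float:
    """Assumed conservative verdict reward: flagging a correct answer as
    incorrect (over-aggressive) is penalized more than missing an error."""
    if pred_verdict == teacher_verdict:
        return 1.0
    if teacher_verdict == "correct" and pred_verdict == "incorrect":
        return -1.0  # over-aggressive intervention (illustrative value)
    return -0.5      # missed error (illustrative value)


def dqa_reward(pred_verdict: str, teacher_verdict: str, diagnostic_score: float) -> float:
    """Assumed gate: the diagnostic score counts only when the verdict matches."""
    return diagnostic_score if pred_verdict == teacher_verdict else 0.0


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """GRPO-style advantage: normalize each reward within its sampled group.
    Population standard deviation is used here as a simplifying choice."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]
```

Under this sketch, a critique with a wrong verdict earns nothing from its diagnostic fields, so the policy cannot offset a miscalibrated judgement with fluent analysis.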

If this is right

  • Answer quality rises consistently across five QA benchmarks relative to strong RAG baselines.
  • Over-aggressive interventions decrease because high-level verdicts are calibrated.
  • Fine-grained diagnostic feedback becomes more reliable through the gated reward design.
  • Hallucinations and subtle reasoning errors are reduced by targeted location and fix signals.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same diagnostic dimensions could be reused to evaluate or debug non-RAG generation pipelines if the error taxonomy is extended.
  • Process-level RL supervision from teachers might lower the need for human preference data in critic training.
  • If the four dimensions prove stable, they could serve as an automatic rubric for comparing future RAG systems.

Load-bearing premise

Process-level supervision collected from external LLM teacher models is sufficiently accurate and unbiased.

What would settle it

Running the trained critic on a held-out set of RAG traces where the teacher LLM labels are independently shown to be wrong, then measuring whether answer quality still rises over baselines.

Figures

Figures reproduced from arXiv: 2605.29886 by Chuanyue Yu, Jianxin Li, Qingyun Sun, Runhua Xu, Wenhan Xiao, Xingcheng Fu, Ziwei Zhang.

Figure 1: An illustrative comparison of different RAG paradigms. (A) LLM-only method directly generates an ...
Figure 2: An overview of CRITIC-R1: (a) We formulate RAG critic as a structured critique framework, including ...
Figure 3: Error distribution on HotpotQA using Search ...
Figure 4: The confusion matrix of CRITIC-R1 for error ...
Figure 5: The confusion matrix of CRITIC-R1 for error ...
Figure 6: Critique distributions before and after training.
Figure 7: Prompt templates used in the structured cri...

Original abstract

Retrieval-augmented generation (RAG) improves knowledge-intensive question answering by incorporating external evidence. However, existing RAG methods still suffer from hallucinations and subtle reasoning errors. Recent studies introduce external critics to refine RAG outputs, yet they often provide coarse-grained and weakly structured feedback, exhibit over-aggressive intervention, and lead to noisy and unreliable refinement, limiting their effectiveness for correction. To tackle these issues, we propose CRITIC-R1, a structured critic framework that formulates and learns RAG critique as an explicit error diagnosis problem using reinforcement learning (RL). Our framework categorizes common RAG errors into multiple diagnostic dimensions, including verdict, error location, reasoning analysis, and fix generation. To learn these capabilities, we design two reward functions: Conservative Judgement Alignment (CJA) first encourages calibrated high-level judgements while mitigating the over-aggressive phenomenon, whereas Diagnostic Quality Alignment (DQA) further improves fine-grained diagnostic feedback through gated rewards. We train the critic model using GRPO-based RL with process-level supervision collected from external LLM teacher models. Experiments across five QA benchmarks show that CRITIC-R1 consistently improves answer quality over strong RAG baselines. Our source code is available at https://anonymous.4open.science/r/critic-r1-FCB0

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes CRITIC-R1, a structured critic framework for RAG that formulates error diagnosis as an explicit multi-dimensional problem (verdict, error location, reasoning analysis, fix generation) and trains the critic via GRPO reinforcement learning. Process-level supervision is collected from external LLM teachers to define two reward functions: Conservative Judgement Alignment (CJA) for calibrated high-level judgments and Diagnostic Quality Alignment (DQA) for gated fine-grained feedback. Experiments across five QA benchmarks are reported to show consistent gains over strong RAG baselines, with source code released.

Significance. If the central claims hold after addressing validation gaps, the work could meaningfully advance RAG refinement by moving beyond coarse external critics to learned, structured diagnosis that mitigates over-aggressive intervention. Explicit credit is due for releasing source code at the provided anonymous repository, which supports reproducibility.

major comments (2)
  1. [Training procedure and reward definitions] The training procedure relies entirely on process-level labels (verdict, error location, reasoning analysis, fix generation) collected from external LLM teachers to define both CJA and DQA rewards, yet no human validation, inter-teacher agreement metrics, or systematic error analysis of the teacher outputs is described. This is load-bearing for the claim that performance gains arise from improved diagnosis rather than teacher mimicry, as any systematic teacher bias would be internalized by the GRPO-trained critic.
  2. [Experiments section] The abstract states that experiments across five QA benchmarks show consistent improvements, but provides no information on baselines, metrics, ablation controls for the diagnostic dimensions, or statistical significance. Without these, it is impossible to assess whether the structured critic contributes beyond what the teacher labels already encode.
minor comments (1)
  1. The repository link is given as anonymous; consider replacing it with a permanent, non-anonymous URL in the camera-ready version.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive comments. We address each major comment below and describe the revisions we will make to strengthen the manuscript.

Point-by-point responses
  1. Referee: [Training procedure and reward definitions] The training procedure relies entirely on process-level labels (verdict, error location, reasoning analysis, fix generation) collected from external LLM teachers to define both CJA and DQA rewards, yet no human validation, inter-teacher agreement metrics, or systematic error analysis of the teacher outputs is described. This is load-bearing for the claim that performance gains arise from improved diagnosis rather than teacher mimicry, as any systematic teacher bias would be internalized by the GRPO-trained critic.

    Authors: We agree that the absence of human validation and agreement metrics for the teacher-generated labels is a limitation that weakens the claim distinguishing learned diagnosis from mimicry. In the revised manuscript we will add a new subsection under Section 3.3 that reports (i) human evaluation of a random sample of 200 teacher labels across the four diagnostic dimensions, (ii) inter-annotator agreement (Cohen’s kappa) between two human raters, and (iii) a qualitative error analysis of the most frequent teacher mistakes. These additions will be used to qualify the reliability of the process-level supervision. revision: yes

  2. Referee: [Experiments section] The abstract states that experiments across five QA benchmarks show consistent improvements, but provides no information on baselines, metrics, ablation controls for the diagnostic dimensions, or statistical significance. Without these, it is impossible to assess whether the structured critic contributes beyond what the teacher labels already encode.

    Authors: The abstract is intentionally brief; the full experimental details appear in Section 4. Nevertheless, we acknowledge that the current presentation does not sufficiently highlight the requested elements. In revision we will (i) expand the abstract with one additional sentence naming the five benchmarks, the primary metric (exact match), and the main baselines, (ii) add a dedicated ablation table isolating each diagnostic dimension, and (iii) report statistical significance (paired t-tests with p-values) for all main results. These changes will make the contribution of the structured critic clearer. revision: yes
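The two statistics promised in the responses above, Cohen's kappa for rater agreement and a paired t-test over per-question scores, follow standard formulas. A minimal standard-library sketch is given below; the function names and the example inputs are illustrative and do not come from the paper. Only the t statistic is computed, since obtaining a p-value requires the t distribution's CDF from a statistics library.

```python
from collections import Counter
from math import sqrt
from statistics import mean, stdev
from typing import Hashable, Sequence


def cohens_kappa(rater_a: Sequence[Hashable], rater_b: Sequence[Hashable]) -> float:
    """Cohen's kappa: (p_o - p_e) / (1 - p_e) for two raters on the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: fraction of items with identical labels.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # degenerate case: both raters used a single identical label
    return (p_o - p_e) / (1 - p_e)


def paired_t_statistic(system: Sequence[float], baseline: Sequence[float]) -> float:
    """t statistic for paired per-question scores (e.g. 0/1 exact match)."""
    if len(system) != len(baseline) or len(system) < 2:
        raise ValueError("need at least two paired observations")
    diffs = [s - b for s, b in zip(system, baseline)]
    sd = stdev(diffs)  # sample standard deviation of the differences
    if sd == 0:
        raise ValueError("all differences are identical; t is undefined")
    return mean(diffs) / (sd / sqrt(len(diffs)))
```

For example, two raters who agree on three of four labels with marginals (2, 2) and (1, 3) have observed agreement 0.75 and chance agreement 0.5, giving a kappa of 0.5.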

Circularity Check

0 steps flagged

No significant circularity in derivation chain

Full rationale

The paper presents an empirical method: it defines diagnostic dimensions and two custom reward functions (CJA, DQA) that operate on process-level labels collected from external LLM teachers, then trains via GRPO RL and reports benchmark gains. No equations, predictions, or first-principles claims are shown to reduce by construction to the inputs; the central result is an observed performance delta rather than a tautological renaming or self-referential fit. No self-citations appear as load-bearing premises. The approach is self-contained as a supervised RL pipeline whose validity rests on external validation of the teacher labels and the reported experiments, not on internal definitional equivalence.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are stated in the provided text.

pith-pipeline@v0.9.1-grok · 5777 in / 1060 out tokens · 23951 ms · 2026-06-29T07:46:19.409341+00:00 · methodology

