CRITIC-R1: Learning Structured Critics for Retrieval-Augmented Generation
Pith reviewed 2026-06-29 07:46 UTC · model grok-4.3
The pith
CRITIC-R1 trains a structured critic via reinforcement learning to diagnose RAG errors along explicit dimensions rather than giving coarse feedback.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
CRITIC-R1 formulates RAG critique as an explicit error-diagnosis task and solves it by training a critic model with GRPO-based reinforcement learning on process-level supervision from LLM teachers. The model outputs verdicts, error locations, reasoning analyses, and fix generations. Training uses Conservative Judgement Alignment to produce calibrated high-level judgments and Diagnostic Quality Alignment with gated rewards to refine the diagnostic dimensions.
What carries the argument
Structured error-diagnosis critic with four output dimensions trained by GRPO RL under Conservative Judgement Alignment and Diagnostic Quality Alignment rewards.
If this is right
- Answer quality rises consistently across five QA benchmarks relative to strong RAG baselines.
- Over-aggressive interventions decrease because high-level verdicts are calibrated.
- Fine-grained diagnostic feedback becomes more reliable through the gated reward design.
- Hallucinations and subtle reasoning errors are reduced by targeted location and fix signals.
Where Pith is reading between the lines
- The same diagnostic dimensions could be reused to evaluate or debug non-RAG generation pipelines if the error taxonomy is extended.
- Process-level RL supervision from teachers might lower the need for human preference data in critic training.
- If the four dimensions prove stable, they could serve as an automatic rubric for comparing future RAG systems.
Load-bearing premise
Process-level supervision collected from external LLM teacher models is sufficiently accurate and unbiased.
What would settle it
Running the trained critic on a held-out set of RAG traces where the teacher LLM labels are independently shown to be wrong, then measuring whether answer quality still rises over baselines.
Figures
read the original abstract
Retrieval-augmented generation (RAG) improves knowledge-intensive question answering by incorporating external evidence. However, existing RAG methods still suffer from hallucinations and subtle reasoning errors. Recent studies introduce external critics to refine RAG outputs, yet they often provide coarse-grained and weakly structured feedback, exhibit over-aggressive intervention, and lead to noisy and unreliable refinement, limiting their effectiveness for correction. To tackle these issues, we propose CRITIC-R1, a structured critic framework that formulates and learns RAG critique as an explicit error diagnosis problem using reinforcement learning (RL). Our framework categorizes common RAG errors into multiple diagnostic dimensions, including verdict, error location, reasoning analysis, and fix generation. To learn these capabilities, we design two reward functions: Conservative Judgement Alignment (CJA) first encourages calibrated high-level judgements while mitigating the over-aggressive phenomenon, whereas Diagnostic Quality Alignment (DQA) further improves fine-grained diagnostic feedback through gated rewards. We train the critic model using GRPO-based RL with process-level supervision collected from external LLM teacher models. Experiments across five QA benchmarks show that CRITIC-R1 consistently improves answer quality over strong RAG baselines. Our source code is available at https://anonymous.4open.science/r/critic-r1-FCB0
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes CRITIC-R1, a structured critic framework for RAG that formulates error diagnosis as an explicit multi-dimensional problem (verdict, error location, reasoning analysis, fix generation) and trains the critic via GRPO reinforcement learning. Process-level supervision is collected from external LLM teachers to define two reward functions: Conservative Judgement Alignment (CJA) for calibrated high-level judgments and Diagnostic Quality Alignment (DQA) for gated fine-grained feedback. Experiments across five QA benchmarks are reported to show consistent gains over strong RAG baselines, with source code released.
Significance. If the central claims hold after addressing validation gaps, the work could meaningfully advance RAG refinement by moving beyond coarse external critics to learned, structured diagnosis that mitigates over-aggressive intervention. Explicit credit is due for releasing source code at the provided anonymous repository, which supports reproducibility.
major comments (2)
- [Training procedure and reward definitions] The training procedure relies entirely on process-level labels (verdict, error location, reasoning analysis, fix generation) collected from external LLM teachers to define both CJA and DQA rewards, yet no human validation, inter-teacher agreement metrics, or systematic error analysis of the teacher outputs is described. This is load-bearing for the claim that performance gains arise from improved diagnosis rather than teacher mimicry, as any systematic teacher bias would be internalized by the GRPO-trained critic.
- [Experiments section] The abstract states that experiments across five QA benchmarks show consistent improvements, but provides no information on baselines, metrics, ablation controls for the diagnostic dimensions, or statistical significance. Without these, it is impossible to assess whether the structured critic contributes beyond what the teacher labels already encode.
minor comments (1)
- The repository link is given as anonymous; consider replacing it with a permanent, non-anonymous URL in the camera-ready version.
Simulated Author's Rebuttal
We thank the referee for the thoughtful and constructive comments. We address each major comment below and describe the revisions we will make to strengthen the manuscript.
read point-by-point responses
-
Referee: [Training procedure and reward definitions] The training procedure relies entirely on process-level labels (verdict, error location, reasoning analysis, fix generation) collected from external LLM teachers to define both CJA and DQA rewards, yet no human validation, inter-teacher agreement metrics, or systematic error analysis of the teacher outputs is described. This is load-bearing for the claim that performance gains arise from improved diagnosis rather than teacher mimicry, as any systematic teacher bias would be internalized by the GRPO-trained critic.
Authors: We agree that the absence of human validation and agreement metrics for the teacher-generated labels is a limitation that weakens the claim distinguishing learned diagnosis from mimicry. In the revised manuscript we will add a new subsection under Section 3.3 that reports (i) human evaluation of a random sample of 200 teacher labels across the four diagnostic dimensions, (ii) inter-annotator agreement (Cohen’s kappa) between two human raters, and (iii) a qualitative error analysis of the most frequent teacher mistakes. These additions will be used to qualify the reliability of the process-level supervision. revision: yes
-
Referee: [Experiments section] The abstract states that experiments across five QA benchmarks show consistent improvements, but provides no information on baselines, metrics, ablation controls for the diagnostic dimensions, or statistical significance. Without these, it is impossible to assess whether the structured critic contributes beyond what the teacher labels already encode.
Authors: The abstract is intentionally brief; the full experimental details appear in Section 4. Nevertheless, we acknowledge that the current presentation does not sufficiently highlight the requested elements. In revision we will (i) expand the abstract with one additional sentence naming the five benchmarks, the primary metric (exact match), and the main baselines, (ii) add a dedicated ablation table isolating each diagnostic dimension, and (iii) report statistical significance (paired t-tests with p-values) for all main results. These changes will make the contribution of the structured critic clearer. revision: yes
Circularity Check
No significant circularity in derivation chain
full rationale
The paper presents an empirical method: it defines diagnostic dimensions and two custom reward functions (CJA, DQA) that operate on process-level labels collected from external LLM teachers, then trains via GRPO RL and reports benchmark gains. No equations, predictions, or first-principles claims are shown to reduce by construction to the inputs; the central result is an observed performance delta rather than a tautological renaming or self-referential fit. No self-citations appear as load-bearing premises. The approach is self-contained as a supervised RL pipeline whose validity rests on external validation of the teacher labels and the reported experiments, not on internal definitional equivalence.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
InProceedings of the 2023 Con- ference on Empirical Methods in Natural Language Processing, pages 6465–6488
Enabling large language models to generate text with citations. InProceedings of the 2023 Con- ference on Empirical Methods in Natural Language Processing, pages 6465–6488. Zhibin Gou, Zhihong Shao, Yeyun Gong, Yelong Shen, Yujiu Yang, Nan Duan, and Weizhu Chen
2023
-
[2]
CRITIC: Large Language Models Can Self-Correct with Tool-Interactive Critiquing
Critic: Large language models can self-correct with tool-interactive critiquing.arXiv preprint arXiv:2305.11738. Lei Huang, Weijiang Yu, Weitao Ma, Weihong Zhong, Zhangyin Feng, Haotian Wang, Qianglong Chen, Weihua Peng, Xiaocheng Feng, Bing Qin, and 1 oth- ers. 2025. A survey on hallucination in large lan- guage models: Principles, taxonomy, challenges, ...
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[3]
Rag-star: Enhancing deliberative reasoning with retrieval augmented verification and refinement. InProceedings of the 2025 Conference of the Na- tions of the Americas Chapter of the Association for Computational Linguistics, pages 7064–7074. Shuguang Jiao, Chengkai Huang, Shuhan Qi, Xuan Wang, Yifan Li, and Lina Yao. 2026. Doctor-rag: Failure-aware repair...
-
[4]
Ra-dit: Retrieval-augmented dual instruction tuning.arXiv preprint arXiv:2310.01352. 9 Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, and 1 others. 2023. Self-refine: Iterative refinement with self-feedback.Advances in neural information processing systems, 36:4...
-
[5]
Corrective Retrieval Augmented Generation
Corrective retrieval augmented generation. arXiv preprint arXiv:2401.15884. Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William Cohen, Ruslan Salakhutdinov, and Christo- pher D Manning. 2018. Hotpotqa: A dataset for diverse, explainable multi-hop question answering. InProceedings of the 2018 conference on empiri- cal methods in natural language p...
work page internal anchor Pith review Pith/arXiv arXiv 2018
-
[6]
search again
Self-contrast: Better reflection through incon- sistent solving perspectives. InProceedings of the 62nd Annual Meeting of the Association for Compu- tational Linguistics (Volume 1: Long Papers), pages 3602–3622. Association for Computational Linguis- tics. Yujia Zhou, Zheng Liu, Jiajie Jin, Jian-Yun Nie, and Zhicheng Dou. 2024. Metacognitive retrieval- au...
2024
-
[7]
A previous trajectory from an earlier attempt
-
[8]
Important rules: - The previous trajectory may contain mistakes
An external critique of that previous trajectory. Important rules: - The previous trajectory may contain mistakes. - The previous final answer may be wrong. - The external critique may also be wrong. - Do NOT blindly trust the previous trajectory. - Do NOT blindly trust the critique. - Use the critique only as a hint about possible problems to check. - Re...
2020
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.