RTLC -- Research, Teach-to-Learn, Critique: A three-stage prompting paradigm inspired by the Feynman Learning Technique that lifts LLM-as-judge accuracy on JudgeBench with no fine-tuning
Pith reviewed 2026-05-14 19:33 UTC · model grok-4.3
The pith
A three-stage prompting method lifts LLM judge accuracy from 64.6% to 78.6% on hard pairwise comparisons.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
RTLC is a three-stage prompting paradigm that applies the Feynman Learning Technique to LLM judging: it first wraps the input in a pedagogical scaffold for research and teaching, then samples ten candidate verdicts, and finally has the model critique the candidates and select one final verdict, reaching 78.6% pairwise accuracy on JudgeBench-GPT versus 64.6% with vanilla prompting.
What carries the argument
The RTLC three-stage prompting recipe that ports the Feynman Learning Technique into a fixed scaffold for generating and self-critiquing multiple verdicts.
If this is right
- RTLC achieves higher accuracy than sampling ten responses and taking a majority vote (the self-consistency baseline sketched after this list).
- The largest gain comes from the Teach-to-Learn stage.
- Performance improves across knowledge, reasoning, math, and coding evaluation categories.
- RTLC can be combined with score calibration to multiply the accuracy gains.
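To make the self-consistency comparison concrete, here is a minimal sketch of the N=10 majority-vote baseline that RTLC is claimed to beat. The "A"/"B" verdict format is an assumption, since the paper's output format is not published.

from collections import Counter

def majority_vote(verdicts: list[str]) -> str:
    # Self-consistency baseline: sample N independent verdicts from the
    # judge and return the most frequent one.
    return Counter(verdicts).most_common(1)[0][0]

# Ten hypothetical pairwise verdicts ("A" = response A wins, "B" = B wins).
verdicts = ["A", "A", "B", "A", "A", "B", "A", "A", "A", "B"]
print(majority_vote(verdicts))  # -> "A"

RTLC replaces this final vote with an explicit critique pass over the same candidate set, which the ablation credits with the last +0.9 pp.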
Where Pith is reading between the lines
- Structured prompting like this may allow smaller models to serve as effective judges in place of larger ones.
- The approach could be tested on other evaluation benchmarks to see if similar gains appear.
- By reducing reliance on human raters, RTLC points toward more scalable AI development cycles.
Load-bearing premise
That the reported gains are specifically due to the RTLC stages and not simply from increased token usage or the particular benchmark setup.
What would settle it
Replicating the JudgeBench evaluation with the same model but using a different prompting method that consumes the same number of tokens and finding comparable accuracy to RTLC.
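A minimal sketch of how such a token-matched control could be scored, assuming per-item logs with method name, tokens consumed, and verdict correctness (all field names and numbers here are hypothetical):

from statistics import mean

# Hypothetical per-item logs: one record per (method, item) pair.
logs = [
    {"method": "rtlc", "tokens": 9200, "correct": True},
    {"method": "token_matched_baseline", "tokens": 9150, "correct": False},
    # ... one pair of records per JudgeBench-GPT item
]

def summarise(method: str) -> tuple[float, float]:
    # Mean accuracy and mean token budget for one method.
    rows = [r for r in logs if r["method"] == method]
    return mean(r["correct"] for r in rows), mean(r["tokens"] for r in rows)

for m in ("rtlc", "token_matched_baseline"):
    acc, toks = summarise(m)
    print(f"{m}: accuracy={acc:.3f}, mean tokens={toks:.0f}")

Comparable accuracy at a matched token budget would suggest the gain is bought by extra compute rather than by the Feynman-style scaffold.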
Original abstract
LLM-as-a-judge is now the default measurement instrument for open-ended generation, but on the public JudgeBench benchmark even strong instruction-tuned judges barely scrape past random on objective-correctness pairwise items. We introduce RTLC, a three-stage prompting recipe -- Research, Teach-to-Learn, Critique -- that promotes a single black-box LLM into an ensemble-of-thought judge with no fine-tuning, retrieval, or external tools. Stage 1 wraps the input in a fixed pedagogical scaffold porting the Feynman Learning Technique (study → teach → find gaps → simplify) into LLM prompting. Stage 2 draws N=10 independent candidate verdicts at temperature 0.4. Stage 3 acts as its own critic, cross-comparing the candidate set against the original question to emit one critiqued verdict at temperature 0. On JudgeBench-GPT (350 hard pairwise items), Claude 3.7 Sonnet's pairwise accuracy climbs from 64.6% (single-shot vanilla prompt) to 78.6% (RTLC critique-of-10) -- an absolute 14.0-percentage-point gain. RTLC also beats N=10 self-consistency majority voting (77.7%) and a zero-shot first candidate (74.0%). A clean three-step ablation attributes +9.4 pp to the Teach-to-Learn scaffold, +3.7 pp to N=10 marginalisation, and +0.9 pp to explicit critique. We discuss the cost-accuracy frontier (RTLC sits above self-consistency at every working point), the error-budget breakdown across the four JudgeBench categories (knowledge, reasoning, math, coding), and how RTLC composes orthogonally with post-hoc judge-score calibration, with the two interventions compounding multiplicatively in practice.
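The abstract gives enough structure to sketch the pipeline, though not the prompts. Below is a minimal structural sketch under that limitation: call_llm stands in for any chat-completion client, and the scaffold wording is invented, since the paper does not publish its templates (the referee's main complaint below).

# Structural sketch of the three RTLC stages as the abstract describes them.

def call_llm(prompt: str, temperature: float) -> str:
    raise NotImplementedError  # plug in an actual chat-completion client

# Invented wording; only the study -> teach -> find gaps -> simplify shape
# is taken from the abstract.
FEYNMAN_SCAFFOLD = (
    "Study the question and both responses. Teach the material back as if "
    "to a novice, identify the gaps in each response, then simplify.\n"
    "Question: {question}\nResponse A: {a}\nResponse B: {b}\n"
    "Verdict (A or B):"
)

def rtlc_judge(question: str, a: str, b: str, n: int = 10) -> str:
    # Stages 1 + 2: wrap the input in the pedagogical scaffold and draw
    # n independent candidate verdicts at temperature 0.4.
    prompt = FEYNMAN_SCAFFOLD.format(question=question, a=a, b=b)
    candidates = [call_llm(prompt, temperature=0.4) for _ in range(n)]

    # Stage 3: the model cross-compares the candidate set against the
    # original question and emits one critiqued verdict at temperature 0.
    critique_prompt = (
        prompt
        + "\n\nCandidate verdicts:\n"
        + "\n".join(f"{i + 1}. {c}" for i, c in enumerate(candidates))
        + "\nCritique these candidates and output one final verdict (A or B):"
    )
    return call_llm(critique_prompt, temperature=0.0)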
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces RTLC, a three-stage prompting paradigm (Research, Teach-to-Learn, Critique) inspired by the Feynman Learning Technique to improve LLM-as-judge accuracy on JudgeBench without fine-tuning, retrieval, or tools. It reports that Claude 3.7 Sonnet's pairwise accuracy on 350 hard JudgeBench-GPT items rises from 64.6% (vanilla single-shot) to 78.6% (RTLC with N=10 critique), outperforming N=10 self-consistency (77.7%) and zero-shot (74.0%), with a three-step ablation attributing +9.4 pp to the Teach-to-Learn scaffold, +3.7 pp to N=10 marginalization, and +0.9 pp to explicit critique.
Significance. If the accuracy gains prove reproducible and driven by the specific RTLC structure, the work offers a practical, training-free method to boost LLM judging performance on objective-correctness tasks, with added value from the ablation breakdown, cost-accuracy frontier analysis, and discussion of orthogonal composition with post-hoc calibration. The empirical focus on a public benchmark and stage-wise contributions strengthens its utility for evaluation pipelines in generative NLP.
major comments (2)
- [Method section (stage descriptions)] The manuscript supplies only high-level summaries of the three RTLC stages (fixed pedagogical scaffold, N=10 candidates at temperature 0.4, self-critique at temperature 0) without verbatim prompt templates, system instructions, or formatting details. This prevents isolating whether the 14.0 pp gain (64.6% to 78.6%) arises from the intended Feynman-inspired mechanism or from unstated factors such as increased token budget or prompt wording.
- [Results (JudgeBench-GPT evaluation)] The reported accuracies and ablation deltas lack error bars, statistical significance tests, or details on run-to-run variance, which is required to substantiate that the gains over self-consistency (77.7%) are reliable rather than benchmark-specific or model-sensitive artifacts.
minor comments (2)
- [Abstract and §4] The abstract and results refer to 'JudgeBench-GPT (350 hard pairwise items)' without clarifying its exact relation to the public JudgeBench split or selection criteria; adding this detail would aid reproducibility.
- [Appendix] Consider moving the full prompt templates and any additional error-budget tables to an appendix to support the central reproducibility claim.
Simulated Author's Rebuttal
We thank the referee for the thoughtful and constructive comments. We address each major point below and commit to revisions that strengthen reproducibility and statistical reporting while preserving the core contributions of the work.
Point-by-point responses
Referee: Method section (stage descriptions): The manuscript supplies only high-level summaries of the three RTLC stages (fixed pedagogical scaffold, N=10 candidates at temperature 0.4, self-critique at temperature 0) without verbatim prompt templates, system instructions, or formatting details. This prevents isolating whether the 14.0 pp gain (64.6% to 78.6%) arises from the intended Feynman-inspired mechanism or from unstated factors such as increased token budget or prompt wording.
Authors: We agree that the absence of verbatim prompts limits the ability to attribute gains precisely to the RTLC structure. In the revised manuscript we will add a new appendix containing the complete prompt templates for all three stages, including system instructions, formatting constraints, and temperature settings. This will allow readers to replicate the exact experimental conditions and better isolate the contribution of the pedagogical scaffold versus token budget or wording effects. revision: yes
Referee: Results (JudgeBench-GPT evaluation): The reported accuracies and ablation deltas lack error bars, statistical significance tests, or details on run-to-run variance, which is required to substantiate that the gains over self-consistency (77.7%) are reliable rather than benchmark-specific or model-sensitive artifacts.
Authors: We acknowledge that reporting variance and significance tests would strengthen the claims. Because each RTLC configuration incurs substantial API cost, we conducted single runs per setting. In revision we will (i) add a limitations paragraph explaining the single-run design, (ii) report McNemar’s test p-values for the key pairwise comparisons against self-consistency and zero-shot baselines, and (iii) include a small-scale multi-run experiment (N=3) on a 100-item subset to provide indicative error bars. Full multi-run variance on the entire 350-item set remains computationally prohibitive without additional resources. revision: partial
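For reference, the promised McNemar's test needs only per-item paired correctness. A minimal sketch using statsmodels follows; the 2x2 counts are assumptions chosen so the marginals reproduce the reported 275/350 (78.6%) and 272/350 (77.7%), since the paper does not report the concordant/discordant split that actually drives the test.

from statsmodels.stats.contingency_tables import mcnemar

# Rows: RTLC correct / incorrect. Columns: self-consistency correct / wrong.
# Illustrative counts only; the marginals match the reported accuracies.
table = [[262, 13],
         [10, 65]]

result = mcnemar(table, exact=True)  # exact binomial test on discordant pairs
print(f"statistic={result.statistic}, p-value={result.pvalue:.4f}")

With only 23 discordant items in this hypothetical split, a 13-versus-10 difference would not reach significance, which is exactly why reporting the split matters.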
Circularity Check
No circularity: purely empirical prompting study with independent benchmark measurements
Full rationale
The manuscript is an empirical evaluation of a prompting recipe on JudgeBench-GPT. It reports direct accuracy measurements (64.6% baseline to 78.6% with RTLC) and ablations (+9.4 pp Teach-to-Learn, +3.7 pp N=10, +0.9 pp critique) without any equations, derivations, fitted parameters, or self-referential definitions. No load-bearing step reduces a claimed result to its own inputs by construction, and no self-citations justify uniqueness theorems or ansatzes. The evaluation is falsifiable via benchmark reproduction and can be checked against external measurements.
Axiom & Free-Parameter Ledger
free parameters (3)
- N (number of candidates) = 10
- Candidate generation temperature = 0.4
- Critique temperature = 0
axioms (1)
- Domain assumption: LLMs can follow complex multi-stage pedagogical instructions to improve judgment quality.
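The ledger fits in a single configuration object; a hypothetical sketch with the paper's reported values as defaults:

from dataclasses import dataclass

@dataclass(frozen=True)
class RTLCConfig:
    # The three free parameters the ledger identifies.
    n_candidates: int = 10              # N, candidate verdicts drawn in Stage 2
    candidate_temperature: float = 0.4  # Stage 2 sampling temperature
    critique_temperature: float = 0.0   # Stage 3 critique temperature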
Reference graph
Works this paper leans on
- [1] S. Tan, S. Zhuang, K. Montgomery, W. Y. Tang, A. Cuadron, C. Wang, R. A. Popa, and I. Stoica, "JudgeBench: A benchmark for evaluating LLM-based judges," in International Conference on Learning Representations (ICLR), 2025. arXiv:2410.12784.
- [2] A. Morandi, "Two ways to de-bias an LLM-as-a-Judge: A continuous-score comparison of hierarchical Bayesian calibration and Neural-ODE score transport," 2026. [Online]. Available: https://arxiv.org/abs/2605.09227
- [3] L. Zheng, W.-L. Chiang, Y. Sheng, S. Zhuang, Z. Wu, Y. Zhuang, Z. Lin, Z. Li, D. Li, E. P. Xing, H. Zhang, J. E. Gonzalez, and I. Stoica, "Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena," in Advances in Neural Information Processing Systems (NeurIPS), Datasets and Benchmarks Track, 2023.
- [4] G. Cui, L. Yuan, W. Zhu, Z. Liu, and M. Sun, "UltraFeedback: Boosting language models with scaled AI feedback," in Proceedings of the 41st International Conference on Machine Learning (ICML), PMLR, vol. 235, 2024, pp. 9722–9744. [Online]. Available: https://proceedings.mlr.press/v235/cui24f.html
- [5] X. Wang, J. Wei, D. Schuurmans, Q. V. Le, E. H. Chi, S. Narang, A. Chowdhery, and D. Zhou, "Self-consistency improves chain of thought reasoning in language models," in International Conference on Learning Representations (ICLR), 2023. arXiv:2203.11171.
- [6] A. Madaan, N. Tandon, P. Gupta, S. Hallinan, L. Gao, S. Wiegreffe, U. Alon, N. Dziri, S. Prabhumoye, Y. Yang, S. Welleck, B. P. Majumder, S. Gupta, A. Yazdanbakhsh, and P. Clark, "Self-Refine: Iterative refinement with self-feedback," in Advances in Neural Information Processing Systems (NeurIPS), 2023. arXiv:2303.17651.
- [7] Y. Bai, S. Kadavath, S. Kundu, A. Askell, J. Kernion, A. Jones, A. Chen, A. Goldie, A. Mirhoseini, C. McKinnon, C. Chen, C. Olsson, C. Olah, D. Hernandez, D. Drain, D. Ganguli, D. Li, E. Tran-Johnson, E. Perez, J. Kerr, J. Mueller, J. Ladish, J. Landau, K. Ndousse, K. Lukosuite, L. Lovitt, M. Sellitto, N. Elhage, N. Schiefer, N. Mercado, N. DasSarma, R. ..., "Constitutional AI: Harmlessness from AI feedback," 2022.
- [8] Y. Du, S. Li, A. Torralba, J. B. Tenenbaum, and I. Mordatch, "Improving factuality and reasoning in language models through multiagent debate," in Proceedings of the 41st International Conference on Machine Learning (ICML), PMLR, vol. 235, 2024, pp. 11733–11763.
- [9] J. Wei, X. Wang, D. Schuurmans, M. Bosma, B. Ichter, F. Xia, E. H. Chi, Q. V. Le, and D. Zhou, "Chain-of-thought prompting elicits reasoning in large language models," in Advances in Neural Information Processing Systems (NeurIPS), 2022. arXiv:2201.11903.
- [10] S. Yao, D. Yu, J. Zhao, I. Shafran, T. L. Griffiths, Y. Cao, and K. Narasimhan, "Tree of Thoughts: Deliberate problem solving with large language models," in Advances in Neural Information Processing Systems (NeurIPS), 2023. arXiv:2305.10601.