Calibrating the Evaluator: Does Probability Calibration Mitigate Preference Coupling in LLM Agent Feedback Loops?

Zewen Liu

arxiv: 2606.31371 · v1 · pith:MEZRKB2Anew · submitted 2026-06-30 · 💻 cs.LG · cs.AI · cs.CL

Pith reviewed 2026-07-01 06:01 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.CL

keywords LLM agents · preference coupling · probability calibration · evaluator bias · feedback loops · TTRL · LLM-as-judge

The pith

Probability calibration on an LLM evaluator reduces preference coupling in agent feedback loops by 20-49 percent.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether applying probability calibration to an evaluator's pairwise judgments can lessen how much those judgments bias the strategy an LLM agent learns through feedback. In a within-subjects experiment with five runs, standard binary win/loss updates are compared against calibrated probability-weighted updates using one model as executor and another as evaluator. The calibrated approach lowers the measured coupling coefficient and the divergence between resulting strategy distributions. A separate control with symmetric learning rates shows the reduction is not explained by changes in update balance alone. This positions calibration as a direct adjustment to the feedback signal itself.

Core claim

Applying probability calibration to the evaluator's pairwise judgments reduces the coupling coefficient gamma by 20-49% and Jensen-Shannon divergence by 45-67% compared with standard binary TTRL. The within-subjects design with DeepSeek-V4-Pro as executor and GLM5.2 as evaluator, together with a symmetric-LR control, attributes the reduction to calibration rather than reduced update asymmetry. The study presents the calibrated TTRL protocol as a lightweight mitigation for preference coupling in LLM agent feedback loops.
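The summary does not spell out the update rule, so the following is only a minimal sketch of the contrast between a binary win/loss update and a probability-weighted one. It assumes a softmax policy over a small set of strategies and a simple additive logit update. The names `update_logits`, `lr`, and the signal mapping `2 * p_win - 1` are illustrative assumptions, not the paper's protocol.

```python
import math

def softmax(logits):
    """Convert logits to a probability distribution (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def update_logits(logits, winner, loser, p_win=1.0, lr=0.5):
    """Shift logit mass from `loser` to `winner` (hypothetical update rule).

    p_win is the evaluator's probability that `winner` is truly better.
    p_win=1.0 reproduces a binary win/loss update; p_win=0.5 leaves the
    policy unchanged.
    """
    signal = 2.0 * p_win - 1.0  # maps p_win in [0.5, 1] to [0, 1]
    new = list(logits)
    new[winner] += lr * signal
    new[loser] -= lr * signal
    return new

logits = [0.0, 0.0, 0.0]
binary = update_logits(logits, winner=0, loser=1)                # hard win/loss
weighted = update_logits(logits, winner=0, loser=1, p_win=0.6)   # weak preference
print(softmax(binary))
print(softmax(weighted))
```

Under this assumed rule, a weakly held judgment moves the policy less than a confident one, which is the intuition for why probability-weighted updates might propagate less of the evaluator's preference.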

What carries the argument

Probability calibration applied to the evaluator's pairwise judgments, producing probability-weighted updates instead of binary win/loss signals.
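The summary does not say which calibration method the paper uses. As one hypothetical instance, temperature scaling rescales the evaluator's raw logit for "A beats B" by a scalar T fitted on held-out pairs with known outcomes. The data, the grid search, and the function names below are all assumptions made for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def nll(temperature, logits, labels):
    """Mean negative log-likelihood of binary labels under sigmoid(logit / T)."""
    eps = 1e-12
    total = 0.0
    for z, y in zip(logits, labels):
        p = sigmoid(z / temperature)
        total -= y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps)
    return total / len(logits)

def fit_temperature(logits, labels):
    """Grid-search T over 0.1 .. 10.0; a stand-in for a proper optimizer."""
    grid = [0.1 * k for k in range(1, 101)]
    return min(grid, key=lambda t: nll(t, logits, labels))

# Hypothetical held-out judgments: confident raw logits, only 5 of 8 correct.
raw_logits = [4.0, 3.5, -4.2, 3.8, -3.9, 4.1, 3.7, -4.0]
labels = [1, 0, 0, 1, 1, 1, 0, 0]

T = fit_temperature(raw_logits, labels)
calibrated = [sigmoid(z / T) for z in raw_logits]
print(T, calibrated)
```

Because this toy evaluator is overconfident, the fitted temperature exceeds 1 and every calibrated probability moves toward 0.5. Platt scaling or isotonic regression would be equally plausible choices.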

If this is right

The coupling coefficient gamma decreases by 20-49% when probability calibration replaces binary judgments.
Jensen-Shannon divergence between agent strategy distributions decreases by 45-67% under the same change.
The reduction in coupling persists after applying a symmetric learning rate control.
The calibrated TTRL protocol can be released and used as a lightweight adjustment in LLM-as-judge pipelines.
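For readers unfamiliar with the divergence metric, here is a minimal sketch of Jensen-Shannon divergence between two discrete strategy distributions, plus the percent-reduction arithmetic behind figures such as "45-67%". The distributions are made-up placeholders, and comparing each learned distribution against a uniform reference is an assumption; the summary does not state which pair of distributions the paper compares.

```python
import math

def kl(p, q):
    """KL(p || q) in bits; terms with p_i == 0 contribute zero."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jsd(p, q):
    """Jensen-Shannon divergence in bits, bounded in [0, 1]."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def percent_reduction(baseline, treated):
    """Relative reduction of `treated` versus `baseline`, in percent."""
    return 100.0 * (baseline - treated) / baseline

# Placeholder strategy distributions over four strategies (not the paper's data).
reference = [0.25, 0.25, 0.25, 0.25]
after_binary = [0.70, 0.10, 0.10, 0.10]
after_calibrated = [0.40, 0.20, 0.20, 0.20]

jsd_binary = jsd(reference, after_binary)
jsd_calibrated = jsd(reference, after_calibrated)
print(jsd_binary, jsd_calibrated, percent_reduction(jsd_binary, jsd_calibrated))
```

The base-2 logarithm keeps the divergence in [0, 1], which makes percent reductions easy to compare across runs.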

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the reduction holds for additional model pairs, calibration could be added as a default preprocessing step for any LLM judge.
The same calibration step might be tested on multi-turn or multi-agent feedback loops where coupling could accumulate over iterations.
Measuring whether lower coupling also improves final task performance would link the distribution metric to downstream outcomes.

Load-bearing premise

The paper assumes that the symmetric-LR control and the fixed pairing of DeepSeek-V4-Pro (executor) with GLM5.2 (evaluator) isolate the effect of probability calibration from model-specific biases and from the task distribution.

What would settle it

Repeating the within-subjects comparison on a new pair of models or task set and observing no reduction in gamma or Jensen-Shannon divergence would falsify the mitigation claim.

Original abstract

When large language model (LLM) agents adapt their behavior through evaluator feedback, systematic evaluator biases propagate into the agent's learned strategy distribution - a phenomenon termed evaluator preference coupling. Prior work has documented this coupling and established a diagnostic framework (EPC) to measure it, but has not investigated whether calibration techniques can mitigate the effect. We present the first study of evaluator calibration as mitigation: applying probability calibration to the evaluator's pairwise judgments to reduce spurious preference propagation. In a controlled within-subjects experiment (N=5) comparing standard binary TTRL (win/loss) with confidence-calibrated TTRL (probability-weighted updates) using DeepSeek-V4-Pro as executor and GLM5.2 as evaluator, we find that calibration reduces the coupling coefficient gamma by 20-49% and Jensen-Shannon divergence by 45-67%. A symmetric-LR control confirms the effect is not due to reduced update asymmetry. We release the calibrated TTRL protocol and recommend it as a lightweight mitigation for LLM-as-judge deployment pipelines.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

private letter to a colleague

Calibration cuts the reported coupling metrics in an N=5 within-subjects test, but the absence of variance estimates or tests leaves the mitigation claim preliminary.

The paper's main contribution is testing probability calibration as a fix for evaluator preference coupling in LLM agent loops. Earlier work had only measured the problem; this one runs the first mitigation experiment by switching from binary win/loss updates to probability-weighted ones.

The setup uses a within-subjects design with DeepSeek-V4-Pro as the executor and GLM5.2 as the evaluator, plus a symmetric-LR control to rule out simple asymmetry effects. They report reductions of 20-49% in the coupling coefficient gamma and 45-67% in Jensen-Shannon divergence across the five paired runs, and they release the protocol.

The clear limitation is the scale. Five runs without reported standard deviations, confidence intervals, or hypothesis tests means the effect sizes could easily reflect task or model idiosyncrasies rather than a stable calibration benefit. Only two models are tested, so generalizability is unproven. The stress-test concern about sampling noise holds up on the information given.

This is aimed at people already running self-feedback loops in coding or decision agents who want a low-cost tweak. A reader looking for a practical starting point on the mitigation side could use the released protocol as a baseline.

I would send it for peer review. The question is worth asking and the control is sensible, but the current numbers need more replication and statistical grounding before the result can be treated as reliable.

Referee Report

2 major / 1 minor

Summary. The paper claims that applying probability calibration to an LLM evaluator's pairwise judgments in a TTRL feedback loop reduces evaluator preference coupling, as measured by the coupling coefficient gamma (20-49% reduction) and Jensen-Shannon divergence (45-67% reduction). This is demonstrated in a within-subjects experiment (N=5) comparing standard binary TTRL against confidence-calibrated TTRL, using DeepSeek-V4-Pro as executor and GLM5.2 as evaluator, with a symmetric-LR control to rule out update asymmetry.

Significance. If the quantitative mitigation effect is robustly confirmed, the work offers a lightweight, practical intervention for reducing bias propagation in LLM-as-judge pipelines, extending the EPC diagnostic framework with an actionable calibration protocol. The release of the calibrated TTRL protocol is a positive contribution for reproducibility.

major comments (2)

[Abstract / Experimental Results] Abstract and Experimental Results section: The central quantitative claims rest on effect-size ranges (gamma reduced 20-49%, JSD 45-67%) from only N=5 within-subjects paired trials, yet no per-condition standard deviations, confidence intervals, bootstrap estimates, or hypothesis tests are reported. This directly undermines the reliability of the mitigation percentages as evidence of a stable calibration effect rather than sampling variability or task-specific noise.
[Experimental Setup] Experimental design description: The symmetric-LR control and fixed choice of two models (DeepSeek-V4-Pro executor, GLM5.2 evaluator) are presented as isolating the calibration effect, but with N=5 and no cross-model or cross-task replication, it remains unclear whether the observed reductions generalize beyond these specific model idiosyncrasies and task distribution.
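To make the sample-size concern concrete, consider an exact paired sign-flip (randomization) test, one of several tests the authors could report. With five paired runs there are only 2^5 = 32 sign assignments, so even when all five differences share a sign the two-sided p-value cannot fall below 2/32 = 0.0625. The differences below are placeholders, not the paper's data, and `sign_flip_p_value` is a hypothetical helper.

```python
from itertools import product

def sign_flip_p_value(diffs):
    """Exact two-sided paired sign-flip test on the sum of differences."""
    observed = abs(sum(diffs))
    count = 0
    total = 0
    for signs in product((1, -1), repeat=len(diffs)):
        total += 1
        flipped = abs(sum(s * d for s, d in zip(signs, diffs)))
        if flipped >= observed - 1e-12:  # small tolerance for float round-off
            count += 1
    return count / total

# Placeholder per-run reductions in gamma (binary minus calibrated); NOT real data.
diffs = [0.12, 0.08, 0.15, 0.10, 0.09]
print(sign_flip_p_value(diffs))  # → 0.0625, the floor for N=5
```

A unanimous result across five runs therefore still misses the conventional 0.05 threshold under this test, which is why additional runs would strengthen the claim.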

minor comments (1)

[Abstract] The abstract states 'we release the calibrated TTRL protocol' but does not specify the repository URL or license in the provided text; this should be added for immediate usability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below, acknowledging limitations where appropriate while defending the controlled nature of the study.

Referee: [Abstract / Experimental Results] Abstract and Experimental Results section: The central quantitative claims rest on effect-size ranges (gamma reduced 20-49%, JSD 45-67%) from only N=5 within-subjects paired trials, yet no per-condition standard deviations, confidence intervals, bootstrap estimates, or hypothesis tests are reported. This directly undermines the reliability of the mitigation percentages as evidence of a stable calibration effect rather than sampling variability or task-specific noise.

Authors: We agree that the reported ranges would benefit from explicit measures of uncertainty. The within-subjects paired design was selected to control for task and model variance across the N=5 trials, but we acknowledge the small sample precludes strong claims of stability. In the revision we will add bootstrap confidence intervals and per-condition standard deviations computed from the raw trial data to quantify variability. (Revision: yes)

Referee: [Experimental Setup] Experimental design description: The symmetric-LR control and fixed choice of two models (DeepSeek-V4-Pro executor, GLM5.2 evaluator) are presented as isolating the calibration effect, but with N=5 and no cross-model or cross-task replication, it remains unclear whether the observed reductions generalize beyond these specific model idiosyncrasies and task distribution.

Authors: The symmetric-LR control was introduced precisely to isolate calibration from update asymmetry, and the model pair was fixed to enable a clean within-subjects comparison. We do not claim the reductions hold universally; the experiment demonstrates the mitigation effect under these controlled conditions. We will expand the limitations and future-work sections to note the absence of cross-model or cross-task replication. (Revision: partial)
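The uncertainty estimates promised in the first response could be computed along the following lines. This is a sketch under assumptions: it uses a percentile bootstrap on the mean of paired differences, the data are placeholders rather than the paper's trials, and `bootstrap_ci` is a hypothetical helper. With only five pairs, percentile bootstrap intervals tend to be too narrow, so the result should be read as a rough indication.

```python
import random
import statistics

def bootstrap_ci(diffs, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean paired difference."""
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    n = len(diffs)
    means = sorted(
        statistics.fmean(rng.choices(diffs, k=n)) for _ in range(n_resamples)
    )
    low = means[int((alpha / 2) * n_resamples)]
    high = means[int((1 - alpha / 2) * n_resamples) - 1]
    return low, high

# Placeholder per-run reductions in gamma (binary minus calibrated); NOT real data.
diffs = [0.12, 0.08, 0.15, 0.10, 0.09]

print("mean:", statistics.fmean(diffs))
print("std dev:", statistics.stdev(diffs))
print("95% CI:", bootstrap_ci(diffs))
```

Reporting the per-run differences themselves alongside these summaries would let readers judge the spread directly.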

Circularity Check

0 steps flagged

Empirical experiment reports measured reductions with no derivation chain

The paper describes a within-subjects experiment (N=5) that directly measures the effect of probability calibration on coupling coefficient gamma and Jensen-Shannon divergence. No equations, fitted parameters, or self-citations are used to derive the reported percentage reductions; the values are presented as experimental outcomes. The work is therefore self-contained and contains no load-bearing steps that reduce to their own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the validity of the EPC diagnostic from prior work, the assumption that probability calibration can be applied to pairwise LLM judgments without introducing new artifacts, and the representativeness of the two models and task distribution used in the N=5 experiment.

axioms (2)

domain assumption: The EPC diagnostic framework from prior work accurately quantifies preference coupling in LLM feedback loops. The paper uses the EPC metric to measure the mitigation effect without re-deriving or validating it in this study.
domain assumption: Pairwise judgments from the evaluator model can be meaningfully calibrated to probabilities that reduce spurious preference propagation. This is the core premise of the mitigation technique tested in the experiment.

Reference graph

Works this paper leans on

17 extracted references · 7 canonical work pages · 4 internal anchors

[1] Z. Liu. A Diagnostic Framework and Multi-Evaluator Audit of Evaluator-Driven Preference Dynamics. TMLR submission, 2026.
[2] Z. Liu. Contagion Networks: Evaluator Bias Propagation in Multi-Agent LLM Systems. arXiv:2606.20493, 2026.
[3] Z. Liu. Multimodal Evaluator Preference Collapse. arXiv:2606.16682, 2026.
[4] Y. Li. Who Drifted: the System or the Judge? arXiv:2606.15474, 2026.
[5] L. Zheng, W.-L. Chiang, Y. Sheng, et al. Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. NeurIPS, 2023.
[6] W.-L. Chiang, L. Zheng, et al. Chatbot Arena. ICML, 2024.
[7] C. Guo, G. Pleiss, Y. Sun, and K. Q. Weinberger. On Calibration of Modern Neural Networks. ICML, 2017.
[8] A. Niculescu-Mizil and R. Caruana. Predicting Good Probabilities with Supervised Learning. ICML, 2005.
[9] H. Boström. Calibrating Random Forests. ICMLA, 2008.
[10] L. Grinsztajn, E. Oyallon, and G. Varoquaux. Why do tree-based models still outperform deep learning on tabular data? NeurIPS, 2022.
[11] Z. Li, X. Li, C. Huang, G. Li, et al. Judging with Confidence: Calibrating Autoraters to Preference Distributions. arXiv:2510.00263, 2025.
[12] J. Leng, C. Huang, B. Zhu, and J. Huang. Taming Overconfidence in LLMs: Reward Calibration in RLHF. ICLR, 2025. arXiv:2410.09724.
[13] D. Singha. UARD: Uncertainty-Aware Reward Discounting for Mitigating Reward Hacking. arXiv:2604.26360, 2026.
[14] S. Devarakonda, J. Huang, and P. Liang. Confidence-Gated RAG for Adaptive Retrieval in Sequential Agents. ICLR, 2026.
[15] A. Balashankar, S. Chen, and J. Yao. InfAlign: Inference-Aware Language Model Alignment. NeurIPS, 2025.
[16] Z. Zuo, Y. Wang, and J. Li. TTRL-CoCoV: Test-Time Reinforcement Learning with Confidence Conditioned Verification. arXiv, 2026.
[17] Y. Wang, X. Zhang, and H. Chen. SCOPE: Beyond Majority Voting - Step-wise Confidence Weighting for Test-Time RL. arXiv:2512.15146, 2026.
