Calibrated Preference Learning: The Case of Label Ranking

Eyke Hüllermeier; Santo M. A. R. Thies; Sebastian J. Vollmer; Timo Kaufmann; Viktor Bengs

arXiv: 2605.30447 · v1 · pith:ATZK5E7Gnew · submitted 2026-05-28 · 💻 cs.LG · cs.AI · stat.ML

Santo M. A. R. Thies, Viktor Bengs, Timo Kaufmann, Sebastian J. Vollmer, Eyke Hüllermeier

Pith reviewed 2026-06-29 08:51 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · stat.ML

keywords label ranking · calibration · probabilistic ranking · preference learning · top-k ranking · RLHF · reward models

The pith

Calibration for label ranking is formalized as a hierarchy where full ranking calibration implies sub-ranking and top-k versions but not the reverse.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper defines what it means for a probabilistic label ranking model to be calibrated, meaning its predicted distribution over possible orderings matches the true frequencies of those orderings. It creates a hierarchy of calibration definitions that apply to complete rankings, partial sub-rankings, and top-k selections, and proves the implication relations among them. This structure matters because simply treating rankings as ordinary classes loses the information in pairwise comparisons and top choices that decision systems actually use. Experiments show that standard label ranking methods are typically miscalibrated under these measures, with noticeable gaps between the different levels of the hierarchy. When applied to reward models from RLHF, calibration tracks benchmark accuracy closely but not perfectly, indicating it measures an independent aspect of model quality.
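To make the idea of predicted probabilities matching observed frequencies concrete, here is a minimal sketch of a binned expected calibration error (ECE) for one target ranking. It is an illustration under stated assumptions, not the paper's implementation: the function name `rankwise_ece`, the equal-width binning, and the default of ten bins are all hypothetical choices.

```python
from typing import Dict, List, Sequence, Tuple

Ranking = Tuple[int, ...]  # a ranking is a tuple of label indices, best first


def rankwise_ece(
    preds: Sequence[Dict[Ranking, float]],
    outcomes: Sequence[Ranking],
    target: Ranking,
    n_bins: int = 10,
) -> float:
    """Binned ECE for the event 'the observed ranking equals `target`'.

    preds[i] maps rankings to predicted probabilities for instance i;
    outcomes[i] is the ranking actually observed for instance i.
    (Hypothetical helper; equal-width bins are an assumption.)
    """
    n = len(outcomes)
    if n == 0:
        return 0.0

    # Each bin collects (predicted probability, 0/1 hit) pairs.
    bins: List[List[Tuple[float, float]]] = [[] for _ in range(n_bins)]
    for dist, observed in zip(preds, outcomes):
        p = dist.get(target, 0.0)
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes to the last bin
        bins[idx].append((p, 1.0 if observed == target else 0.0))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(p for p, _ in members) / len(members)
        freq = sum(hit for _, hit in members) / len(members)
        # Weight the confidence/frequency gap by the share of instances in the bin.
        ece += (len(members) / n) * abs(freq - avg_conf)
    return ece
```

A value of zero means that, within every bin, the average predicted probability of the target ranking equals how often that ranking was actually observed. Averaging this quantity over several target rankings would give one aggregate score, though how the paper aggregates is not specified here.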

Core claim

We formalize calibration for label ranking and develop a hierarchy of notions covering full rankings, sub-rankings, and top-k rankings. We prove that full-rank calibration implies the others but not conversely, and sub-ranking and top-k calibration are incomparable. Empirically, popular label ranking models are often poorly calibrated, with substantial differences between sub-ranking and top-k metrics. Applying the framework to RLHF reward models shows that calibration correlates strongly but not perfectly with benchmark accuracy.

What carries the argument

Hierarchy of calibration notions for distributions over rankings, with full-rank, sub-ranking, and top-k levels and proved implication relations.

If this is right

Full-rank calibration is sufficient to guarantee both sub-ranking and top-k calibration.
Sub-ranking calibration does not guarantee top-k calibration and vice versa.
Common label ranking algorithms exhibit poor calibration under these definitions.
In RLHF reward models, calibration provides information about quality that is distinct from top-1 accuracy.
Miscalibration can differ in degree across the hierarchy levels for the same model.
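The first implication above rests on the fact that sub-ranking and top-k distributions can be obtained from a full-ranking distribution by marginalization. The following is a minimal sketch of that step. It assumes that a sub-ranking is the relative order of a chosen subset of labels and that a top-k ranking is the ordered first k positions; both readings, and the helper names, are assumptions rather than definitions taken from the paper.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

Ranking = Tuple[int, ...]  # label indices, best first


def top_k_marginal(dist: Dict[Ranking, float], k: int) -> Dict[Ranking, float]:
    """Sum full-ranking probabilities that share the same ordered top-k prefix."""
    out: Dict[Ranking, float] = defaultdict(float)
    for ranking, p in dist.items():
        out[ranking[:k]] += p
    return dict(out)


def sub_ranking_marginal(
    dist: Dict[Ranking, float], items: Iterable[int]
) -> Dict[Ranking, float]:
    """Sum full-ranking probabilities that induce the same order on `items`."""
    keep = set(items)
    out: Dict[Ranking, float] = defaultdict(float)
    for ranking, p in dist.items():
        # Restrict the full ranking to the chosen labels, preserving their order.
        induced = tuple(label for label in ranking if label in keep)
        out[induced] += p
    return dict(out)


# Hypothetical predicted distribution over full rankings of three labels.
example = {(0, 1, 2): 0.5, (1, 0, 2): 0.3, (2, 1, 0): 0.2}
top1 = top_k_marginal(example, 1)
pair_01 = sub_ranking_marginal(example, [0, 1])
```

The two marginals discard different information: the top-1 marginal ignores everything below the first position, while the pairwise marginal for labels 0 and 1 ignores where label 2 sits. This is only an intuition for why neither weaker notion needs to determine the other; the incomparability itself is established by the paper's counterexamples.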

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Downstream tasks that rely on pairwise comparisons may still suffer from errors even when top-k calibration holds.
Calibration correction techniques developed for classification could be adapted to ranking distributions.
Evaluation benchmarks for preference learning might usefully add calibration checks at multiple hierarchy levels.
Models trained only to maximize accuracy could be improved by adding explicit calibration objectives at the sub-ranking level.

Load-bearing premise

The defined calibration notions for full, sub, and top-k rankings correctly capture the structure that matters for reliable decisions based on predicted orderings.

What would settle it

A dataset or model where a predictor meets full-rank calibration yet produces miscalibrated pairwise probabilities or top-k selections on held-out data.

Figures

Figures reproduced from arXiv: 2605.30447 by Eyke Hüllermeier, Santo M. A. R. Thies, Sebastian J. Vollmer, Timo Kaufmann, Viktor Bengs.

**Figure 1.** Overview of the calibration definitions and their relationships (an outgoing arrow means implication, whereas dashed arrows hold only for specific model classes). Exclusion results are depicted in …

**Figure 2.** Rankwise sub-2, sub-3, top-2, and top-3 ECE on the “political” and “movies” datasets, considering only the 95% most occurring rankings.

**Figure 3.** Top-1 ECE results and their correlation with RewardBench2 categories.

**Figure 4.** Overview of the calibration definitions and their exclusiveness relationships.

**Figure 5.** Comparison of using | · | in contrast to Jeffrey Divergence for calculating sub-k calibration via ECE. Rank correlation of the methods is reported only if the values using | · | are significantly different.

**Figure 6.** Calibration of label ranking models for the “movies” dataset, considering only the rankings covering 95% of the probability mass.

**Figure 7.** Calibration of label ranking models for the “iris” dataset, considering only the rankings covering 95% of the probability mass.

**Figure 8.** Calibration of label ranking models for the “authorship” dataset, considering only the rankings covering 95% of the probability mass.

**Figure 9.** Calibration of label ranking models for the “political” dataset, considering only the rankings covering 95% of the probability mass.

**Figure 10.** Calibration of label ranking models for the “vehicle” dataset, considering only the rankings covering 95% of the probability mass.

**Figure 11.** Calibration of label ranking models for the “segment” dataset, considering only the rankings covering 95% of the probability mass.

**Figure 12.** Calibration of label ranking models for the “vowel” dataset, considering only the rankings covering 95% of the probability mass.

**Figure 13.** Calibration of label ranking models for the “wine” dataset, considering only the rankings covering 95% of the probability mass.

**Figure 14.** Calibration of label ranking models for the “yeast” dataset, considering only the rankings covering 95% of the probability mass.

**Figure 15.** Calibration of label ranking models for the “glass” dataset, considering only the rankings covering 95% of the probability mass.

Original abstract

Calibration, the alignment of predicted probabilities with true outcome frequencies, is essential for reliable decision-making. While extensively studied for classification and regression, calibration has not been formally addressed for probabilistic label ranking, where the goal is to predict a distribution over orderings of a label set. Naively treating rankings as classes ignores their structure and fails to capture important modalities such as pairwise and top-k predictions. We formalize calibration for label ranking and develop a hierarchy of notions covering full rankings, sub-rankings, and top-k rankings. We prove that full-rank calibration implies the others but not conversely, and sub-ranking and top-k calibration are incomparable. Empirically, we find popular label ranking models are often poorly calibrated, with substantial differences between sub-ranking and top-k metrics. Applying our framework to RLHF reward models, we find that calibration correlates strongly but not perfectly with benchmark accuracy, suggesting it captures a meaningful quality dimension beyond top-1 accuracy. These findings motivate future work on understanding the downstream effects of miscalibration and developing methods to correct it.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper formalizes calibration notions for label ranking with a clear implication hierarchy and checks existing models for miscalibration.

The main takeaway is a formal hierarchy of calibration for probabilistic label ranking: full-rank, sub-ranking, and top-k versions, with proofs that full implies the others but not conversely, and that the two weaker ones are incomparable.

The definitions treat rankings as structured rather than flat classes, which avoids the naive approach the abstract criticizes. The proofs appear to follow from marginalization and counterexamples once the notions are set up, so the relations hold by design. The RLHF application is a timely angle, showing calibration tracks accuracy but not perfectly.

The empirical claims that popular models are poorly calibrated, and that the sub-ranking and top-k metrics differ, are stated directly. Without numbers or setup details in the abstract, it is hard to gauge effect sizes, but the direction aligns with known calibration issues in other structured outputs.

A soft spot is whether these particular notions match the actual needs in downstream ranking decisions; the paper assumes they do without much extra justification. The work stays within standard calibration ideas applied to a new output type, with no obvious circularity.

This is for people working on label ranking, preference models, or calibration in structured prediction. A reader evaluating or improving probabilistic rankers would get a usable framework and some evidence that miscalibration is common.

It deserves a serious referee because the formal extension is new and the RLHF link is relevant. Send it out.

Referee Report

0 major / 1 minor

Summary. The manuscript claims to formalize calibration for probabilistic label ranking by developing a hierarchy of notions covering full rankings, sub-rankings, and top-k rankings. It proves that full-rank calibration implies the others but not conversely, and that sub-ranking and top-k calibration are incomparable. Empirically, popular label ranking models are often poorly calibrated with substantial differences between sub-ranking and top-k metrics; applying the framework to RLHF reward models shows calibration correlates strongly but not perfectly with benchmark accuracy.

Significance. If the formalization and proofs hold, the work supplies a structured hierarchy for assessing calibration in label ranking, addressing a gap beyond standard classification. The explicit proofs of the implication and incomparability results, together with the empirical demonstration of miscalibration in existing models and the partial correlation with accuracy in RLHF, constitute a meaningful contribution that could motivate downstream calibration methods for ranking tasks.

minor comments (1)

The abstract asserts the existence of proofs and empirical results without supplying derivations, data details, or verification steps; expanding the abstract or adding a short methods overview would improve accessibility while preserving the high-level claim.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive summary of our contribution and for recommending minor revision. No specific major comments appear in the report.

Circularity Check

0 steps flagged

No significant circularity

The paper's core contribution consists of new formal definitions for calibration notions (full-rank, sub-ranking, top-k) in label ranking, followed by standard mathematical proofs establishing implication and incomparability relations via marginalization and counterexamples. These steps are self-contained constructions that do not reduce any claimed result to a fitted parameter, self-citation chain, or input by definition; the definitions are introduced explicitly to enable the stated hierarchy rather than presupposing it. Empirical sections evaluate existing models against the new metrics but do not feed back into the theoretical claims. No load-bearing self-citation or ansatz smuggling is present in the derivation.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract provides no information on free parameters, axioms, or invented entities, so all three ledger counts are zero for lack of detail.

pith-pipeline@v0.9.1-grok · 5732 in / 1110 out tokens · 34548 ms · 2026-06-29T08:51:25.783360+00:00 · methodology

