pith. machine review for the scientific record.

arxiv: 2604.21923 · v1 · submitted 2026-04-23 · 💻 cs.LG · math.ST · stat.ML · stat.TH

Recognition: unknown

The Sample Complexity of Multicalibration

Authors on Pith: no claims yet

Pith reviewed 2026-05-09 23:02 UTC · model grok-4.3

classification 💻 cs.LG · math.ST · stat.ML · stat.TH

keywords multicalibration · sample complexity · expected calibration error · batch learning · online-to-batch · elicitable properties · fairness

The pith

Multicalibration to error ε requires roughly ε^{-3} samples when the number of groups is polynomial in 1/ε.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper determines the minimax number of i.i.d. samples needed to output a predictor whose population multicalibration error, measured by expected calibration error across a given group family G, is at most ε. For any fixed κ > 0, when |G| is at most ε^{-κ}, both a matching upper bound via online-to-batch conversion and a lower bound that holds even for randomized predictors establish that the complexity is Θ̃(ε^{-3}). This rate exceeds the Θ̃(ε^{-2}) needed for marginal calibration alone and remains unchanged from the online setting, unlike marginal calibration which improves in the batch regime. When κ = 0 and the number of groups is constant, the complexity drops sharply back to Θ̃(ε^{-2}). The analysis extends to weighted L_p multicalibration metrics for p between 1 and 2 with optimal exponent 3/p and to other elicitable properties such as expectiles and bounded-density quantiles.
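To make the quantity being bounded concrete, the following is a minimal empirical estimator of the group-wise binned ECE that the paper's error metric is based on. It is an illustrative sketch, not the paper's definition: the function name, the binning scheme, and the choice of reporting the worst group are assumptions for the example.

```python
import numpy as np

def multicalibration_ece(pred, y, groups, n_bins=10):
    """Empirical multicalibration error: the worst, over groups in the
    family G, of the binned expected calibration error restricted to
    that group.

    pred   : array of predictions in [0, 1]
    y      : array of binary outcomes in {0, 1}
    groups : list of boolean masks, one per group in G
    """
    pred = np.asarray(pred, dtype=float)
    y = np.asarray(y, dtype=float)
    # assign each prediction to one of n_bins uniform bins over [0, 1]
    bins = np.minimum((pred * n_bins).astype(int), n_bins - 1)
    worst = 0.0
    for g in groups:
        g = np.asarray(g, dtype=bool)
        n_g = g.sum()
        if n_g == 0:
            continue
        err = 0.0
        for b in range(n_bins):
            idx = g & (bins == b)
            if idx.any():
                # |mean residual in (bin, group)|, weighted by the
                # bin's share of the group's mass
                err += abs((y[idx] - pred[idx]).mean()) * idx.sum() / n_g
        worst = max(worst, err)
    return worst
```

A predictor that is correct on average within every bin of every group scores zero; a constant-0 prediction on all-1 outcomes scores 1.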

Core claim

For every fixed κ > 0 and any group family G with |G| ≤ ε^{-κ}, the sample complexity of multicalibration to ECE ε is Θ̃(ε^{-3}) in the batch setting. The upper bound is realized by a randomized predictor obtained from an online-to-batch reduction, while the lower bound holds even for randomized predictors. Matching upper and lower bounds, up to polylog factors, also hold for weighted L_p multicalibration with exponent 3/p, and the lower-bound template extends to a regular class of elicitable properties to obtain matching bounds for expectiles and bounded-density quantiles.

What carries the argument

The online-to-batch reduction that turns an online multicalibration algorithm into a batch predictor with the claimed sample complexity, together with a new lower-bound template for regular elicitable properties that produces the matching hardness result.

Load-bearing premise

The family of groups is fixed in advance and its size is bounded by a polynomial in 1/ε.

What would settle it

An explicit algorithm that achieves multicalibration error ε with o(ε^{-3}) samples on every distribution when |G| = ε^{-1}, or a concrete distribution and group family of size ε^{-1} on which every predictor requires at least ε^{-3} samples.

Figures

Figures reproduced from arXiv: 2604.21923 by Aaron Roth, Georgy Noarov, Jiuyao Lu, Natalie Collina.

Figure 1
Figure 1. A schematic staircase map. view at source ↗
Original abstract

We study the minimax sample complexity of multicalibration in the batch setting. A learner observes $n$ i.i.d. samples from an unknown distribution and must output a (possibly randomized) predictor whose population multicalibration error, measured by Expected Calibration Error (ECE), is at most $\varepsilon$ with respect to a given family of groups. For every fixed $\kappa > 0$, in the regime $|G|\le \varepsilon^{-\kappa}$, we prove that $\widetilde{\Theta}(\varepsilon^{-3})$ samples are necessary and sufficient, up to polylogarithmic factors. The lower bound holds even for randomized predictors, and the upper bound is realized by a randomized predictor obtained via an online-to-batch reduction. This separates the sample complexity of multicalibration from that of marginal calibration, which scales as $\widetilde{\Theta}(\varepsilon^{-2})$, and shows that mean-ECE multicalibration is as difficult in the batch setting as it is in the online setting, in contrast to marginal calibration which is strictly more difficult in the online setting. In contrast we observe that for $\kappa = 0$, the sample complexity of multicalibration remains $\widetilde{\Theta}(\varepsilon^{-2})$ exhibiting a sharp threshold phenomenon. More generally, we establish matching upper and lower bounds, up to polylogarithmic factors, for a weighted $L_p$ multicalibration metric for all $1 \le p \le 2$, with optimal exponent $3/p$. We also extend the lower-bound template to a regular class of elicitable properties, and combine it with the online upper bounds of Hu et al. (2025) to obtain matching bounds for calibrating properties including expectiles and bounded-density quantiles.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 3 minor

Summary. The paper studies the minimax sample complexity of achieving ε-multicalibration (via ECE) w.r.t. a fixed group family G in the batch setting with i.i.d. samples. For any fixed κ>0 and |G|≤ε^{-κ}, it proves matching upper and lower bounds of Θ̃(ε^{-3}) (up to polylog factors), with the upper bound via online-to-batch reduction yielding a randomized predictor and the lower bound holding even for randomized predictors. It also gives matching bounds for weighted L_p multicalibration (optimal exponent 3/p for 1≤p≤2) and extends the lower-bound template to a regular class of elicitable properties, recovering matching rates for expectiles and bounded-density quantiles. For κ=0 the rate drops to Θ̃(ε^{-2}), matching marginal calibration and exhibiting a sharp threshold.

Significance. If the claimed matching bounds hold, the result is significant: it sharply separates multicalibration sample complexity from marginal calibration, demonstrates equivalence between batch and online settings for multicalibration (in contrast to marginal calibration), and introduces a reusable lower-bound template for elicitable properties. The phase transition at κ=0 and the handling of randomized predictors are notable. The paper supplies matching upper/lower bounds, an online-to-batch reduction, and extensions beyond the core claim; these are concrete strengths.

minor comments (3)
  1. [Abstract] Abstract: the parenthetical reference to Hu et al. (2025) for the online upper bounds should be expanded to a one-sentence comparison of the online rate to the new batch rate, to make the claimed separation immediate for readers.
  2. [Lower bound section] The lower-bound construction (presumably in the section containing the minimax argument) should explicitly state whether the groups in G are required to be measurable with respect to the sigma-algebra generated by the predictor or are allowed to be arbitrary; this affects the scope of the Θ̃(ε^{-3}) claim.
  3. [Preliminaries or §2] Notation: the weighted L_p multicalibration metric is introduced without an equation number in the abstract; adding a displayed definition early in the paper would improve readability when the 3/p exponent is later invoked.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive and accurate summary of our paper, as well as the recommendation for minor revision. The referee's description correctly reflects our main results on the minimax sample complexity of multicalibration, the phase transition at κ=0, the online-to-batch reduction, and the extensions to weighted L_p metrics and elicitable properties. Since the report lists no specific major comments, we have no points requiring point-by-point rebuttal or changes at this stage.

Circularity Check

0 steps flagged

No significant circularity

full rationale

The derivation establishes matching upper and lower bounds via an online-to-batch reduction built on independent prior work (Hu et al. 2025) for the upper bound and a direct minimax argument for the lower bound. No load-bearing step reduces by construction to a fitted input, self-definition, or self-citation chain; the claimed Θ̃(ε^{-3}) rate (for |G| ≤ ε^{-κ}) is proven from first principles within the stated assumptions on fixed G. The paper is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the standard i.i.d. batch assumption, the existence of online multicalibration algorithms from prior work, and the definition of ECE multicalibration error; no free parameters or new entities are introduced.

axioms (2)
  • domain assumption Samples are drawn i.i.d. from an unknown distribution
    Standard assumption for the batch minimax setting stated in the abstract.
  • domain assumption Online multicalibration algorithms exist with known regret bounds
    Invoked for the upper bound via online-to-batch conversion.

pith-pipeline@v0.9.0 · 5624 in / 1290 out tokens · 34069 ms · 2026-05-09T23:02:56.011976+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read Pith papers without signing in.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Instance-Adaptive Online Multicalibration

    cs.LG 2026-05 conditional novelty 8.0

    A single online multicalibration algorithm adaptively refines a dyadic grid and achieves instance-dependent rates: O(T^{2/3}) worst-case, O(sqrt T) for marginal stochastic data, and O(sqrt(JT)) for J-piecewise station...

Reference graph

Works this paper leans on

15 extracted references · 5 canonical work pages · cited by 1 Pith paper · 2 internal anchors

  1. [1]

    Collaborative prediction: Tractable information aggregation via agreement

[CGHG+26] Natalie Collina, Ira Globus-Harris, Surbhi Goel, Varun Gupta, Aaron Roth, and Mirah Shi. Collaborative prediction: Tractable information aggregation via agreement. In Proceedings of the 2026 Annual ACM-SIAM Symposium on Discrete Algorithms (SODA). SIAM,

  2. [2]

    Optimal Lower Bounds for Online Multicalibration

[CLNR26] Natalie Collina, Jiuyao Lu, Georgy Noarov, and Aaron Roth. Optimal lower bounds for online multicalibration. arXiv preprint arXiv:2601.05245,

  3. [3]

Breaking the T^{2/3} barrier for sequential calibration

[DDF+25] Yuval Dagan, Constantinos Daskalakis, Maxwell Fishelson, Noah Golowich, Robert Kleinberg, and Princewill Okoroafor. Breaking the T^{2/3} barrier for sequential calibration. In Proceedings of the 57th Annual ACM Symposium on Theory of Computing, pages 2007–2018. ACM,

  4. [4]

Happymap: A generalized multicalibration method

[DDZ23] Zhun Deng, Cynthia Dwork, and Linjun Zhang. Happymap: A generalized multicalibration method. In 14th Innovations in Theoretical Computer Science Conference (ITCS 2023), pages 41:1–41:23. Schloss Dagstuhl–Leibniz-Zentrum für Informatik,

  5. [5]

    Supersimulators

[DT25] Cynthia Dwork and Pranay Tankala. Supersimulators. arXiv preprint arXiv:2509.17994,

  6. [6]

    Loss minimization through the lens of outcome indistinguishability

[GHK+23] Parikshit Gopalan, Lunjia Hu, Michael P. Kim, Omer Reingold, and Udi Wieder. Loss minimization through the lens of outcome indistinguishability. In 14th Innovations in Theoretical Computer Science Conference (ITCS 2023),

  7. [7]

On computationally efficient multi-class calibration

[GHR24] Parikshit Gopalan, Lunjia Hu, and Guy N. Rothblum. On computationally efficient multi-class calibration. In Proceedings of the Thirty Seventh Conference on Learning Theory, pages 1983–2026. PMLR,

  8. [8]

Online multivalid learning: Means, moments, and prediction intervals

[GJN+22] Varun Gupta, Christopher Jung, Georgy Noarov, Mallesh M. Pai, and Aaron Roth. Online multivalid learning: Means, moments, and prediction intervals. In 13th Innovations in Theoretical Computer Science Conference (ITCS 2022), pages 82:1–82:24. Schloss Dagstuhl – Leibniz-Zentrum für Informatik,

  9. [9]

Oracle efficient online multicalibration and omniprediction

[GJRR24] Sumegha Garg, Christopher Jung, Omer Reingold, and Aaron Roth. Oracle efficient online multicalibration and omniprediction. In Proceedings of the 2024 Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 2725–2792. Society for Industrial and Applied Mathematics,

  10. [10]

    Omnipredictors

[GKR+22] Parikshit Gopalan, Adam Tauman Kalai, Omer Reingold, Vatsal Sharan, and Udi Wieder. Omnipredictors. In 13th Innovations in Theoretical Computer Science Conference (ITCS 2022), pages 79:1–79:21. Schloss Dagstuhl – Leibniz-Zentrum für Informatik,

  11. [11]

    Omnipredictors for regression and the approximate rank of convex functions

[GOR+24] Parikshit Gopalan, Princewill Okoroafor, Prasad Raghavendra, Abhishek Sherry, and Mihir Singhal. Omnipredictors for regression and the approximate rank of convex functions. In The Thirty Seventh Annual Conference on Learning Theory, pages 2027–

  12. [12]

Sample-efficient omniprediction for proper losses

[GT25] Isaac Gibbs and Ryan J. Tibshirani. Sample-efficient omniprediction for proper losses. arXiv preprint arXiv:2510.12769,

  13. [13]

Multicalibration: Calibration for the (computationally-identifiable) masses

[HJKRR18] Ursula Hebert-Johnson, Michael Kim, Omer Reingold, and Guy Rothblum. Multicalibration: Calibration for the (computationally-identifiable) masses. In Proceedings of the 35th International Conference on Machine Learning, pages 1939–1948. PMLR,

  14. [14]

Efficient swap multicalibration of elicitable properties

[HLSS25] Lunjia Hu, Haipeng Luo, Spandan Senapati, and Vatsal Sharan. Efficient swap multicalibration of elicitable properties. arXiv preprint arXiv:2511.04907,

  15. [15]

An exploration of multicalibration uniform convergence bounds

[RBFJ22] Harrison Rosenberg, Robi Bhattacharjee, Kassem Fawaz, and Somesh Jha. An exploration of multicalibration uniform convergence bounds. arXiv preprint arXiv:2202.04530,