The Sample Complexity of Multicalibration
Pith reviewed 2026-05-09 23:02 UTC · model grok-4.3
The pith
Multicalibration to error ε requires roughly ε^{-3} samples when the number of groups is polynomial in 1/ε.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
For every fixed κ > 0 and any group family G with |G| ≤ ε^{-κ}, the sample complexity of multicalibration to ECE ε is Θ̃(ε^{-3}) in the batch setting. The upper bound is realized by a randomized predictor obtained from an online-to-batch reduction, while the lower bound holds even for randomized predictors. Matching upper and lower bounds, up to polylog factors, also hold for weighted L_p multicalibration with exponent 3/p, and the lower-bound template extends to a regular class of elicitable properties to obtain matching bounds for expectiles and bounded-density quantiles.
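The multicalibration error controlled by these bounds can be illustrated with a short sketch: take the worst group-conditional ECE over the family G. This is an illustrative binning and weighting choice, not necessarily the paper's exact metric:

```python
from collections import defaultdict

def multicalibration_ece(preds, labels, groups, n_bins=10):
    """Worst group-conditional expected calibration error (illustrative).

    preds  -- predictions in [0, 1]
    labels -- binary outcomes (0 or 1)
    groups -- list of index sets, one per group in the family G
    """
    worst = 0.0
    for g in groups:
        if not g:
            continue
        # Bucket the group's examples by discretized prediction value.
        buckets = defaultdict(list)
        for i in g:
            b = min(int(preds[i] * n_bins), n_bins - 1)
            buckets[b].append(i)
        # Group-conditional ECE: bin-mass-weighted calibration gaps.
        err = 0.0
        for idxs in buckets.values():
            gap = sum(labels[i] - preds[i] for i in idxs) / len(idxs)
            err += abs(gap) * len(idxs) / len(g)
        worst = max(worst, err)
    return worst
```

A perfectly calibrated predictor scores 0 on every group; a constant 0.5 prediction on all-zero labels scores 0.5.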
What carries the argument
The online-to-batch reduction that turns an online multicalibration algorithm into a batch predictor with the claimed sample complexity, together with a new lower-bound template for regular elicitable properties that produces the matching hardness result.
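The generic shape of such a reduction can be sketched as follows. This is a hedged sketch, not the paper's algorithm: `online_update` is a hypothetical interface that consumes one sample and returns the learner's current predictor plus its new state, and the batch output randomizes uniformly over the collected iterates.

```python
import random

def online_to_batch(online_update, samples, init_state):
    """Generic online-to-batch reduction (sketch).

    Feed the i.i.d. samples to an online learner one at a time,
    collect the predictor it holds after each update, and output a
    randomized predictor that draws one iterate uniformly at random.
    Low online regret then translates into low population error in
    expectation over the draw.
    """
    predictors = []
    state = init_state
    for x, y in samples:
        predictor, state = online_update(state, x, y)
        predictors.append(predictor)

    def randomized_predictor(x):
        # A fresh uniform draw per call; equivalently, sample one
        # iterate up front and use it for all future predictions.
        return random.choice(predictors)(x)

    return randomized_predictor
```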
Load-bearing premise
The family of groups is fixed in advance and its size is bounded by a polynomial in 1/ε.
What would settle it
An explicit algorithm that achieves multicalibration error ε with o(ε^{-3}) samples on every distribution when |G| = ε^{-1}, or a concrete distribution and group family of size ε^{-1} on which every predictor requires at least ε^{-3} samples.
Figures
Original abstract
We study the minimax sample complexity of multicalibration in the batch setting. A learner observes $n$ i.i.d. samples from an unknown distribution and must output a (possibly randomized) predictor whose population multicalibration error, measured by Expected Calibration Error (ECE), is at most $\varepsilon$ with respect to a given family of groups. For every fixed $\kappa > 0$, in the regime $|G|\le \varepsilon^{-\kappa}$, we prove that $\widetilde{\Theta}(\varepsilon^{-3})$ samples are necessary and sufficient, up to polylogarithmic factors. The lower bound holds even for randomized predictors, and the upper bound is realized by a randomized predictor obtained via an online-to-batch reduction. This separates the sample complexity of multicalibration from that of marginal calibration, which scales as $\widetilde{\Theta}(\varepsilon^{-2})$, and shows that mean-ECE multicalibration is as difficult in the batch setting as it is in the online setting, in contrast to marginal calibration which is strictly more difficult in the online setting. In contrast we observe that for $\kappa = 0$, the sample complexity of multicalibration remains $\widetilde{\Theta}(\varepsilon^{-2})$ exhibiting a sharp threshold phenomenon. More generally, we establish matching upper and lower bounds, up to polylogarithmic factors, for a weighted $L_p$ multicalibration metric for all $1 \le p \le 2$, with optimal exponent $3/p$. We also extend the lower-bound template to a regular class of elicitable properties, and combine it with the online upper bounds of Hu et al. (2025) to obtain matching bounds for calibrating properties including expectiles and bounded-density quantiles.
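Restating the abstract's rates, the threshold phenomenon can be summarized compactly (here $n(\varepsilon, G)$ denotes the minimax number of i.i.d. samples):

```latex
n(\varepsilon, G) =
\begin{cases}
\widetilde{\Theta}(\varepsilon^{-2}) & \text{if } \kappa = 0,\\[2pt]
\widetilde{\Theta}(\varepsilon^{-3}) & \text{if } |G| \le \varepsilon^{-\kappa} \text{ for any fixed } \kappa > 0.
\end{cases}
```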
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper studies the minimax sample complexity of achieving ε-multicalibration (via ECE) w.r.t. a fixed group family G in the batch setting with i.i.d. samples. For any fixed κ>0 and |G|≤ε^{-κ}, it proves matching upper and lower bounds of Θ̃(ε^{-3}) (up to polylog factors), with the upper bound via online-to-batch reduction yielding a randomized predictor and the lower bound holding even for randomized predictors. It also gives matching bounds for weighted L_p multicalibration (optimal exponent 3/p for 1≤p≤2) and extends the lower-bound template to a regular class of elicitable properties, recovering matching rates for expectiles and bounded-density quantiles. For κ=0 the rate drops to Θ̃(ε^{-2}), matching marginal calibration and exhibiting a sharp threshold.
Significance. If the claimed matching bounds hold, the result is significant: it sharply separates multicalibration sample complexity from marginal calibration, demonstrates equivalence between batch and online settings for multicalibration (in contrast to marginal calibration), and introduces a reusable lower-bound template for elicitable properties. The phase transition at κ=0 and the handling of randomized predictors are notable. The paper supplies matching upper/lower bounds, an online-to-batch reduction, and extensions beyond the core claim; these are concrete strengths.
minor comments (3)
- [Abstract] The parenthetical reference to Hu et al. (2025) for the online upper bounds should be expanded to a one-sentence comparison of the online rate to the new batch rate, to make the claimed separation immediate for readers.
- [Lower bound section] The lower-bound construction (presumably in the section containing the minimax argument) should explicitly state whether the groups in G are required to be measurable with respect to the sigma-algebra generated by the predictor or are allowed to be arbitrary; this affects the scope of the Θ̃(ε^{-3}) claim.
- [Preliminaries or §2] Notation: the weighted L_p multicalibration metric is introduced without an equation number in the abstract; adding a displayed definition early in the paper would improve readability when the 3/p exponent is later invoked.
Simulated Author's Rebuttal
We thank the referee for the positive and accurate summary of our paper, as well as the recommendation for minor revision. The referee's description correctly reflects our main results on the minimax sample complexity of multicalibration, the phase transition at κ=0, the online-to-batch reduction, and the extensions to weighted L_p metrics and elicitable properties. Since the report lists no specific major comments, we have no points requiring point-by-point rebuttal or changes at this stage.
Circularity Check
No significant circularity
full rationale
The derivation establishes matching upper and lower bounds via an online-to-batch reduction to independent prior work (Hu et al. 2025) for the upper bound and a direct minimax argument for the lower bound. No load-bearing step reduces by construction to a fitted input, self-definition, or self-citation chain; the claimed Θ̃(ε^{-3}) rate (for |G| ≤ ε^{-κ}) is proven from first principles within the stated assumptions on fixed G. The paper is self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption Samples are drawn i.i.d. from an unknown distribution
- domain assumption Online multicalibration algorithms exist with known regret bounds
Forward citations
Cited by 1 Pith paper
- Instance-Adaptive Online Multicalibration: A single online multicalibration algorithm adaptively refines a dyadic grid and achieves instance-dependent rates: O(T^{2/3}) worst-case, O(sqrt T) for marginal stochastic data, and O(sqrt(JT)) for J-piecewise station...
Reference graph
Works this paper leans on
- [1] [CGHG+26] Natalie Collina, Ira Globus-Harris, Surbhi Goel, Varun Gupta, Aaron Roth, and Mirah Shi. Collaborative prediction: Tractable information aggregation via agreement. In Proceedings of the 2026 Annual ACM-SIAM Symposium on Discrete Algorithms (SODA). SIAM, 2026.
- [2] [CLNR26] Natalie Collina, Jiuyao Lu, Georgy Noarov, and Aaron Roth. Optimal lower bounds for online multicalibration. arXiv preprint arXiv:2601.05245, 2026.
- [3] [DDF+25] Yuval Dagan, Constantinos Daskalakis, Maxwell Fishelson, Noah Golowich, Robert Kleinberg, and Princewill Okoroafor. Breaking the T^{2/3} barrier for sequential calibration. In Proceedings of the 57th Annual ACM Symposium on Theory of Computing, pages 2007–2018. ACM, 2025.
- [4] [DDZ23] Zhun Deng, Cynthia Dwork, and Linjun Zhang. HappyMap: A generalized multicalibration method. In 14th Innovations in Theoretical Computer Science Conference (ITCS 2023), pages 41:1–41:23. Schloss Dagstuhl–Leibniz-Zentrum für Informatik, 2023.
- [5] [DT25] Cynthia Dwork and Pranay Tankala. Supersimulators. arXiv preprint arXiv:2509.17994, 2025.
- [6] [GHK+23] Parikshit Gopalan, Lunjia Hu, Michael P. Kim, Omer Reingold, and Udi Wieder. Loss minimization through the lens of outcome indistinguishability. In 14th Innovations in Theoretical Computer Science Conference (ITCS 2023), 2023.
- [7] [GHR24] Parikshit Gopalan, Lunjia Hu, and Guy N. Rothblum. On computationally efficient multi-class calibration. In Proceedings of the Thirty Seventh Conference on Learning Theory, pages 1983–2026. PMLR, 2024.
- [8] [GJN+22] Varun Gupta, Christopher Jung, Georgy Noarov, Mallesh M. Pai, and Aaron Roth. Online multivalid learning: Means, moments, and prediction intervals. In 13th Innovations in Theoretical Computer Science Conference (ITCS 2022), pages 82:1–82:24. Schloss Dagstuhl–Leibniz-Zentrum für Informatik, 2022.
- [9] [GJRR24] Sumegha Garg, Christopher Jung, Omer Reingold, and Aaron Roth. Oracle efficient online multicalibration and omniprediction. In Proceedings of the 2024 Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 2725–2792. Society for Industrial and Applied Mathematics, 2024.
- [10] [GKR+22] Parikshit Gopalan, Adam Tauman Kalai, Omer Reingold, Vatsal Sharan, and Udi Wieder. Omnipredictors. In 13th Innovations in Theoretical Computer Science Conference (ITCS 2022), pages 79:1–79:21. Schloss Dagstuhl–Leibniz-Zentrum für Informatik, 2022.
- [11] [GOR+24] Parikshit Gopalan, Princewill Okoroafor, Prasad Raghavendra, Abhishek Sherry, and Mihir Singhal. Omnipredictors for regression and the approximate rank of convex functions. In The Thirty Seventh Annual Conference on Learning Theory, pages 2027–, 2024.
- [12] [GT25] Isaac Gibbs and Ryan J. Tibshirani. Sample-efficient omniprediction for proper losses. arXiv preprint arXiv:2510.12769, 2025.
- [13] [HJKRR18] Ursula Hebert-Johnson, Michael Kim, Omer Reingold, and Guy Rothblum. Multicalibration: Calibration for the (computationally-identifiable) masses. In Proceedings of the 35th International Conference on Machine Learning, pages 1939–1948. PMLR, 2018.
- [14] [HLSS25] Lunjia Hu, Haipeng Luo, Spandan Senapati, and Vatsal Sharan. Efficient swap multicalibration of elicitable properties. arXiv preprint arXiv:2511.04907, 2025.
- [15] [RBFJ22] Harrison Rosenberg, Robi Bhattacharjee, Kassem Fawaz, and Somesh Jha. An exploration of multicalibration uniform convergence bounds. arXiv preprint arXiv:2202.04530, 2022.