arxiv: 2604.27723 · v1 · submitted 2026-04-30 · 💻 cs.LG · stat.ML

Recognition: unknown

Optimized Deferral for Imbalanced Settings

Anqi Mao, Corinna Cortes, Mehryar Mohri, Yutao Zhong

Authors on Pith no claims yet

Pith reviewed 2026-05-07 07:38 UTC · model grok-4.3

classification 💻 cs.LG stat.ML

keywords learning to deferexpert imbalancecost-sensitive learningmargin-based lossesMILD algorithmLLM routingtwo-stage deferralimbalanced classification

0 comments

The pith

Casting deferral optimization as cost-sensitive learning over input-expert pairs yields algorithms that handle expert imbalance better.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tackles the expert imbalance problem in two-stage learning to defer, where standard methods tend to route everything to the most common expert and produce suboptimal accuracy. It reframes the entire deferral loss as a cost-sensitive classification task defined on the joint domain of inputs and experts rather than on inputs alone. From this view the authors derive fresh margin-based surrogate losses together with generalization guarantees, then build the MILD algorithm that puts these losses to work. The resulting deferral decisions improve both error rates and resource use in settings such as routing queries among several LLMs or classifiers. The work matters because many practical deferral pipelines already possess a fixed collection of experts whose usage frequencies are naturally skewed.

Core claim

We cast the deferral loss optimization as a novel cost-sensitive learning problem over the input-expert domain. We derive new margin-based loss functions and guarantees tailored to this setting, and develop novel algorithms for cost-sensitive learning. Leveraging these results, we design principled deferral algorithms, MILD (Margin-based Imbalanced Learning to Defer), specifically suited for expert imbalance settings.

What carries the argument

The cost-sensitive learning formulation over the input-expert domain, which turns deferral into a weighted classification problem whose weights encode expert imbalance and enables margin-based losses with provable guarantees.

If this is right

MILD outperforms existing deferral baselines on image classification tasks that exhibit expert imbalance.
MILD improves routing accuracy and efficiency when directing queries to collections of LLMs.
The new margin-based losses supply generalization bounds specific to the imbalanced deferral setting.
The cost-sensitive algorithms developed for the input-expert domain can be reused for other routing or selection tasks with skewed expert usage.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same input-expert cost-sensitive view could be applied to deferral pipelines that involve more than one deferral stage.
Similar cost-sensitive reductions might help balance load across heterogeneous models in distributed inference systems.
Empirical tests that vary the degree of imbalance while holding other factors fixed would clarify how MILD scales with skew severity.
The approach suggests examining whether cost-sensitive losses can also mitigate imbalance when the experts themselves are being trained rather than fixed.

Load-bearing premise

That treating deferral as cost-sensitive classification in the input-expert product space produces loss functions and algorithms whose theoretical guarantees translate into measurable gains on real imbalanced data.

What would settle it

Train MILD and standard two-stage deferral baselines on a controlled dataset with known expert imbalance, then measure whether MILD produces a statistically significant drop in overall error rate or deferral cost; failure to do so would refute the central claim.

Figures

Figures reproduced from arXiv: 2604.27723 by Anqi Mao, Corinna Cortes, Mehryar Mohri, Yutao Zhong.

**Figure 1.** Figure 1: 1-D example with 4 classes (colored densities) and 3 experts (gray lines indicating accuracies). Experts have increasing costs indicated by darker shades of gray. not class labels. A 1-D example is provided in view at source ↗

read the original abstract

Learning algorithms can be significantly improved by routing complex or uncertain inputs to specialized experts, balancing accuracy with computational cost. This approach, known as learning to defer, is essential in domains like natural language generation, medical diagnosis, and computer vision, where an effective deferral can reduce errors at low extra resource consumption. However, the two-stage learning to defer setting, which leverages existing predictors such as a collection of LLMs or other classifiers, often faces challenges due to an expert imbalance problem. This imbalance can lead to suboptimal performance, with deferral algorithms favoring the majority expert. We present a comprehensive study of two-stage learning to defer in expert imbalance settings. We cast the deferral loss optimization as a novel cost-sensitive learning problem over the input-expert domain. We derive new margin-based loss functions and guarantees tailored to this setting, and develop novel algorithms for cost-sensitive learning. Leveraging these results, we design principled deferral algorithms, MILD (Margin-based Imbalanced Learning to Defer), specifically suited for expert imbalance settings. Extensive experiments demonstrate the effectiveness of our approach, showing clear improvements over existing baselines on both image classification and real-world Large Language Model (LLM) routing tasks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

This paper recasts imbalanced deferral as cost-sensitive learning over input-expert pairs, derives margin losses for it, and introduces MILD, which shows gains on image and LLM tasks but keeps the new guarantees light on detail.

read the letter

The main thing to know about this paper is that it treats the expert imbalance problem in two-stage learning to defer by recasting the whole thing as cost-sensitive classification over the input-expert pairs. From there they derive some margin-based loss functions with associated guarantees and build the MILD algorithm on top. They test it on standard image datasets and on LLM routing tasks, claiming better performance than prior deferral methods when experts are imbalanced. What stands out is how they connect the deferral setup to existing cost-sensitive techniques in a way that directly addresses the bias toward frequent experts. That modeling step feels natural and the experiments apparently back it up with clear improvements, which is useful for anyone trying to route queries among multiple models without one dominating. The fact that they include real-world LLM examples helps ground the work. On the downside, the guarantees are asserted without much detail on the proof strategy or the exact margin conditions, so it's hard to tell if they are loose or if they require strong assumptions on the cost matrix. Cost-sensitive learning can be sensitive to how the costs are chosen, and there's no obvious discussion of robustness to cost misspecification or how to set them in practice for deferral. The experiments would be stronger with more ablations on the degree of imbalance and direct comparisons to other imbalance-handling techniques like reweighting or resampling. The formulation itself looks internally consistent and the citation pattern pulls from the right deferral and cost-sensitive literature without obvious holes. This paper is for people working on learning to defer, model routing, or cost-sensitive methods in general. It deserves a serious referee because the problem is well-motivated, the approach is coherent, and the empirical results suggest it moves the needle on a real issue. The thinking seems careful and engaged with the literature. I would recommend sending it out for peer review.

Referee Report

0 major / 4 minor

Summary. The manuscript examines two-stage learning to defer under expert imbalance, where policies tend to favor majority experts. It reformulates deferral optimization as cost-sensitive learning over the joint input-expert domain, derives new margin-based surrogate losses together with generalization guarantees, develops supporting algorithms for cost-sensitive learning, and introduces the MILD algorithm. Experiments on image classification and LLM routing tasks are reported to show improvements over existing baselines.

Significance. If the derived margin-based losses and associated guarantees are valid, the work supplies a principled, cost-sensitive treatment of expert imbalance that is directly relevant to practical deferral settings such as LLM routing. The modeling choice of operating in the input-expert space is coherent with existing cost-sensitive techniques and the empirical evaluation on both vision and language tasks provides concrete evidence of utility. The derivation of tailored losses and the focus on imbalance constitute the primary contributions.

minor comments (4)

[§3.1] §3.1, Definition 1: the cost matrix C(x, e) is introduced without an explicit statement of how the imbalance ratios are encoded; a short paragraph clarifying the mapping from observed expert frequencies to the cost entries would improve readability.
[§4.2] §4.2, Theorem 2: the generalization bound is stated in terms of the Rademacher complexity of the joint hypothesis class; it would be helpful to include a brief comparison (one sentence) to the corresponding bound for standard cost-sensitive classification to highlight the novelty of the input-expert formulation.
[Figure 3] Figure 3 and Table 2: axis labels and legend entries use inconsistent abbreviations (e.g., “MILD” vs. “Mild”); uniform notation across all figures and tables is needed.
[§5.3] §5.3: the LLM routing experiments report accuracy and deferral rate but do not include a statistical significance test across the five random seeds; adding p-values or confidence intervals would strengthen the empirical claims.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their careful reading of the manuscript and for the positive evaluation, including the accurate summary of our contributions and the recommendation for minor revision. The referee correctly identifies the core technical approach: reformulating two-stage deferral under expert imbalance as cost-sensitive learning over the input-expert domain, deriving margin-based surrogate losses with generalization guarantees, and introducing the MILD algorithm. We will incorporate any minor suggestions in the revised version.

Circularity Check

0 steps flagged

No significant circularity; derivation self-contained via standard cost-sensitive modeling

full rationale

The paper casts deferral loss optimization as a cost-sensitive learning problem over the input-expert domain, then derives new margin-based losses, guarantees, and the MILD algorithm from that formulation. This modeling step is a coherent extension of existing imbalance-handling techniques rather than a self-definitional loop or a fitted parameter renamed as a prediction. The abstract explicitly presents the losses and algorithms as derived results, with no indication that they reduce by construction to the input data or to prior self-citations that bear the central claim. Experiments on image classification and LLM routing tasks supply external validation outside the derivation. No load-bearing equation or uniqueness theorem is shown to collapse to its own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no details on parameters axioms or new entities.

pith-pipeline@v0.9.0 · 8601 in / 927 out tokens · 61683 ms · 2026-05-07T07:38:12.480714+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

143 extracted references · 16 canonical work pages · 5 internal anchors

[1]

H -consistency bounds for surrogate loss minimizers

Awasthi, P., Mao, A., Mohri, M., and Zhong, Y. H -consistency bounds for surrogate loss minimizers. In International Conference on Machine Learning, pp.\ 1117--1174, 2022 a

2022
[2]

Multi-class H -consistency bounds

Awasthi, P., Mao, A., Mohri, M., and Zhong, Y. Multi-class H -consistency bounds. In Advances in Neural Information Processing Systems, pp.\ 782--795, 2022 b

2022
[3]

L., Jordan, M

Bartlett, P. L., Jordan, M. I., and McAuliffe, J. D. Convexity, classification, and risk bounds. Journal of the American Statistical Association, 101 0 (473): 0 138--156, 2006

2006
[4]

L., Foster, D

Bartlett, P. L., Foster, D. J., and Telgarsky, M. Spectrally-normalized margin bounds for neural networks. CoRR, abs/1706.08498, 2017

work page arXiv 2017
[5]

Sparks of Artificial General Intelligence: Early experiments with GPT-4

Bubeck, S., Chandrasekaran, V., Eldan, R., Gehrke, J., Horvitz, E., Kamar, E., Lee, P., Lee, Y. T., Li, Y., Lundberg, S., et al. Sparks of artificial general intelligence: Early experiments with gpt-4. arXiv preprint arXiv:2303.12712, 2023

work page internal anchor Pith review arXiv 2023
[6]

Learning imbalanced datasets with label-distribution-aware margin loss

Cao, K., Wei, C., Gaidon, A., Arechiga, N., and Ma, T. Learning imbalanced datasets with label-distribution-aware margin loss. In Advances in Neural Information Processing Systems, 2019

2019
[7]

Generalizing consistent multi-class classification with rejection to be compatible with arbitrary losses

Cao, Y., Cai, T., Feng, L., Gu, L., Gu, J., An, B., Niu, G., and Sugiyama, M. Generalizing consistent multi-class classification with rejection to be compatible with arbitrary losses. In Advances in Neural Information Processing Systems, 2022

2022
[8]

Sample efficient learning of predictors that complement humans

Charusaie, M.-A., Mozannar, H., Sontag, D., and Samadi, S. Sample efficient learning of predictors that complement humans. In International Conference on Machine Learning, pp.\ 2972--3005, 2022

2022
[9]

V., Bowyer, K

Chawla, N. V., Bowyer, K. W., Hall, L. O., and Kegelmeyer, W. P. SMOTE : synthetic minority over-sampling technique. Journal of Artificial Intelligence Research, 16: 0 321--357, 2002

2002
[10]

Regression with cost-based rejection

Cheng, X., Cao, Y., Wang, H., Wei, H., An, B., and Feng, L. Regression with cost-based rejection. In Advances in Neural Information Processing Systems, 2023

2023
[11]

Learning with rejection

Cortes, C., DeSalvo, G., and Mohri, M. Learning with rejection. In International Conference on Algorithmic Learning Theory, pp.\ 67--82, 2016 a

2016
[12]

Boosting with abstention

Cortes, C., DeSalvo, G., and Mohri, M. Boosting with abstention. In Advances in Neural Information Processing Systems, pp.\ 1660--1668, 2016 b

2016
[13]

Structured prediction theory based on factor graph complexity

Cortes, C., Kuznetsov, V., Mohri, M., and Yang, S. Structured prediction theory based on factor graph complexity. In Advances in Neural Information Processing Systems, 2016 c

2016
[14]

Adanet: Adaptive structural learning of artificial neural networks

Cortes, C., Gonzalvo, X., Kuznetsov, V., Mohri, M., and Yang, S. Adanet: Adaptive structural learning of artificial neural networks. In International Conference on Machine Learning, pp.\ 874--883, 2017

2017
[15]

Theory and algorithms for learning with rejection in binary classification

Cortes, C., DeSalvo, G., and Mohri, M. Theory and algorithms for learning with rejection in binary classification. Annals of Mathematics and Artificial Intelligence, 92 0 (2): 0 277--315, 2024 a

2024
[16]

Cardinality-aware set prediction and top- k classification

Cortes, C., Mao, A., Mohri, C., Mohri, M., and Zhong, Y. Cardinality-aware set prediction and top- k classification. In Advances in Neural Information Processing Systems, 2024 b

2024
[17]

Balancing the scales: A theoretical and algorithmic framework for learning from imbalanced data

Cortes, C., Mao, A., Mohri, M., and Zhong, Y. Balancing the scales: A theoretical and algorithmic framework for learning from imbalanced data. In International Conference on Machine Learning, 2025

2025
[18]

A theoretical framework for modular learning of robust generative models

Cortes, C., Mohri, M., and Zhong, Y. A theoretical framework for modular learning of robust generative models. In International Conference on Machine Learning, 2026

2026
[19]

Parametric contrastive learning

Cui, J., Zhong, Z., Liu, S., Yu, B., and Jia, J. Parametric contrastive learning. In International Conference on Computer Vision, 2021

2021
[20]

Reslt: Residual learning for long-tailed recognition

Cui, J., Liu, S., Tian, Z., Zhong, Z., and Jia, J. Reslt: Residual learning for long-tailed recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022

2022
[21]

Class-balanced loss based on effective number of samples

Cui, Y., Jia, M., Lin, T.-Y., Song, Y., and Belongie, S. Class-balanced loss based on effective number of samples. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp.\ 9268--9277, 2019

2019
[22]

Regression under human assistance

De, A., Koley, P., Ganguly, N., and Gomez-Rodriguez, M. Regression under human assistance. In Proceedings of the AAAI Conference on Artificial Intelligence, pp.\ 2611--2620, 2020

2020
[23]

Budgeted multiple-expert deferral

DeSalvo, G., Mohri, C., Mohri, M., and Zhong, Y. Budgeted multiple-expert deferral. arXiv preprint arXiv:2510.26706, 2025

work page arXiv 2025
[24]

Simpro: A simple probabilistic framework towards realistic long-tailed semi-supervised learning

Du, C., Han, Y., and Huang, G. Simpro: A simple probabilistic framework towards realistic long-tailed semi-supervised learning. In International Conference on Machine Learning, 2024

2024
[25]

and Wiener, Y

El-Yaniv, R. and Wiener, Y. Active learning via perfect selective classification. Journal of Machine Learning Research, 13 0 (2), 2012

2012
[26]

El-Yaniv, R. et al. On the foundations of noise-free selective classification. Journal of Machine Learning Research, 11 0 (5), 2010

2010
[27]

The foundations of cost-sensitive learning

Elkan, C. The foundations of cost-sensitive learning. In International Joint Conference on Artificial Intelligence, 2001

2001
[28]

A multiple resampling method for learning from imbalanced data sets

Estabrooks, A., Jo, T., and Japkowicz, N. A multiple resampling method for learning from imbalanced data sets. Computational Intelligence, 20 0 (1): 0 18--36, 2004

2004
[29]

Learning with average top-k loss

Fan, Y., Lyu, S., Ying, Y., and Hu, B. Learning with average top-k loss. In Advances in Neural Information Processing Systems, pp.\ 497--505, 2017

2017
[30]

Gabidolla, M., Zharmagambetov, A., and Carreira - Perpi \ n \' a n, M. \' A . Beyond the ROC curve: Classification trees using cost-optimal curves, with application to imbalanced datasets. In International Conference on Machine Learning, 2024

2024
[31]

Enhancing minority classes by mixing: an adaptative optimal transport approach for long-tailed classification

Gao, J., Zhao, H., Li, Z., and Guo, D. Enhancing minority classes by mixing: an adaptative optimal transport approach for long-tailed classification. In Advances in Neural Information Processing Systems, 2023

2023
[32]

Distribution alignment optimization through neural collapse for long-tailed classification

Gao, J., Zhao, H., dan Guo, D., and Zha, H. Distribution alignment optimization through neural collapse for long-tailed classification. In International Conference on Machine Learning, 2024

2024
[33]

and El-Yaniv, R

Geifman, Y. and El-Yaniv, R. Selective classification for deep neural networks. In Advances in Neural Information Processing Systems, 2017

2017
[34]

and El-Yaniv, R

Geifman, Y. and El-Yaniv, R. Selectivenet: A deep neural network with an integrated reject option. In International Conference on Machine Learning, pp.\ 2151--2159, 2019

2019
[35]

Wrapped cauchy distributed angular softmax for long-tailed visual recognition

Han, B. Wrapped cauchy distributed angular softmax for long-tailed visual recognition. In International Conference on Machine Learning, pp.\ 12368--12388, 2023

2023
[36]

Borderline-smote: a new over-sampling method in imbalanced data sets learning

Han, H., Wang, W.-Y., and Mao, B.-H. Borderline-smote: a new over-sampling method in imbalanced data sets learning. In International Conference on Intelligent Computing, pp.\ 878--887, 2005

2005
[37]

Deep residual learning for image recognition

He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp.\ 770--778, 2016

2016
[38]

Pengcheng He, Jianfeng Gao, and Weizhu Chen

He, P., Gao, J., and Chen, W. Debertav3: Improving deberta using electra-style pre-training with gradient-disentangled embedding sharing. arXiv preprint arXiv:2111.09543, 2021

work page arXiv 2021
[39]

Measuring Massive Multitask Language Understanding

Hendrycks, D., Burns, C., Basart, S., Zou, A., Mazeika, M., Song, D., and Steinhardt, J. Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300, 2020

work page internal anchor Pith review arXiv 2009
[40]

Disentangling label distribution for long-tailed visual recognition

Hong, Y., Han, S., Choi, K., Seo, S., Kim, B., and Chang, B. Disentangling label distribution for long-tailed visual recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2021

2021
[41]

Cost-sensitive support vector machines

Iranmehr, A., Masnadi - Shirazi, H., and Vasconcelos, N. Cost-sensitive support vector machines. Neurocomputing, 343: 0 50--64, 2019

2019
[42]

A., Brown, M., Yang, M.-H., Wang, L., and Gong, B

Jamal, M. A., Brown, M., Yang, M.-H., Wang, L., and Gong, B. Rethinking class-balanced methods for long-tailed visual recognition from a domain adaptation perspective. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp.\ 7610--7619, 2020

2020
[43]

Risk-controlled selective prediction for regression deep neural network models

Jiang, W., Zhao, Y., and Wang, Z. Risk-controlled selective prediction for regression deep neural network models. In International Joint Conference on Neural Networks, pp.\ 1--8, 2020

2020
[44]

Balanced meta-softmax for long-tailed visual recognition

Jiawei, R., Yu, C., Ma, X., Zhao, H., Yi, S., et al. Balanced meta-softmax for long-tailed visual recognition. In Advances in Neural Information Processing Systems, 2020

2020
[45]

Decoupling representation and classifier for long-tailed recognition

Kang, B., Xie, S., Rohrbach, M., Yan, Z., Gordo, A., Feng, J., and Kalantidis, Y. Decoupling representation and classifier for long-tailed recognition. In International Conference on Learning Representations, 2020

2020
[46]

Maximum class separation as inductive bias in one matrix

Kasarla, T., Burghouts, G., Van Spengler, M., Van Der Pol, E., Cucchiara, R., and Mettes, P. Maximum class separation as inductive bias in one matrix. In Advances in Neural Information Processing Systems, pp.\ 19553--19566, 2022

2022
[47]

Towards unbiased and accurate deferral to multiple experts

Keswani, V., Lease, M., and Kenthapadi, K. Towards unbiased and accurate deferral to multiple experts. In Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society, pp.\ 154--165, 2021

2021
[48]

W., Shen, J., and Shao, L

Khan, S., Hayat, M., Zamir, S. W., Shen, J., and Shao, L. Striking the right balance with uncertainty. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp.\ 103--112, 2019

2019
[49]

Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014

work page internal anchor Pith review arXiv 2014
[50]

R., Paraskevas, O., Oymak, S., and Thrampoulidis, C

Kini, G. R., Paraskevas, O., Oymak, S., and Thrampoulidis, C. Label-imbalanced and group-sensitive classification under overparameterization. In Advances in Neural Information Processing Systems, pp.\ 18970--18983, 2021

2021
[51]

Learning multiple layers of features from tiny images

Krizhevsky, A. Learning multiple layers of features from tiny images. Technical report, Toronto University, 2009

2009
[52]

and Matwin, S

Kubat, M. and Matwin, S. Addressing the curse of imbalanced training sets: One-sided selection. In International Conference on Machine Learning, 1997

1997
[53]

and Yang, X

Le, Y. and Yang, X. Tiny imagenet visual recognition challenge. CS 231N, 7 0 (7): 0 3, 2015

2015
[54]

Size-invariance matters: Rethinking metrics and losses for imbalanced multi-object salient object detection

Li, F., Xu, Q., Bao, S., Yang, Z., Cong, R., Cao, X., and Huang, Q. Size-invariance matters: Rethinking metrics and losses for imbalanced multi-object salient object detection. In International Conference on Machine Learning, 2024

2024
[55]

When no-rejection learning is optimal for regression with rejection

Li, X., Liu, S., Sun, C., and Wang, H. When no-rejection learning is optimal for regression with rejection. arXiv preprint arXiv:2307.02932, 2023

work page arXiv 2023
[56]

Focal loss for dense object detection

Lin, T.-Y., Goyal, P., Girshick, R., He, K., and Doll \'a r, P. Focal loss for dense object detection. In International Conference on Computer Vision, pp.\ 2980--2988, 2017

2017
[57]

Incorporating uncertainty in learning to defer algorithms for safe computer-aided diagnosis

Liu, J., Gallego, B., and Barbieri, S. Incorporating uncertainty in learning to defer algorithms for safe computer-aided diagnosis. Scientific Reports, 12 0 (1): 0 1762, 2022

2022
[58]

Elta: An enhancer against long-tail for aesthetics-oriented models

Liu, L., He, S., Ming, A., Xie, R., and Ma, H. Elta: An enhancer against long-tail for aesthetics-oriented models. In International Conference on Machine Learning, 2024

2024
[59]

Exploratory undersampling for class-imbalance learning

Liu, X.-Y., Wu, J., and Zhou, Z.-H. Exploratory undersampling for class-imbalance learning. IEEE Transactions on Systems, Man, and Cybernetics, 39 0 (2): 0 539--550, 2008

2008
[60]

Liu, Z., Miao, Z., Zhan, X., Wang, J., Gong, B., and Yu, S. X. Large-scale long-tailed recognition in an open world. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp.\ 2537--2546, 2019

2019
[61]

Decoupled Weight Decay Regularization

Loshchilov, I. and Hutter, F. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017

work page internal anchor Pith review arXiv 2017
[62]

Learning adversarially fair and transferable representations

Madras, D., Creager, E., Pitassi, T., and Zemel, R. Learning adversarially fair and transferable representations. arXiv preprint arXiv:1802.06309, 2018

work page arXiv 2018
[63]

Theory and Algorithms for Learning with Multi-Class Abstention and Multi-Expert Deferral

Mao, A. Theory and Algorithms for Learning with Multi-Class Abstention and Multi-Expert Deferral. PhD thesis, New York University, 2025

2025
[64]

Two-stage learning to defer with multiple experts

Mao, A., Mohri, C., Mohri, M., and Zhong, Y. Two-stage learning to defer with multiple experts. In Advances in Neural Information Processing Systems, 2023 a

2023
[65]

H -consistency bounds: Characterization and extensions

Mao, A., Mohri, M., and Zhong, Y. H -consistency bounds: Characterization and extensions. In Advances in Neural Information Processing Systems, 2023 b

2023
[66]

H -consistency bounds for pairwise misranking loss surrogates

Mao, A., Mohri, M., and Zhong, Y. H -consistency bounds for pairwise misranking loss surrogates. In International Conference on Machine learning, 2023 c

2023
[67]

Ranking with abstention

Mao, A., Mohri, M., and Zhong, Y. Ranking with abstention. In ICML 2023 Workshop The Many Facets of Preference-Based Learning, 2023 d

2023
[68]

Cross-entropy loss functions: Theoretical analysis and applications

Mao, A., Mohri, M., and Zhong, Y. Cross-entropy loss functions: Theoretical analysis and applications. In International Conference on Machine Learning, 2023 e

2023
[69]

Structured prediction with stronger consistency guarantees

Mao, A., Mohri, M., and Zhong, Y. Structured prediction with stronger consistency guarantees. In Advances in Neural Information Processing Systems, pp.\ 46903--46937, 2023 f

2023
[70]

Principled approaches for learning to defer with multiple experts

Mao, A., Mohri, M., and Zhong, Y. Principled approaches for learning to defer with multiple experts. In International Symposium on Artificial Intelligence and Mathematics, 2024 a

2024
[71]

Predictor-rejector multi-class abstention: Theoretical analysis and algorithms

Mao, A., Mohri, M., and Zhong, Y. Predictor-rejector multi-class abstention: Theoretical analysis and algorithms. In International Conference on Algorithmic Learning Theory, pp.\ 822--867, 2024 b

2024
[72]

Theoretically grounded loss functions and algorithms for score-based multi-class abstention

Mao, A., Mohri, M., and Zhong, Y. Theoretically grounded loss functions and algorithms for score-based multi-class abstention. In International Conference on Artificial Intelligence and Statistics, pp.\ 4753--4761, 2024 c

2024
[73]

H -consistency guarantees for regression

Mao, A., Mohri, M., and Zhong, Y. H -consistency guarantees for regression. In International Conference on Machine Learning, pp.\ 34712--34737, 2024 d

2024
[74]

Multi-label learning with stronger consistency guarantees

Mao, A., Mohri, M., and Zhong, Y. Multi-label learning with stronger consistency guarantees. In Advances in Neural Information Processing Systems, 2024 e

2024
[75]

Regression with multi-expert deferral

Mao, A., Mohri, M., and Zhong, Y. Regression with multi-expert deferral. In International Conference on Machine Learning, pp.\ 34738--34759, 2024 f

2024
[76]

A universal growth rate for learning with smooth surrogate losses

Mao, A., Mohri, M., and Zhong, Y. A universal growth rate for learning with smooth surrogate losses. In Advances in Neural Information Processing Systems, 2024 g

2024
[77]

Realizable H -consistent and B ayes-consistent loss functions for learning to defer

Mao, A., Mohri, M., and Zhong, Y. Realizable H -consistent and B ayes-consistent loss functions for learning to defer. In Advances in Neural Information Processing Systems, 2024 h

2024
[78]

Mastering multiple-expert routing: Realizable H -consistency and strong guarantees for learning to defer

Mao, A., Mohri, M., and Zhong, Y. Mastering multiple-expert routing: Realizable H -consistency and strong guarantees for learning to defer. In International Conference on Machine Learning, 2025 a

2025
[79]

Principled algorithms for optimizing generalized metrics in binary classification

Mao, A., Mohri, M., and Zhong, Y. Principled algorithms for optimizing generalized metrics in binary classification. In International Conference on Machine Learning, 2025 b

2025
[80]

Enhanced -consistency bounds

Mao, A., Mohri, M., and Zhong, Y. Enhanced -consistency bounds. In International Conference on Algorithmic Learning Theory, 2025 c

2025

Showing first 80 references.