pith. machine review for the scientific record. sign in

arxiv: 2604.27723 · v1 · submitted 2026-04-30 · 💻 cs.LG · stat.ML

Recognition: unknown

Optimized Deferral for Imbalanced Settings

Anqi Mao, Corinna Cortes, Mehryar Mohri, Yutao Zhong

Authors on Pith no claims yet

Pith reviewed 2026-05-07 07:38 UTC · model grok-4.3

classification 💻 cs.LG stat.ML
keywords learning to deferexpert imbalancecost-sensitive learningmargin-based lossesMILD algorithmLLM routingtwo-stage deferralimbalanced classification
0
0 comments X

The pith

Casting deferral optimization as cost-sensitive learning over input-expert pairs yields algorithms that handle expert imbalance better.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tackles the expert imbalance problem in two-stage learning to defer, where standard methods tend to route everything to the most common expert and produce suboptimal accuracy. It reframes the entire deferral loss as a cost-sensitive classification task defined on the joint domain of inputs and experts rather than on inputs alone. From this view the authors derive fresh margin-based surrogate losses together with generalization guarantees, then build the MILD algorithm that puts these losses to work. The resulting deferral decisions improve both error rates and resource use in settings such as routing queries among several LLMs or classifiers. The work matters because many practical deferral pipelines already possess a fixed collection of experts whose usage frequencies are naturally skewed.

Core claim

We cast the deferral loss optimization as a novel cost-sensitive learning problem over the input-expert domain. We derive new margin-based loss functions and guarantees tailored to this setting, and develop novel algorithms for cost-sensitive learning. Leveraging these results, we design principled deferral algorithms, MILD (Margin-based Imbalanced Learning to Defer), specifically suited for expert imbalance settings.

What carries the argument

The cost-sensitive learning formulation over the input-expert domain, which turns deferral into a weighted classification problem whose weights encode expert imbalance and enables margin-based losses with provable guarantees.

If this is right

  • MILD outperforms existing deferral baselines on image classification tasks that exhibit expert imbalance.
  • MILD improves routing accuracy and efficiency when directing queries to collections of LLMs.
  • The new margin-based losses supply generalization bounds specific to the imbalanced deferral setting.
  • The cost-sensitive algorithms developed for the input-expert domain can be reused for other routing or selection tasks with skewed expert usage.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same input-expert cost-sensitive view could be applied to deferral pipelines that involve more than one deferral stage.
  • Similar cost-sensitive reductions might help balance load across heterogeneous models in distributed inference systems.
  • Empirical tests that vary the degree of imbalance while holding other factors fixed would clarify how MILD scales with skew severity.
  • The approach suggests examining whether cost-sensitive losses can also mitigate imbalance when the experts themselves are being trained rather than fixed.

Load-bearing premise

That treating deferral as cost-sensitive classification in the input-expert product space produces loss functions and algorithms whose theoretical guarantees translate into measurable gains on real imbalanced data.

What would settle it

Train MILD and standard two-stage deferral baselines on a controlled dataset with known expert imbalance, then measure whether MILD produces a statistically significant drop in overall error rate or deferral cost; failure to do so would refute the central claim.

Figures

Figures reproduced from arXiv: 2604.27723 by Anqi Mao, Corinna Cortes, Mehryar Mohri, Yutao Zhong.

Figure 1
Figure 1. Figure 1: 1-D example with 4 classes (colored densities) and 3 experts (gray lines indicating accuracies). Experts have increasing costs indicated by darker shades of gray. not class labels. A 1-D example is provided in view at source ↗
read the original abstract

Learning algorithms can be significantly improved by routing complex or uncertain inputs to specialized experts, balancing accuracy with computational cost. This approach, known as learning to defer, is essential in domains like natural language generation, medical diagnosis, and computer vision, where an effective deferral can reduce errors at low extra resource consumption. However, the two-stage learning to defer setting, which leverages existing predictors such as a collection of LLMs or other classifiers, often faces challenges due to an expert imbalance problem. This imbalance can lead to suboptimal performance, with deferral algorithms favoring the majority expert. We present a comprehensive study of two-stage learning to defer in expert imbalance settings. We cast the deferral loss optimization as a novel cost-sensitive learning problem over the input-expert domain. We derive new margin-based loss functions and guarantees tailored to this setting, and develop novel algorithms for cost-sensitive learning. Leveraging these results, we design principled deferral algorithms, MILD (Margin-based Imbalanced Learning to Defer), specifically suited for expert imbalance settings. Extensive experiments demonstrate the effectiveness of our approach, showing clear improvements over existing baselines on both image classification and real-world Large Language Model (LLM) routing tasks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 4 minor

Summary. The manuscript examines two-stage learning to defer under expert imbalance, where policies tend to favor majority experts. It reformulates deferral optimization as cost-sensitive learning over the joint input-expert domain, derives new margin-based surrogate losses together with generalization guarantees, develops supporting algorithms for cost-sensitive learning, and introduces the MILD algorithm. Experiments on image classification and LLM routing tasks are reported to show improvements over existing baselines.

Significance. If the derived margin-based losses and associated guarantees are valid, the work supplies a principled, cost-sensitive treatment of expert imbalance that is directly relevant to practical deferral settings such as LLM routing. The modeling choice of operating in the input-expert space is coherent with existing cost-sensitive techniques and the empirical evaluation on both vision and language tasks provides concrete evidence of utility. The derivation of tailored losses and the focus on imbalance constitute the primary contributions.

minor comments (4)
  1. [§3.1] §3.1, Definition 1: the cost matrix C(x, e) is introduced without an explicit statement of how the imbalance ratios are encoded; a short paragraph clarifying the mapping from observed expert frequencies to the cost entries would improve readability.
  2. [§4.2] §4.2, Theorem 2: the generalization bound is stated in terms of the Rademacher complexity of the joint hypothesis class; it would be helpful to include a brief comparison (one sentence) to the corresponding bound for standard cost-sensitive classification to highlight the novelty of the input-expert formulation.
  3. [Figure 3] Figure 3 and Table 2: axis labels and legend entries use inconsistent abbreviations (e.g., “MILD” vs. “Mild”); uniform notation across all figures and tables is needed.
  4. [§5.3] §5.3: the LLM routing experiments report accuracy and deferral rate but do not include a statistical significance test across the five random seeds; adding p-values or confidence intervals would strengthen the empirical claims.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their careful reading of the manuscript and for the positive evaluation, including the accurate summary of our contributions and the recommendation for minor revision. The referee correctly identifies the core technical approach: reformulating two-stage deferral under expert imbalance as cost-sensitive learning over the input-expert domain, deriving margin-based surrogate losses with generalization guarantees, and introducing the MILD algorithm. We will incorporate any minor suggestions in the revised version.

Circularity Check

0 steps flagged

No significant circularity; derivation self-contained via standard cost-sensitive modeling

full rationale

The paper casts deferral loss optimization as a cost-sensitive learning problem over the input-expert domain, then derives new margin-based losses, guarantees, and the MILD algorithm from that formulation. This modeling step is a coherent extension of existing imbalance-handling techniques rather than a self-definitional loop or a fitted parameter renamed as a prediction. The abstract explicitly presents the losses and algorithms as derived results, with no indication that they reduce by construction to the input data or to prior self-citations that bear the central claim. Experiments on image classification and LLM routing tasks supply external validation outside the derivation. No load-bearing equation or uniqueness theorem is shown to collapse to its own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no details on parameters axioms or new entities.

pith-pipeline@v0.9.0 · 8601 in / 927 out tokens · 61683 ms · 2026-05-07T07:38:12.480714+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

143 extracted references · 16 canonical work pages · 5 internal anchors

  1. [1]

    H -consistency bounds for surrogate loss minimizers

    Awasthi, P., Mao, A., Mohri, M., and Zhong, Y. H -consistency bounds for surrogate loss minimizers. In International Conference on Machine Learning, pp.\ 1117--1174, 2022 a

  2. [2]

    Multi-class H -consistency bounds

    Awasthi, P., Mao, A., Mohri, M., and Zhong, Y. Multi-class H -consistency bounds. In Advances in Neural Information Processing Systems, pp.\ 782--795, 2022 b

  3. [3]

    L., Jordan, M

    Bartlett, P. L., Jordan, M. I., and McAuliffe, J. D. Convexity, classification, and risk bounds. Journal of the American Statistical Association, 101 0 (473): 0 138--156, 2006

  4. [4]

    L., Foster, D

    Bartlett, P. L., Foster, D. J., and Telgarsky, M. Spectrally-normalized margin bounds for neural networks. CoRR, abs/1706.08498, 2017

  5. [5]

    Sparks of Artificial General Intelligence: Early experiments with GPT-4

    Bubeck, S., Chandrasekaran, V., Eldan, R., Gehrke, J., Horvitz, E., Kamar, E., Lee, P., Lee, Y. T., Li, Y., Lundberg, S., et al. Sparks of artificial general intelligence: Early experiments with gpt-4. arXiv preprint arXiv:2303.12712, 2023

  6. [6]

    Learning imbalanced datasets with label-distribution-aware margin loss

    Cao, K., Wei, C., Gaidon, A., Arechiga, N., and Ma, T. Learning imbalanced datasets with label-distribution-aware margin loss. In Advances in Neural Information Processing Systems, 2019

  7. [7]

    Generalizing consistent multi-class classification with rejection to be compatible with arbitrary losses

    Cao, Y., Cai, T., Feng, L., Gu, L., Gu, J., An, B., Niu, G., and Sugiyama, M. Generalizing consistent multi-class classification with rejection to be compatible with arbitrary losses. In Advances in Neural Information Processing Systems, 2022

  8. [8]

    Sample efficient learning of predictors that complement humans

    Charusaie, M.-A., Mozannar, H., Sontag, D., and Samadi, S. Sample efficient learning of predictors that complement humans. In International Conference on Machine Learning, pp.\ 2972--3005, 2022

  9. [9]

    V., Bowyer, K

    Chawla, N. V., Bowyer, K. W., Hall, L. O., and Kegelmeyer, W. P. SMOTE : synthetic minority over-sampling technique. Journal of Artificial Intelligence Research, 16: 0 321--357, 2002

  10. [10]

    Regression with cost-based rejection

    Cheng, X., Cao, Y., Wang, H., Wei, H., An, B., and Feng, L. Regression with cost-based rejection. In Advances in Neural Information Processing Systems, 2023

  11. [11]

    Learning with rejection

    Cortes, C., DeSalvo, G., and Mohri, M. Learning with rejection. In International Conference on Algorithmic Learning Theory, pp.\ 67--82, 2016 a

  12. [12]

    Boosting with abstention

    Cortes, C., DeSalvo, G., and Mohri, M. Boosting with abstention. In Advances in Neural Information Processing Systems, pp.\ 1660--1668, 2016 b

  13. [13]

    Structured prediction theory based on factor graph complexity

    Cortes, C., Kuznetsov, V., Mohri, M., and Yang, S. Structured prediction theory based on factor graph complexity. In Advances in Neural Information Processing Systems, 2016 c

  14. [14]

    Adanet: Adaptive structural learning of artificial neural networks

    Cortes, C., Gonzalvo, X., Kuznetsov, V., Mohri, M., and Yang, S. Adanet: Adaptive structural learning of artificial neural networks. In International Conference on Machine Learning, pp.\ 874--883, 2017

  15. [15]

    Theory and algorithms for learning with rejection in binary classification

    Cortes, C., DeSalvo, G., and Mohri, M. Theory and algorithms for learning with rejection in binary classification. Annals of Mathematics and Artificial Intelligence, 92 0 (2): 0 277--315, 2024 a

  16. [16]

    Cardinality-aware set prediction and top- k classification

    Cortes, C., Mao, A., Mohri, C., Mohri, M., and Zhong, Y. Cardinality-aware set prediction and top- k classification. In Advances in Neural Information Processing Systems, 2024 b

  17. [17]

    Balancing the scales: A theoretical and algorithmic framework for learning from imbalanced data

    Cortes, C., Mao, A., Mohri, M., and Zhong, Y. Balancing the scales: A theoretical and algorithmic framework for learning from imbalanced data. In International Conference on Machine Learning, 2025

  18. [18]

    A theoretical framework for modular learning of robust generative models

    Cortes, C., Mohri, M., and Zhong, Y. A theoretical framework for modular learning of robust generative models. In International Conference on Machine Learning, 2026

  19. [19]

    Parametric contrastive learning

    Cui, J., Zhong, Z., Liu, S., Yu, B., and Jia, J. Parametric contrastive learning. In International Conference on Computer Vision, 2021

  20. [20]

    Reslt: Residual learning for long-tailed recognition

    Cui, J., Liu, S., Tian, Z., Zhong, Z., and Jia, J. Reslt: Residual learning for long-tailed recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022

  21. [21]

    Class-balanced loss based on effective number of samples

    Cui, Y., Jia, M., Lin, T.-Y., Song, Y., and Belongie, S. Class-balanced loss based on effective number of samples. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp.\ 9268--9277, 2019

  22. [22]

    Regression under human assistance

    De, A., Koley, P., Ganguly, N., and Gomez-Rodriguez, M. Regression under human assistance. In Proceedings of the AAAI Conference on Artificial Intelligence, pp.\ 2611--2620, 2020

  23. [23]

    Budgeted multiple-expert deferral

    DeSalvo, G., Mohri, C., Mohri, M., and Zhong, Y. Budgeted multiple-expert deferral. arXiv preprint arXiv:2510.26706, 2025

  24. [24]

    Simpro: A simple probabilistic framework towards realistic long-tailed semi-supervised learning

    Du, C., Han, Y., and Huang, G. Simpro: A simple probabilistic framework towards realistic long-tailed semi-supervised learning. In International Conference on Machine Learning, 2024

  25. [25]

    and Wiener, Y

    El-Yaniv, R. and Wiener, Y. Active learning via perfect selective classification. Journal of Machine Learning Research, 13 0 (2), 2012

  26. [26]

    El-Yaniv, R. et al. On the foundations of noise-free selective classification. Journal of Machine Learning Research, 11 0 (5), 2010

  27. [27]

    The foundations of cost-sensitive learning

    Elkan, C. The foundations of cost-sensitive learning. In International Joint Conference on Artificial Intelligence, 2001

  28. [28]

    A multiple resampling method for learning from imbalanced data sets

    Estabrooks, A., Jo, T., and Japkowicz, N. A multiple resampling method for learning from imbalanced data sets. Computational Intelligence, 20 0 (1): 0 18--36, 2004

  29. [29]

    Learning with average top-k loss

    Fan, Y., Lyu, S., Ying, Y., and Hu, B. Learning with average top-k loss. In Advances in Neural Information Processing Systems, pp.\ 497--505, 2017

  30. [30]

    Gabidolla, M., Zharmagambetov, A., and Carreira - Perpi \ n \' a n, M. \' A . Beyond the ROC curve: Classification trees using cost-optimal curves, with application to imbalanced datasets. In International Conference on Machine Learning, 2024

  31. [31]

    Enhancing minority classes by mixing: an adaptative optimal transport approach for long-tailed classification

    Gao, J., Zhao, H., Li, Z., and Guo, D. Enhancing minority classes by mixing: an adaptative optimal transport approach for long-tailed classification. In Advances in Neural Information Processing Systems, 2023

  32. [32]

    Distribution alignment optimization through neural collapse for long-tailed classification

    Gao, J., Zhao, H., dan Guo, D., and Zha, H. Distribution alignment optimization through neural collapse for long-tailed classification. In International Conference on Machine Learning, 2024

  33. [33]

    and El-Yaniv, R

    Geifman, Y. and El-Yaniv, R. Selective classification for deep neural networks. In Advances in Neural Information Processing Systems, 2017

  34. [34]

    and El-Yaniv, R

    Geifman, Y. and El-Yaniv, R. Selectivenet: A deep neural network with an integrated reject option. In International Conference on Machine Learning, pp.\ 2151--2159, 2019

  35. [35]

    Wrapped cauchy distributed angular softmax for long-tailed visual recognition

    Han, B. Wrapped cauchy distributed angular softmax for long-tailed visual recognition. In International Conference on Machine Learning, pp.\ 12368--12388, 2023

  36. [36]

    Borderline-smote: a new over-sampling method in imbalanced data sets learning

    Han, H., Wang, W.-Y., and Mao, B.-H. Borderline-smote: a new over-sampling method in imbalanced data sets learning. In International Conference on Intelligent Computing, pp.\ 878--887, 2005

  37. [37]

    Deep residual learning for image recognition

    He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp.\ 770--778, 2016

  38. [38]

    Pengcheng He, Jianfeng Gao, and Weizhu Chen

    He, P., Gao, J., and Chen, W. Debertav3: Improving deberta using electra-style pre-training with gradient-disentangled embedding sharing. arXiv preprint arXiv:2111.09543, 2021

  39. [39]

    Measuring Massive Multitask Language Understanding

    Hendrycks, D., Burns, C., Basart, S., Zou, A., Mazeika, M., Song, D., and Steinhardt, J. Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300, 2020

  40. [40]

    Disentangling label distribution for long-tailed visual recognition

    Hong, Y., Han, S., Choi, K., Seo, S., Kim, B., and Chang, B. Disentangling label distribution for long-tailed visual recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2021

  41. [41]

    Cost-sensitive support vector machines

    Iranmehr, A., Masnadi - Shirazi, H., and Vasconcelos, N. Cost-sensitive support vector machines. Neurocomputing, 343: 0 50--64, 2019

  42. [42]

    A., Brown, M., Yang, M.-H., Wang, L., and Gong, B

    Jamal, M. A., Brown, M., Yang, M.-H., Wang, L., and Gong, B. Rethinking class-balanced methods for long-tailed visual recognition from a domain adaptation perspective. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp.\ 7610--7619, 2020

  43. [43]

    Risk-controlled selective prediction for regression deep neural network models

    Jiang, W., Zhao, Y., and Wang, Z. Risk-controlled selective prediction for regression deep neural network models. In International Joint Conference on Neural Networks, pp.\ 1--8, 2020

  44. [44]

    Balanced meta-softmax for long-tailed visual recognition

    Jiawei, R., Yu, C., Ma, X., Zhao, H., Yi, S., et al. Balanced meta-softmax for long-tailed visual recognition. In Advances in Neural Information Processing Systems, 2020

  45. [45]

    Decoupling representation and classifier for long-tailed recognition

    Kang, B., Xie, S., Rohrbach, M., Yan, Z., Gordo, A., Feng, J., and Kalantidis, Y. Decoupling representation and classifier for long-tailed recognition. In International Conference on Learning Representations, 2020

  46. [46]

    Maximum class separation as inductive bias in one matrix

    Kasarla, T., Burghouts, G., Van Spengler, M., Van Der Pol, E., Cucchiara, R., and Mettes, P. Maximum class separation as inductive bias in one matrix. In Advances in Neural Information Processing Systems, pp.\ 19553--19566, 2022

  47. [47]

    Towards unbiased and accurate deferral to multiple experts

    Keswani, V., Lease, M., and Kenthapadi, K. Towards unbiased and accurate deferral to multiple experts. In Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society, pp.\ 154--165, 2021

  48. [48]

    W., Shen, J., and Shao, L

    Khan, S., Hayat, M., Zamir, S. W., Shen, J., and Shao, L. Striking the right balance with uncertainty. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp.\ 103--112, 2019

  49. [49]

    Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014

  50. [50]

    R., Paraskevas, O., Oymak, S., and Thrampoulidis, C

    Kini, G. R., Paraskevas, O., Oymak, S., and Thrampoulidis, C. Label-imbalanced and group-sensitive classification under overparameterization. In Advances in Neural Information Processing Systems, pp.\ 18970--18983, 2021

  51. [51]

    Learning multiple layers of features from tiny images

    Krizhevsky, A. Learning multiple layers of features from tiny images. Technical report, Toronto University, 2009

  52. [52]

    and Matwin, S

    Kubat, M. and Matwin, S. Addressing the curse of imbalanced training sets: One-sided selection. In International Conference on Machine Learning, 1997

  53. [53]

    and Yang, X

    Le, Y. and Yang, X. Tiny imagenet visual recognition challenge. CS 231N, 7 0 (7): 0 3, 2015

  54. [54]

    Size-invariance matters: Rethinking metrics and losses for imbalanced multi-object salient object detection

    Li, F., Xu, Q., Bao, S., Yang, Z., Cong, R., Cao, X., and Huang, Q. Size-invariance matters: Rethinking metrics and losses for imbalanced multi-object salient object detection. In International Conference on Machine Learning, 2024

  55. [55]

    When no-rejection learning is optimal for regression with rejection

    Li, X., Liu, S., Sun, C., and Wang, H. When no-rejection learning is optimal for regression with rejection. arXiv preprint arXiv:2307.02932, 2023

  56. [56]

    Focal loss for dense object detection

    Lin, T.-Y., Goyal, P., Girshick, R., He, K., and Doll \'a r, P. Focal loss for dense object detection. In International Conference on Computer Vision, pp.\ 2980--2988, 2017

  57. [57]

    Incorporating uncertainty in learning to defer algorithms for safe computer-aided diagnosis

    Liu, J., Gallego, B., and Barbieri, S. Incorporating uncertainty in learning to defer algorithms for safe computer-aided diagnosis. Scientific Reports, 12 0 (1): 0 1762, 2022

  58. [58]

    Elta: An enhancer against long-tail for aesthetics-oriented models

    Liu, L., He, S., Ming, A., Xie, R., and Ma, H. Elta: An enhancer against long-tail for aesthetics-oriented models. In International Conference on Machine Learning, 2024

  59. [59]

    Exploratory undersampling for class-imbalance learning

    Liu, X.-Y., Wu, J., and Zhou, Z.-H. Exploratory undersampling for class-imbalance learning. IEEE Transactions on Systems, Man, and Cybernetics, 39 0 (2): 0 539--550, 2008

  60. [60]

    Liu, Z., Miao, Z., Zhan, X., Wang, J., Gong, B., and Yu, S. X. Large-scale long-tailed recognition in an open world. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp.\ 2537--2546, 2019

  61. [61]

    Decoupled Weight Decay Regularization

    Loshchilov, I. and Hutter, F. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017

  62. [62]

    Learning adversarially fair and transferable representations

    Madras, D., Creager, E., Pitassi, T., and Zemel, R. Learning adversarially fair and transferable representations. arXiv preprint arXiv:1802.06309, 2018

  63. [63]

    Theory and Algorithms for Learning with Multi-Class Abstention and Multi-Expert Deferral

    Mao, A. Theory and Algorithms for Learning with Multi-Class Abstention and Multi-Expert Deferral. PhD thesis, New York University, 2025

  64. [64]

    Two-stage learning to defer with multiple experts

    Mao, A., Mohri, C., Mohri, M., and Zhong, Y. Two-stage learning to defer with multiple experts. In Advances in Neural Information Processing Systems, 2023 a

  65. [65]

    H -consistency bounds: Characterization and extensions

    Mao, A., Mohri, M., and Zhong, Y. H -consistency bounds: Characterization and extensions. In Advances in Neural Information Processing Systems, 2023 b

  66. [66]

    H -consistency bounds for pairwise misranking loss surrogates

    Mao, A., Mohri, M., and Zhong, Y. H -consistency bounds for pairwise misranking loss surrogates. In International Conference on Machine learning, 2023 c

  67. [67]

    Ranking with abstention

    Mao, A., Mohri, M., and Zhong, Y. Ranking with abstention. In ICML 2023 Workshop The Many Facets of Preference-Based Learning, 2023 d

  68. [68]

    Cross-entropy loss functions: Theoretical analysis and applications

    Mao, A., Mohri, M., and Zhong, Y. Cross-entropy loss functions: Theoretical analysis and applications. In International Conference on Machine Learning, 2023 e

  69. [69]

    Structured prediction with stronger consistency guarantees

    Mao, A., Mohri, M., and Zhong, Y. Structured prediction with stronger consistency guarantees. In Advances in Neural Information Processing Systems, pp.\ 46903--46937, 2023 f

  70. [70]

    Principled approaches for learning to defer with multiple experts

    Mao, A., Mohri, M., and Zhong, Y. Principled approaches for learning to defer with multiple experts. In International Symposium on Artificial Intelligence and Mathematics, 2024 a

  71. [71]

    Predictor-rejector multi-class abstention: Theoretical analysis and algorithms

    Mao, A., Mohri, M., and Zhong, Y. Predictor-rejector multi-class abstention: Theoretical analysis and algorithms. In International Conference on Algorithmic Learning Theory, pp.\ 822--867, 2024 b

  72. [72]

    Theoretically grounded loss functions and algorithms for score-based multi-class abstention

    Mao, A., Mohri, M., and Zhong, Y. Theoretically grounded loss functions and algorithms for score-based multi-class abstention. In International Conference on Artificial Intelligence and Statistics, pp.\ 4753--4761, 2024 c

  73. [73]

    H -consistency guarantees for regression

    Mao, A., Mohri, M., and Zhong, Y. H -consistency guarantees for regression. In International Conference on Machine Learning, pp.\ 34712--34737, 2024 d

  74. [74]

    Multi-label learning with stronger consistency guarantees

    Mao, A., Mohri, M., and Zhong, Y. Multi-label learning with stronger consistency guarantees. In Advances in Neural Information Processing Systems, 2024 e

  75. [75]

    Regression with multi-expert deferral

    Mao, A., Mohri, M., and Zhong, Y. Regression with multi-expert deferral. In International Conference on Machine Learning, pp.\ 34738--34759, 2024 f

  76. [76]

    A universal growth rate for learning with smooth surrogate losses

    Mao, A., Mohri, M., and Zhong, Y. A universal growth rate for learning with smooth surrogate losses. In Advances in Neural Information Processing Systems, 2024 g

  77. [77]

    Realizable H -consistent and B ayes-consistent loss functions for learning to defer

    Mao, A., Mohri, M., and Zhong, Y. Realizable H -consistent and B ayes-consistent loss functions for learning to defer. In Advances in Neural Information Processing Systems, 2024 h

  78. [78]

    Mastering multiple-expert routing: Realizable H -consistency and strong guarantees for learning to defer

    Mao, A., Mohri, M., and Zhong, Y. Mastering multiple-expert routing: Realizable H -consistency and strong guarantees for learning to defer. In International Conference on Machine Learning, 2025 a

  79. [79]

    Principled algorithms for optimizing generalized metrics in binary classification

    Mao, A., Mohri, M., and Zhong, Y. Principled algorithms for optimizing generalized metrics in binary classification. In International Conference on Machine Learning, 2025 b

  80. [80]

    Enhanced -consistency bounds

    Mao, A., Mohri, M., and Zhong, Y. Enhanced -consistency bounds. In International Conference on Algorithmic Learning Theory, 2025 c

Showing first 80 references.