arxiv: 2605.08024 · v1 · submitted 2026-05-08 · 💻 cs.AI

MPD²-Router: Mask-aware Multi-expert Prior-regularized Dual-head Deferral Router in Glaucoma Screening and Diagnosis

Wenxin Zhan

Pith reviewed 2026-05-11 02:40 UTC · model grok-4.3

classification 💻 cs.AI

keywords glaucoma screening · learning-to-defer · multi-expert allocation · deferral policy · clinical cost optimization · uncertainty fusion · Pareto-optimal routing · cross-domain robustness

The pith

A multi-expert deferral router for glaucoma screening lowers clinical cost and raises the Matthews correlation coefficient (MCC) relative to AI-only decisions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops MPD²-Router to handle the decision of whether and to whom to defer glaucoma cases in screening, accounting for expert availability, case difficulty, and asymmetric risks of errors. It integrates multiple signals into a dual-head policy trained with cost-sensitive losses and regularizers that encourage balanced expert use without collapse. Evaluated on three cross-national datasets using a fixed backbone model, it demonstrates reduced overall clinical costs and better MCC at moderate deferral levels while remaining robust to domain shifts. This setup addresses real-world constraints in deploying AI for medical diagnosis where human experts have limited capacity and varying expertise.
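For readers less familiar with the metric, the Matthews correlation coefficient can be computed directly from the four cells of a binary confusion matrix. The sketch below is a generic illustration, not code from the paper; the function name `mcc` and the counts in the usage lines are hypothetical.

```python
import math


def mcc(tp: int, fp: int, fn: int, tn: int) -> float:
    """Matthews correlation coefficient from binary confusion counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        # Common convention: return 0 when any marginal count is empty.
        return 0.0
    return (tp * tn - fp * fn) / denom


# Hypothetical counts, chosen only to show the call; not results from the paper.
ai_only = mcc(tp=30, fp=20, fn=10, tn=140)
routed = mcc(tp=36, fp=10, fn=4, tn=150)
print(f"AI-only MCC: {ai_only:.3f}, routed MCC: {routed:.3f}")
```

MCC ranges from -1 to 1 and uses all four cells of the confusion matrix, which is why it is often preferred over accuracy on imbalanced screening data.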

Core claim

MPD²-Router recasts ophthalmic triage as constrained human-AI routing by coupling a dual-head deferral/allocation policy with mask-aware Gumbel-sigmoid gating that strictly enforces per-sample availability, fusing uncertainty, morphology, image-quality, and OOD signals. Training employs an asymmetric cost-sensitive objective with an augmented-Lagrangian deferral budget, a group-specific distribution prior, and a rank-majorization JS regularizer that jointly prevent expert collapse without forcing uniform allocation. Across three cross-national glaucoma cohorts with a frozen REFUGE-trained backbone, it substantially lowers clinical cost and improves MCC over AI-only at a moderate deferral rate.
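One standard way to impose a deferral budget of this kind is an augmented-Lagrangian penalty on the batch deferral rate. The sketch below uses the textbook form for an inequality constraint (deferral rate at most `budget`). The paper's exact objective is not given on this page, so the function names, `rho`, and the update schedule are all assumptions.

```python
def al_penalty(defer_rate: float, budget: float, lam: float, rho: float) -> float:
    """Augmented-Lagrangian term for the constraint defer_rate - budget <= 0.

    Textbook form: (max(0, lam + rho * g)^2 - lam^2) / (2 * rho), with g the
    constraint violation. Not necessarily the paper's exact formulation.
    """
    g = defer_rate - budget
    shifted = max(0.0, lam + rho * g)
    return (shifted ** 2 - lam ** 2) / (2.0 * rho)


def update_multiplier(defer_rate: float, budget: float, lam: float, rho: float) -> float:
    """Dual ascent step, projected so the multiplier stays non-negative."""
    return max(0.0, lam + rho * (defer_rate - budget))


# Hypothetical sequence of batch deferral rates drifting toward a 30% budget.
lam = 0.0
for step, rate in enumerate([0.45, 0.40, 0.33, 0.29]):
    penalty = al_penalty(rate, budget=0.30, lam=lam, rho=10.0)
    lam = update_multiplier(rate, budget=0.30, lam=lam, rho=10.0)
    print(step, round(penalty, 4), round(lam, 4))
```

In a training loop the penalty would be added to the cost-sensitive loss and the multiplier updated once per epoch or batch; which of the two the authors use is not stated here.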

What carries the argument

The mask-aware multi-expert prior-regularized dual-head deferral router, which uses Gumbel-sigmoid gating to enforce expert availability per sample and regularizers to maintain balanced allocation while optimizing an asymmetric cost function.
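The gating idea can be illustrated with a small sketch: draw a relaxed Bernoulli (Gumbel-sigmoid) gate per expert, multiply by the availability mask so unavailable experts receive exactly zero, then renormalise over the remaining experts. This is a minimal stdlib-only illustration under stated assumptions, not the authors' implementation; `masked_allocation`, the temperature `tau`, and the renormalisation step are hypothetical choices.

```python
import math
import random


def _sigmoid(x: float) -> float:
    # Numerically stable logistic function.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def gumbel_sigmoid(logit: float, tau: float, rng: random.Random) -> float:
    """Relaxed Bernoulli sample: sigmoid((logit + logistic noise) / tau)."""
    u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
    noise = math.log(u) - math.log(1.0 - u)  # difference of two Gumbel(0, 1) draws
    return _sigmoid((logit + noise) / tau)


def masked_allocation(logits, mask, tau=0.5, rng=None):
    """Gate each expert, zero out unavailable ones, renormalise over the rest."""
    if rng is None:
        rng = random.Random(0)
    gates = [m * gumbel_sigmoid(l, tau, rng) for l, m in zip(logits, mask)]
    total = sum(gates)
    if total == 0.0:
        return gates  # no expert available: the case cannot be deferred
    return [g / total for g in gates]


# Expert 1 is unavailable for this sample, so its share is exactly zero.
print(masked_allocation([2.0, -1.0, 0.5], mask=[1, 0, 1]))
```

Multiplying by the mask after the gate, rather than adding a large negative constant to the logit, is what makes the constraint hard instead of approximate in this sketch.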

Load-bearing premise

The fused uncertainty, morphology, image-quality, and OOD signals combined with the group-specific prior and rank-majorization regularizer produce accurate per-sample deferral decisions and balanced expert allocation even under real-world availability patterns not seen in the training cohorts.

What would settle it

A deployment on a new cross-national glaucoma cohort where expert availability patterns differ markedly from the studied ones, resulting in either higher clinical costs than AI-only or severely imbalanced expert utilization.

Figures

Figures reproduced from arXiv: 2605.08024 by Wenxin Zhan.

**Figure 2.** Overview of MPD²-Router. A retinal image is processed by three complementary branches: a frozen AI classifier for diagnostic logits, a glaucoma segmentation model for structural biomarkers such as vCDR and aCDR, and risk-feature extractors for uncertainty, image quality, and OOD signals. These signals are fused into an embedding $h_i$, which is passed to a dual-head router. The defer head estimates the prob…

**Figure 3.** Retinal-region cropping used before similarity-based retrieval.

**Figure 4.** F1 versus total, deferral, and clinical cost.

**Figure 5.** MCC versus total, deferral, and clinical cost.

**Figure 6.** Matched-subset F1 comparison between individual human experts and MPD²-Router. Each row evaluates one expert and MPD²-Router on the same cases where that expert is available; n denotes the matched availability size, and gray lines show the F1 gap.

**Figure 7.** Spatial performance map over the test distribution.

Original abstract

Learning-to-defer (L2D) can make glaucoma screening safer by routing difficult or uncertain cases to humans, yet standard formulations overlook expert availability, heterogeneous reader behavior, workload imbalance, asymmetric diagnostic harm, case difficulty from morphology, and deployment shift. We introduce MPD²-Router, a mask-aware multi-expert deferral framework that recasts ophthalmic triage as constrained human-AI routing: whether to defer and to which available expert. It couples a dual-head deferral/allocation policy with mask-aware Gumbel-sigmoid gating that strictly enforces per-sample availability, and fuses uncertainty, morphology, image-quality, and OOD signals. Training uses an asymmetric cost-sensitive objective with an augmented-Lagrangian deferral budget, a group-specific distribution prior, and a rank-majorization JS regularizer that jointly prevent expert collapse without forcing uniform allocation. Across three cross-national glaucoma cohorts (REFUGE, CHAKSU, ORIGA) with a frozen REFUGE-trained backbone, MPD²-Router substantially lowers clinical cost and improves MCC over AI-only at a moderate deferral rate. It is Pareto-optimal in F1-MCC-cost, robust under cross-domain shift, and yields balanced expert utilization.
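A Pareto-optimality claim over F1, MCC, and cost can be checked mechanically once per-method metrics are tabulated: a method is on the front if no other method is at least as good on all three axes and strictly better on one. The sketch below shows that check. The method names and numbers are invented placeholders, not values reported by the paper.

```python
def dominates(a, b):
    """True if a is no worse than b on every axis and strictly better on one.

    Each point is (f1, mcc, cost): higher F1 and MCC are better, lower cost is better.
    """
    no_worse = a[0] >= b[0] and a[1] >= b[1] and a[2] <= b[2]
    strictly_better = a[0] > b[0] or a[1] > b[1] or a[2] < b[2]
    return no_worse and strictly_better


def pareto_front(points):
    """Return the names of all methods that no other method dominates."""
    return [
        name
        for name, p in points.items()
        if not any(dominates(q, p) for other, q in points.items() if other != name)
    ]


# Placeholder metrics for illustration only.
methods = {
    "ai_only": (0.70, 0.50, 100.0),
    "router": (0.80, 0.60, 80.0),
    "defer_all": (0.90, 0.55, 120.0),
}
print(pareto_front(methods))
```

Note that being on the front is a weaker statement than being best on every axis: in the placeholder data two methods are non-dominated at once.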

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

The paper gives a concrete multi-expert deferral router for glaucoma screening that bakes in availability masks and regularizers to avoid collapse, but the performance claims rest on unspecified experiments.

The main takeaway is a routing system that treats deferral as a constrained allocation problem across available experts rather than a simple AI-vs-human choice. It combines mask-aware Gumbel-sigmoid gating, a dual-head policy, an augmented-Lagrangian budget, a group-specific prior, and a rank-majorization JS regularizer to keep utilization balanced while respecting per-sample availability and asymmetric costs. That specific bundle is not standard in the L2D literature it cites, so the architecture itself is the clearest addition. It also fuses uncertainty, morphology, image quality, and OOD signals in one forward pass, which is a reasonable engineering choice for ophthalmic triage. The three-cohort setup with a frozen backbone and cross-domain testing shows the authors are thinking about deployment shift, which is more than many medical AI papers do.

The soft spots sit in the evaluation. The abstract states gains in cost and MCC, Pareto optimality, and robustness, yet supplies no numeric deltas, baseline details, statistical tests, or ablation numbers. Without those, it is difficult to judge whether the regularizers actually deliver the claimed balance or whether the gains survive realistic availability patterns. The stress-test worry about synthetic or fixed masks not matching real clinical schedules is still live; if expert availability correlates with case difficulty or image quality in ways the cohorts do not capture, the balanced-utilization result could shrink.

This work is aimed at researchers building constrained deferral systems for medical imaging or clinic triage, not at theorists looking for general L2D advances. A reader who needs practical handling of expert workload and availability constraints will find usable pieces. It deserves a serious referee because the problem framing is grounded and the architecture is explicit, even though the current evidence is thin. I would send it to review with a request for full tables, ablations, and sensitivity checks on the availability masks.

Referee Report

2 major / 2 minor

Summary. The paper introduces MPD²-Router, a mask-aware multi-expert prior-regularized dual-head deferral router for glaucoma screening and diagnosis. It recasts ophthalmic triage as constrained human-AI routing using a dual-head deferral/allocation policy with mask-aware Gumbel-sigmoid gating to strictly enforce per-sample expert availability, fusing uncertainty, morphology, image-quality, and OOD signals. Training uses an asymmetric cost-sensitive objective with augmented-Lagrangian deferral budget, a group-specific distribution prior, and a rank-majorization JS regularizer to prevent expert collapse without forcing uniform allocation. On three cross-national cohorts (REFUGE, CHAKSU, ORIGA) with a frozen REFUGE-trained backbone, it claims to substantially lower clinical cost and improve MCC over AI-only at moderate deferral rates, while being Pareto-optimal in F1-MCC-cost, robust under cross-domain shift, and yielding balanced expert utilization.

Significance. If the results hold under rigorous verification, the work has moderate significance for learning-to-defer methods in medical AI by addressing multiple practical gaps including expert availability, heterogeneous reader behavior, asymmetric diagnostic harm, and workload imbalance. The mask-aware gating, multi-signal fusion, and regularizers represent thoughtful engineering to avoid collapse while respecting constraints. Credit is due for the comprehensive framework that models real deployment issues more explicitly than standard L2D formulations, potentially aiding safer screening systems if the empirical gains prove reproducible and generalizable.

major comments (2)

[Abstract] Abstract: The central performance claims (substantial cost reduction, MCC improvement, Pareto-optimality in F1-MCC-cost, robustness, balanced utilization) are stated without any numeric deltas, baseline comparisons, statistical tests, ablation results, or specific deferral rates. This absence makes the empirical contribution impossible to assess for magnitude or reliability from the abstract alone, which is load-bearing for the paper's primary assertion of practical superiority.
[Evaluation on cohorts] Evaluation setup (cohorts and masks): The balanced expert utilization and robustness claims rest on the mask-aware Gumbel-sigmoid gating plus group-specific prior and rank-majorization JS regularizer. However, the three cohorts likely rely on fixed or synthetic per-sample availability masks; no evidence is provided that these capture real-world correlations between availability and case difficulty/image quality/temporal factors. If such correlations exist, the reported allocation balance and cost reductions could fail to hold, directly undermining the cross-domain and deployment claims.

minor comments (2)

[Abstract] The title and abstract introduce MPD²-Router without immediately expanding the acronym or clarifying the dual-head and mask-aware components for readers unfamiliar with the subfield.
[Methods] Notation for the regularizer (rank-majorization JS) and augmented-Lagrangian terms could be clarified with explicit equations or pseudocode in the methods to aid reproducibility.
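As an indication of the kind of pseudocode this comment asks for, the sketch below computes a Jensen-Shannon divergence between an expert-allocation distribution and a reference prior, with an optional variant that sorts both vectors first so that only the shape of the allocation is penalised. Reading "rank" as sorting is an assumption made here for illustration; the paper's actual rank-majorization regularizer may differ, and `rank_js` and the example vectors are hypothetical.

```python
import math


def _kl(p, q):
    # KL(p || q); terms with p_i = 0 contribute nothing.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence (natural log), bounded above by log 2."""
    m = [(pi + qi) / 2.0 for pi, qi in zip(p, q)]
    return 0.5 * _kl(p, m) + 0.5 * _kl(q, m)


def rank_js(allocation, prior):
    """JS divergence after sorting both vectors in descending order,
    so only the shape of the allocation matters, not expert identity."""
    return js_divergence(sorted(allocation, reverse=True),
                         sorted(prior, reverse=True))


prior = [0.5, 0.3, 0.2]          # hypothetical reference profile
balanced = [0.25, 0.45, 0.30]    # close to the prior's shape, different expert order
collapsed = [0.98, 0.01, 0.01]   # nearly all mass on one expert
print(rank_js(balanced, prior), rank_js(collapsed, prior))
```

A penalty of this shape discourages collapse onto one expert while leaving a non-uniform reference profile unpenalised, which matches the stated goal of avoiding collapse without forcing uniform allocation.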

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive report. We address each major comment point by point below, indicating planned revisions where appropriate.

Referee: [Abstract] Abstract: The central performance claims (substantial cost reduction, MCC improvement, Pareto-optimality in F1-MCC-cost, robustness, balanced utilization) are stated without any numeric deltas, baseline comparisons, statistical tests, ablation results, or specific deferral rates. This absence makes the empirical contribution impossible to assess for magnitude or reliability from the abstract alone, which is load-bearing for the paper's primary assertion of practical superiority.

Authors: We agree that the abstract would benefit from greater specificity to allow readers to gauge the scale of improvements. In the revised manuscript we will incorporate key quantitative results, including approximate percentage reductions in clinical cost, MCC gains at moderate deferral rates (e.g., 20-30%), and references to statistical significance and Pareto-optimality, while respecting abstract length limits.

revision: yes

Referee: [Evaluation on cohorts] Evaluation setup (cohorts and masks): The balanced expert utilization and robustness claims rest on the mask-aware Gumbel-sigmoid gating plus group-specific prior and rank-majorization JS regularizer. However, the three cohorts likely rely on fixed or synthetic per-sample availability masks; no evidence is provided that these capture real-world correlations between availability and case difficulty/image quality/temporal factors. If such correlations exist, the reported allocation balance and cost reductions could fail to hold, directly undermining the cross-domain and deployment claims.

Authors: The availability masks are synthetically generated from dataset-provided expert group distributions to enforce realistic per-sample constraints. We acknowledge that public cohorts lack explicit real-world availability logs annotated with difficulty or quality correlations, so direct validation of such correlations is not possible with existing data. The mask-aware gating is deliberately general and accepts arbitrary masks; we will add a dedicated limitations paragraph and sensitivity experiments with artificially correlated masks in the revision to quantify robustness.

revision: partial
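A sensitivity experiment with artificially correlated masks could start from a generator like the one sketched below, which lowers each expert's availability probability as a per-case difficulty score rises. This is one hypothetical construction, not the authors' procedure; `sample_masks`, the linear adjustment, and the `corr` knob are assumptions.

```python
import random


def sample_masks(difficulties, base_avail, corr=0.0, seed=0):
    """Draw one 0/1 availability row per case.

    difficulties: per-case scores in [0, 1] (hypothetical difficulty proxy).
    base_avail:   per-expert marginal availability probabilities.
    corr:         0 gives masks independent of difficulty; positive values make
                  every expert less available on harder cases.
    """
    rng = random.Random(seed)
    masks = []
    for d in difficulties:
        row = []
        for p in base_avail:
            # Shift the probability linearly around d = 0.5, then clamp to [0, 1].
            p_adj = min(1.0, max(0.0, p - corr * (d - 0.5)))
            row.append(1 if rng.random() < p_adj else 0)
        masks.append(row)
    return masks


def mean_availability(masks):
    """Fraction of (case, expert) pairs marked available."""
    flat = [m for row in masks for m in row]
    return sum(flat) / len(flat)


# Compare availability on easy versus hard cases under a strong correlation.
easy = sample_masks([0.1] * 200, base_avail=[0.6, 0.5, 0.4], corr=0.8, seed=1)
hard = sample_masks([0.9] * 200, base_avail=[0.6, 0.5, 0.4], corr=0.8, seed=1)
print(mean_availability(easy), mean_availability(hard))
```

Sweeping `corr` from 0 upward and re-evaluating cost and utilization at each setting would show how quickly the balanced-allocation result degrades, if it does.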

Circularity Check

0 steps flagged

No significant circularity; empirical claims rest on independent evaluation

The paper introduces MPD²-Router as a novel architecture combining mask-aware Gumbel-sigmoid gating, dual-head policy, asymmetric cost-sensitive loss, augmented-Lagrangian budget, group-specific prior, and rank-majorization JS regularizer. These are presented as training mechanisms to enforce availability constraints and prevent collapse. Performance metrics (MCC, F1, cost, Pareto optimality) are reported from experiments on REFUGE/CHAKSU/ORIGA cohorts with frozen backbone; no equation or claim reduces the reported gains to a fitted parameter renamed as prediction, nor to a self-citation chain. The derivation chain is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

3 free parameters · 2 axioms · 0 invented entities

Abstract-only review; the framework rests on standard supervised learning assumptions plus several introduced training mechanisms whose parameters are not enumerated.

free parameters (3)

augmented-Lagrangian deferral budget: explicitly introduced to constrain the deferral rate; value chosen during training.
group-specific distribution prior parameters: used to prevent expert collapse; fitted or set per expert group.
rank-majorization JS regularizer strength: hyperparameter controlling the balance between experts.

axioms (2)

domain assumption: the expert availability mask is known and accurate per sample at inference time. Required for the mask-aware gating to enforce per-sample availability.
domain assumption: asymmetric diagnostic harm can be quantified into a cost matrix usable in the objective. Central to the cost-sensitive training.
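To make the second assumption concrete, the sketch below turns an asymmetric cost matrix into a per-case decision: given the classifier's glaucoma probability, it compares the expected cost of predicting negative, predicting positive, and deferring. The cost values are invented, and the flat deferral cost treats the expert as always correct, which is a simplification assumed here rather than something the paper states.

```python
def expected_costs(p_glaucoma, c_fp=1.0, c_fn=5.0, c_defer=0.4):
    """Expected cost of each action given P(glaucoma) for one case.

    c_fp, c_fn, c_defer are hypothetical; c_fn > c_fp encodes that a missed
    glaucoma case is assumed to be more harmful than a false alarm.
    """
    return {
        "predict_negative": p_glaucoma * c_fn,          # pay c_fn if disease is present
        "predict_positive": (1.0 - p_glaucoma) * c_fp,  # pay c_fp if disease is absent
        "defer": c_defer,                               # flat fee, expert assumed correct
    }


def best_action(p_glaucoma, **costs):
    """Pick the action with the lowest expected cost."""
    options = expected_costs(p_glaucoma, **costs)
    return min(options, key=options.get)


def positive_threshold(c_fp=1.0, c_fn=5.0):
    """Without deferral, predict positive when P(glaucoma) exceeds this value."""
    return c_fp / (c_fp + c_fn)


for p in (0.02, 0.50, 0.95):
    print(p, best_action(p))
```

With a false negative costed five times higher than a false positive, the no-deferral threshold drops from 0.5 to 1/6, which is the usual effect of asymmetric costs on a binary decision rule.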

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: asymmetric cost-sensitive objective with an augmented-Lagrangian deferral budget, a group-specific distribution prior, and a rank-majorization JS regularizer

IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · LogicNat · unclear
Paper passage: mask-aware Gumbel–sigmoid gating that strictly enforces per-sample availability

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
