Pith · machine review for the scientific record

arxiv: 2605.10521 · v1 · submitted 2026-05-11 · 💻 cs.CV · cs.AI

Recognition: 2 theorem links · Lean Theorem

DuetFair: Coupling Inter- and Intra-Subgroup Robustness for Fair Medical Image Segmentation

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 04:37 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords medical image segmentation · intra-group hidden failure · distributionally robust optimization · mixture of experts · subgroup fairness · worst-case performance · equity-scaled performance

The pith

DuetFair couples inter- and intra-subgroup robustness to reduce hidden failures in medical image segmentation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Medical image segmentation models often show uneven performance across patient subgroups, and standard fairness approaches that only equalize average subgroup scores can mask high-loss cases inside those subgroups. The paper introduces DuetFair as a dual-axis framework that simultaneously adapts across subgroups and strengthens robustness inside each one. FairDRO realizes this by pairing a distribution-aware mixture-of-experts with subgroup-conditioned distributionally robust optimization loss aggregation. If the mechanism works, segmentation models become more reliable for the hardest cases within demographic or clinical groups while preserving or improving equity across groups.
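The intra-group hidden failure at the center of the paper can be made concrete with a toy computation (illustrative numbers, not drawn from the paper): a subgroup whose mean Dice looks healthy can still contain a badly failed case.

```python
import statistics

# Hypothetical per-case Dice scores for two subgroups.
group_a = [0.88, 0.90, 0.87, 0.89, 0.91]  # uniformly good
group_b = [0.95, 0.94, 0.96, 0.45, 0.93]  # one hidden failure

for name, scores in [("A", group_a), ("B", group_b)]:
    # The subgroup means differ by only ~0.04, but the worst case in B
    # is far below anything seen in A.
    print(name, round(statistics.mean(scores), 3), min(scores))
```

An average-equalizing fairness method could bring the two subgroup means together while leaving the 0.45 case untouched; that residual tail is exactly what the intra-subgroup axis is meant to expose.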

Core claim

DuetFair is a dual-axis fairness framework that jointly considers inter-subgroup adaptation and intra-subgroup robustness. Implemented as FairDRO, it combines a distribution-aware mixture-of-experts with subgroup-conditioned distributionally robust optimization (DRO) loss aggregation. This design reduces intra-group hidden failures while maintaining inter-group equity, delivering the best equity-scaled performance on Harvard-FairSeg, improving worst-case subgroup performance on HAM10000, and lifting worst-group Dice by 3.5 points under tumor-stage grouping and 4.1 points under institution grouping on the 3D radiotherapy cohort.

What carries the argument

The DuetFair mechanism couples inter-subgroup adaptation with intra-subgroup robustness, realized through a distribution-aware mixture-of-experts and subgroup-conditioned DRO loss aggregation.
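The page does not spell out the aggregation, but one plausible reading of subgroup-conditioned DRO loss aggregation, in the spirit of tilted ERM and group DRO (both cited by the paper), is an exponential tilt applied within each subgroup before combining across subgroups. The sketch below is plain Python, and the temperature `eta` is a hypothetical knob, not a value from the paper:

```python
import math
from collections import defaultdict

def subgroup_dro_loss(per_sample_losses, subgroup_ids, eta=1.0):
    # Group per-sample losses by subgroup id.
    groups = defaultdict(list)
    for loss, g in zip(per_sample_losses, subgroup_ids):
        groups[g].append(loss)
    # Within each subgroup, aggregate with an exponential tilt:
    # (1/eta) * log mean(exp(eta * L)). This interpolates between the
    # subgroup mean (eta -> 0) and the subgroup max (eta -> inf), so
    # high-loss samples inside each subgroup are upweighted.
    group_losses = [
        math.log(sum(math.exp(eta * l) for l in ls) / len(ls)) / eta
        for ls in groups.values()
    ]
    # Plain averaging across subgroup-level losses here; in FairDRO the
    # inter-subgroup axis is instead carried by the dMoE routing.
    return sum(group_losses) / len(group_losses)
```

As `eta` grows the objective approaches worst-case-per-subgroup training; as it shrinks it recovers balanced subgroup averaging, which is how a single term can couple the two robustness axes.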

If this is right

  • Segmentation models can improve worst-case performance inside subgroups without sacrificing equity between subgroups.
  • The dual-axis approach yields the highest equity-scaled scores on Harvard-FairSeg.
  • Worst-case subgroup Dice improves under both age- and race-based groupings on HAM10000.
  • On 3D radiotherapy targets, worst-group Dice rises by 3.5 points under tumor-stage grouping and 4.1 points under institution grouping over the strongest baseline.
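As a sanity check on the quoted numbers: 3.5 Dice points at a 6.0% relative gain implies a baseline worst-group Dice near 3.5/0.060 ≈ 58, and 4.1 points at 7.4% implies roughly 55, so the absolute and relative figures are mutually consistent. The equity-scaled metric is not defined on this page; the sketch below assumes the fair error-bound scaling form associated with FairSeg (overall score discounted by summed subgroup deviations), which may differ in detail from the paper's definition:

```python
def worst_group_dice(dice_by_group):
    # Worst-case subgroup performance: the quantity reported as improving
    # on HAM10000 and the radiotherapy cohort.
    return min(dice_by_group.values())

def equity_scaled_dice(dice_by_group, overall_dice):
    # Assumed form of equity-scaled performance: the overall Dice shrunk
    # by the total absolute deviation of subgroups from the overall score.
    deviation = sum(abs(d - overall_dice) for d in dice_by_group.values())
    return overall_dice / (1.0 + deviation)

# Illustrative per-subgroup scores (not from the paper).
groups = {"stage_I": 0.86, "stage_II": 0.82, "stage_III": 0.62}
print(worst_group_dice(groups))                     # 0.62
print(round(equity_scaled_dice(groups, 0.80), 3))   # 0.635
```

Under this form, narrowing inter-group gaps raises the equity-scaled score even at fixed overall Dice, while the worst-group number is driven entirely by the weakest subgroup.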

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Single-axis fairness methods that ignore within-group variation are likely insufficient when medical data contain substantial internal heterogeneity.
  • The same dual-robustness pattern could be tested on other medical imaging tasks such as classification or detection where subgroup definitions are clinically meaningful.
  • Careful monitoring of routing behavior in the mixture-of-experts component would be needed to confirm that gains do not come from overfitting to fixed subgroup labels.
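On the last point, routing health in a dMoE can be monitored cheaply. A minimal sketch (illustrative, since the gating interface is not described on this page): track the entropy of the gate distribution, where a collapse toward zero entropy signals that samples have hard-locked onto single experts, consistent with overfitting to fixed subgroup labels.

```python
import math

def mean_routing_entropy(gate_probs):
    # gate_probs: list of per-sample routing distributions over experts,
    # each summing to 1. Entropy near log(n_experts) means soft, mixed
    # routing; entropy near 0 means one expert per sample.
    entropies = [
        -sum(p * math.log(p) for p in dist if p > 0)
        for dist in gate_probs
    ]
    return sum(entropies) / len(entropies)
```

Logged per epoch and per subgroup, a steady entropy collapse alongside improving subgroup metrics would be the warning sign the bullet above describes.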

Load-bearing premise

The combination of distribution-aware mixture-of-experts and subgroup-conditioned DRO loss aggregation will simultaneously reduce intra-group hidden failures and maintain or improve inter-group equity without introducing new optimization instabilities or overfitting to the chosen subgroup definitions.

What would settle it

A new medical segmentation dataset with high within-subgroup heterogeneity on which FairDRO shows no gain in worst-group Dice or equity-scaled metrics relative to standard DRO baselines, or exhibits training instability.

Figures

Figures reproduced from arXiv: 2605.10521 by Bo Zeng, Pengfei Jin, Quanzheng Li, Sangjoon Park, Yiqi Tian, Yujin Oh.

Figure 1
Figure 1. Overview. (a) The proposed DuetFair mechanism characterizes fairness-aware medical segmentation along two complementary axes: inter-subgroup heterogeneity and intra-subgroup variation. (b) FairDRO, designed under the guidance of DuetFair, combines subgroup-aware dMoE with a DRO loss applied within each subgroup, capturing inter-subgroup heterogeneity while emphasizing hard samples within each sub… view at source ↗
Figure 2
Figure 2. Per-subgroup Dice distribution for radiotherapy target segmentation, and its corresponding … view at source ↗
Original abstract

Medical image segmentation models can perform unevenly across subgroups. Most existing fairness methods focus on improving average subgroup performance, implicitly treating each subgroup as internally homogeneous. However, this can hide difficult cases within a subgroup, where high-loss samples are obscured by the subgroup mean. We call this problem intra-group hidden failure. To solve this, we propose the DuetFair mechanism, a dual-axis fairness framework that jointly considers inter-subgroup adaptation and intra-subgroup robustness. Based on DuetFair, we introduce FairDRO, which combines distribution-aware mixture-of-experts (dMoE) with subgroup-conditioned distributionally robust optimization (DRO) loss aggregation. This design allows the model to adapt across subgroups while also reducing hidden failures within each subgroup. We evaluate FairDRO on three medical image segmentation benchmarks with varying degrees of within-group heterogeneity. FairDRO achieves the best equity-scaled performance on Harvard-FairSeg and improves worst-case subgroup performance on HAM10000 under both age- and race-based grouping schemes. On the 3D radiotherapy target cohort, FairDRO further improves worst-group Dice by 3.5 points (↑6.0%) under the tumor-stage grouping and by 4.1 points (↑7.4%) under the institution grouping over the strongest baseline.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The manuscript proposes DuetFair, a dual-axis fairness framework for medical image segmentation that jointly addresses inter-subgroup performance disparities and intra-group hidden failures (high-loss samples obscured by subgroup averages). It introduces FairDRO, which combines distribution-aware mixture-of-experts (dMoE) with subgroup-conditioned distributionally robust optimization (DRO) loss aggregation. Evaluations on Harvard-FairSeg, HAM10000 (age- and race-based groupings), and a 3D radiotherapy cohort report that FairDRO achieves the best equity-scaled performance on the first benchmark and improves worst-group Dice by 3.5 points (↑6.0%) under tumor-stage grouping and 4.1 points (↑7.4%) under institution grouping on the third, over the strongest baseline.

Significance. If the central claims hold, the work would advance fairness research in medical imaging by explicitly targeting intra-subgroup heterogeneity that standard worst-group or average-subgroup methods overlook. The multi-benchmark evaluation with concrete worst-case and equity metrics is a positive feature. However, the significance is limited by the absence of direct evidence that intra-group hidden failures were reduced, which is load-bearing for the DuetFair coupling claim.

major comments (2)
  1. [Abstract] Abstract: The central claim is that FairDRO jointly reduces intra-group hidden failures (via dMoE + subgroup-conditioned DRO) while preserving inter-group equity. However, all reported results address only inter-subgroup quantities (worst-group Dice on HAM10000 and the 3D cohort; equity-scaled performance on Harvard-FairSeg). No direct intra-subgroup metrics (within-subgroup loss variance, max-loss samples per subgroup, or ablation isolating hidden-failure reduction) are provided, so the intra-axis contribution and the coupling mechanism remain unverified.
  2. [Abstract] Abstract and experimental claims: Concrete numerical gains are stated (e.g., +3.5 Dice points, ↑6.0% and ↑7.4% on the 3D cohort) without accompanying details on statistical testing, error bars, number of runs, or ablation studies that isolate dMoE versus DRO contributions. This leaves the source of the reported improvements and the robustness of the performance claims unclear.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. The comments highlight important aspects of experimental validation and direct evidence for our claims, which we address below. We will revise the manuscript to incorporate additional analyses and details.

Point-by-point responses
  1. Referee: [Abstract] Abstract: The central claim is that FairDRO jointly reduces intra-group hidden failures (via dMoE + subgroup-conditioned DRO) while preserving inter-group equity. However, all reported results address only inter-subgroup quantities (worst-group Dice on HAM10000 and the 3D cohort; equity-scaled performance on Harvard-FairSeg). No direct intra-subgroup metrics (within-subgroup loss variance, max-loss samples per subgroup, or ablation isolating hidden-failure reduction) are provided, so the intra-axis contribution and the coupling mechanism remain unverified.

    Authors: We agree that the current results emphasize inter-subgroup metrics and that direct intra-subgroup evidence would more clearly substantiate the intra-axis contribution and the DuetFair coupling. While the subgroup-conditioned DRO component is designed to upweight high-loss samples within each subgroup (thereby targeting hidden failures), and the reported worst-group and equity-scaled gains are consistent with this effect, we acknowledge the absence of explicit intra-subgroup diagnostics in the presented evaluations. In the revision we will add: (1) within-subgroup loss variance and the fraction of max-loss samples per subgroup before/after FairDRO, (2) qualitative visualization of resolved hidden-failure cases, and (3) an ablation that isolates the DRO term's impact on intra-group variance while holding inter-group adaptation fixed. These additions will directly verify the intra-axis and the coupling mechanism. revision: yes

  2. Referee: [Abstract] Abstract and experimental claims: Concrete numerical gains are stated (e.g., +3.5 Dice points, ↑6.0% and ↑7.4% on the 3D cohort) without accompanying details on statistical testing, error bars, number of runs, or ablation studies that isolate dMoE versus DRO contributions. This leaves the source of the reported improvements and the robustness of the performance claims unclear.

    Authors: We concur that reporting statistical details and component-wise ablations is necessary to establish the robustness and source of the gains. The numerical improvements were obtained from our benchmark evaluations, yet the manuscript does not currently include multi-run statistics or isolated ablations. In the revised version we will: (i) report all key metrics as mean ± standard deviation over at least five independent random seeds, (ii) add error bars to tables and figures, (iii) include statistical significance tests (e.g., paired t-tests or Wilcoxon signed-rank) for the reported Dice improvements, and (iv) provide ablations that separately disable dMoE and the subgroup-conditioned DRO term to quantify each component's contribution. These changes will clarify the origin of the gains and strengthen the experimental claims. revision: yes

Circularity Check

0 steps flagged

Empirical claims on held-out test sets show no derivation-level circularity

full rationale

The paper introduces DuetFair/FairDRO as a combination of distribution-aware mixture-of-experts and subgroup-conditioned DRO, then reports concrete improvements (worst-group Dice, equity-scaled performance) measured on held-out test partitions of Harvard-FairSeg, HAM10000, and a 3D radiotherapy cohort. No equations, fitted parameters, or self-citations are presented that reduce the reported metrics to the inputs by construction; the performance numbers are external to the training objective. The absence of explicit intra-subgroup variance metrics is an evidence gap, not a circular reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entity

Based solely on the abstract, the central claim rests on standard deep learning assumptions plus the novel coupling of dMoE and subgroup-conditioned DRO. No explicit free parameters, axioms, or invented entities beyond the new fairness framing are detailed.

invented entities (1)
  • intra-group hidden failure (no independent evidence)
    purpose: To name the phenomenon where high-loss samples are obscured by subgroup-average performance metrics
    A new conceptual term introduced to motivate the intra-subgroup robustness axis.

pith-pipeline@v0.9.0 · 5557 in / 1304 out tokens · 36407 ms · 2026-05-12T04:37:29.698354+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

28 extracted references · 28 canonical work pages · 4 internal anchors

  1. [1] U-net: Convolutional networks for biomedical image segmentation
     Olaf Ronneberger, Philipp Fischer, and Thomas Brox. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 234–241. Springer, 2015.

  2. [2] Fairseg: A large-scale medical image segmentation dataset for fairness learning using segment anything model with fair error-bound scaling
     Yu Tian, Min Shi, Yan Luo, Ava Kouhana, Tobias Elze, and Mengyu Wang. In The Twelfth International Conference on Learning Representations, 2024.

  3. [3] LLM-driven multimodal target volume contouring in radiation oncology
     Yujin Oh, Sangjoon Park, Hwa Kyung Byun, Yeona Cho, Ik Jae Lee, Jin Sung Kim, and Jong Chul Ye. Nature Communications, 15(1):9186, 2024.

  4. [4] Customized segment anything model for medical image segmentation
     Kaidong Zhang and Dong Liu. arXiv preprint arXiv:2304.13785, 2023.

  5. [5] Sam-med3d-moe: Towards a non-forgetting segment anything model via mixture of experts for 3D medical image segmentation
     Guoan Wang, Jin Ye, Junlong Cheng, Tianbin Li, Zhaolin Chen, Jianfei Cai, Junjun He, and Bohan Zhuang. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 552–561. Springer, 2024.

  6. [6] Clinical evaluation of atlas- and deep learning-based automatic segmentation of multiple organs and clinical target volumes for breast cancer
     Min Seo Choi, Byeong Su Choi, Seung Yeun Chung, Nalee Kim, Jaehee Chun, Yong Bae Kim, Jee Suk Chang, and Jin Sung Kim. Radiotherapy and Oncology, 153:139–145, 2020.

  7. [7] Distribution-aware fairness learning in medical image segmentation from a control-theoretic perspective
     Yujin Oh, Pengfei Jin, Sangjoon Park, Sekeun Kim, Siyeop Yoon, Jin Sung Kim, Kyungsang Kim, Xiang Li, and Quanzheng Li. In Forty-second International Conference on Machine Learning, 2025.

  8. [8] A translational perspective towards clinical AI fairness
     Mingxuan Liu, Yilin Ning, Salinelat Teixayavong, Mayli Mertens, Jie Xu, Daniel Shu Wei Ting, Lionel Tim-Ee Cheng, Jasmine Chiat Ling Ong, Zhen Ling Teo, Ting Fang Tan, et al. NPJ Digital Medicine, 6(1):172, 2023.

  9. [9] Fairdomain: Achieving fairness in cross-domain medical image segmentation and classification
     Yu Tian, Congcong Wen, Min Shi, Muhammad Muneeb Afzal, Hao Huang, Muhammad Osama Khan, Yan Luo, Yi Fang, and Mengyu Wang. In European Conference on Computer Vision, pages 251–271. Springer, 2024.

  10. [10] Distributionally robust neural networks for group shifts: On the importance of regularization for worst-case generalization
      Shiori Sagawa, Pang Wei Koh, Tatsunori B Hashimoto, and Percy Liang. arXiv preprint arXiv:1911.08731, 2019.

  11. [11] Group distributionally robust optimization-driven reinforcement learning for LLM reasoning
      Kishan Panaganti, Zhenwen Liang, Wenhao Yu, Haitao Mi, and Dong Yu. 2026.

  12. [12] Fairdiff: Fair segmentation with point-image diffusion
      Wenyi Li, Haoran Xu, Guiyu Zhang, Huan-ang Gao, Mingju Gao, Mengyu Wang, and Hao Zhao. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 617–628. Springer, 2024.

  13. [13] Just train twice: Improving group robustness without training group information
      Evan Z Liu, Behzad Haghgoo, Annie S Chen, Aditi Raghunathan, Pang Wei Koh, Shiori Sagawa, Percy Liang, and Chelsea Finn. In International Conference on Machine Learning, pages 6781–6792. PMLR, 2021.

  14. [14] Simple data balancing achieves competitive worst-group-accuracy
      Badr Youbi Idrissi, Martin Arjovsky, Mohammad Pezeshki, and David Lopez-Paz. In Conference on Causal Learning and Reasoning, pages 336–351. PMLR, 2022.

  15. [15] No subclass left behind: Fine-grained robustness in coarse-grained classification problems
      Nimit Sohoni, Jared Dunnmon, Geoffrey Angus, Albert Gu, and Christopher Ré. Advances in Neural Information Processing Systems, 33:19339–19352, 2020.

  16. [16] Adaptive sampling for stochastic risk-averse learning
      Sebastian Curi, Kfir Y Levy, Stefanie Jegelka, and Andreas Krause. Advances in Neural Information Processing Systems, 33:1036–1047, 2020.

  17. [17] Tilted empirical risk minimization
      Tian Li, Ahmad Beirami, Maziar Sanjabi, and Virginia Smith. arXiv preprint arXiv:2007.01162, 2020.

  18. [18] Focal loss for dense object detection
      Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. In Proceedings of the IEEE International Conference on Computer Vision, pages 2980–2988, 2017.

  19. [19] Training region-based object detectors with online hard example mining
      Abhinav Shrivastava, Abhinav Gupta, and Ross Girshick. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 761–769, 2016.

  20. [20] The HAM10000 dataset, a large collection of multi-source dermatoscopic images of common pigmented skin lesions
      Philipp Tschandl, Cliff Rosendahl, and Harald Kittler. Scientific Data, 5(1):1–9, 2018.

  21. [21] Multi-expert distributionally robust optimization for out-of-distribution generalization
      Jinyong Jeong, Hyungu Kahng, and Seoung Bum Kim. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2026.

  22. [22] Mixture of multicenter experts in multimodal generative AI for advanced radiotherapy target delineation
      Yujin Oh, Sangjoon Park, Xiang Li, Wang Yi, Jonathan Paly, Jason Efstathiou, Annie Chan, Jun Won Kim, Hwa Kyung Byun, Ik Jae Lee, et al. arXiv preprint arXiv:2410.00046, 2024.

  23. [23] TransUNet: Transformers make strong encoders for medical image segmentation
      Jieneng Chen, Yongyi Lu, Qihang Yu, Xiangde Luo, Ehsan Adeli, Yan Wang, Le Lu, Alan L Yuille, and Yuyin Zhou. arXiv preprint arXiv:2102.04306, 2021.

  24. [24] 3D U-Net: Learning dense volumetric segmentation from sparse annotation
      Özgün Çiçek, Ahmed Abdulkadir, Soeren S Lienkamp, Thomas Brox, and Olaf Ronneberger. In Medical Image Computing and Computer-Assisted Intervention, pages 424–432. Springer, 2016.

  25. [25] PyTorch: An imperative style, high-performance deep learning library
      Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. Advances in Neural Information Processing Systems, 32, 2019.

  26. [26] Decoupled weight decay regularization
      Ilya Loshchilov and Frank Hutter. arXiv preprint arXiv:1711.05101, 2017.

  27. [27] Learning adversarially fair and transferable representations
      David Madras, Elliot Creager, Toniann Pitassi, and Richard Zemel. In International Conference on Machine Learning, pages 3384–3393. PMLR, 2018.

  28. [28] Outrageously large neural networks: The sparsely-gated mixture-of-experts layer
      Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean. arXiv preprint arXiv:1701.06538, 2017.