
arxiv: 2605.20276 · v1 · pith:C6GNUVHBnew · submitted 2026-05-19 · 💻 cs.LG

OmniISR: A Unified Framework for Centralized and Federated Learning via Intermediate Supervision and Regularization

Pith reviewed 2026-05-21 08:37 UTC · model grok-4.3

classification 💻 cs.LG
keywords federated learning · centralized learning · intermediate supervision · mutual information · negative entropy · unified framework · convergence bounds · client drift

The pith

OmniISR unifies centralized, federated, and hybrid learning by adding mutual-information supervision and negative-entropy regularization at hidden layers.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to resolve the incompatibility between centralized learning, which pools data but encounters internal covariate shift, and federated learning, which preserves data locality but produces drifting local updates. It does so by introducing intermediate supervision via mutual information to align representations and negative-entropy regularization to preserve representational uncertainty, both applied at multiple hidden layers. This construction yields a single non-asymptotic convergence bound of order O(1/sqrt(T)) in the number of training steps T that remains valid across all three training modes. The reported experiments narrow the gap between centralized and federated performance by 22.60 percent and win 37 of 48 paired metric comparisons.

Core claim

OmniISR fuses pure CL, pure FL, and hybrid CL-FL training modes by attaching intermediate supervision and regularization (ISR) signals to multiple hidden layers. The paper claims a unified ISR-agnostic O(1/sqrt(T)) convergence bound, a federated drift bound, a gradient-alignment guarantee, and an explicit escape-time bound, and reports a 22.60% reduction in the CL-FL gap and 37/48 paired metric wins.

What carries the argument

intermediate supervision and regularization (ISR) using mutual-information supervision to align representations and negative-entropy regularization to penalize overconfidence at hidden layers
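As a rough illustration of what such hidden-layer signals could look like, the NumPy sketch below scores a hypothetical auxiliary classifier output at each intermediate point with a Barber-Agakov variational lower bound on I(Z;Y) and adds a negative-entropy penalty. The function names, the weights `lam_mi` and `lam_ne`, and the idea of reading class logits off each hidden layer are assumptions made for this example; the paper's own estimator and loss weighting may differ.

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over the last axis.
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def negative_entropy(probs, eps=1e-12):
    # Batch mean of sum_k p_k log p_k. It is near 0 for one-hot
    # (overconfident) predictions and -log(K) for uniform ones, so
    # adding it to a loss penalizes overconfidence.
    return float(np.mean(np.sum(probs * np.log(probs + eps), axis=-1)))

def mi_lower_bound(probs, labels, num_classes, eps=1e-12):
    # Barber-Agakov bound: I(Z;Y) >= H(Y) + E[log q(y|z)], estimated on
    # one minibatch. `probs` plays the role of q(y|z) from a hypothetical
    # auxiliary head attached to a hidden layer.
    prior = np.bincount(labels, minlength=num_classes) / len(labels)
    nonzero = prior[prior > 0]
    h_y = -np.sum(nonzero * np.log(nonzero))
    log_q = np.log(probs[np.arange(len(labels)), labels] + eps)
    return float(h_y + log_q.mean())

def isr_loss(hidden_logits, labels, num_classes, lam_mi=0.1, lam_ne=0.01):
    # Sum over intermediate points: reward the MI bound, penalize
    # negative entropy. The weights are illustrative assumptions.
    total = 0.0
    for logits in hidden_logits:
        p = softmax(logits)
        total += -lam_mi * mi_lower_bound(p, labels, num_classes)
        total += lam_ne * negative_entropy(p)
    return total

labels = np.array([0, 1, 2, 3])
rng = np.random.default_rng(0)
hidden_logits = [rng.normal(size=(4, 4)) for _ in range(3)]  # three intermediate points
print(isr_loss(hidden_logits, labels, num_classes=4))
```

In a real training loop this term would be added to the task loss and differentiated with an autodiff framework; the sketch only shows how the two signals pull in different directions (sharper, label-aligned predictions versus preserved uncertainty).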

Load-bearing premise

Mutual information supervision and negative entropy regularization can be applied at hidden layers without creating new optimization conflicts or needing dataset-specific retuning that would break the non-asymptotic bounds.

What would settle it

A direct measurement of the convergence rate on a standard benchmark like CIFAR-10 when applying the ISR signals across CL, FL, and hybrid modes, checking if it stays within the claimed O(1/sqrt(T)) or if the CL-FL gap reduction disappears.
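One simple way to run such a check is to fit the decay exponent on a log-log scale. The sketch below assumes the logged quantity is whatever stationarity measure the theorem bounds (for example a running average of the squared gradient norm; that choice is an assumption here) and uses a synthetic curve in place of real training logs. The helper name `fit_rate_exponent` is hypothetical.

```python
import numpy as np

def fit_rate_exponent(steps, values):
    # Least-squares fit of log(values) = intercept + slope * log(steps).
    # A curve decaying like C / sqrt(T) has slope -0.5.
    log_t = np.log(np.asarray(steps, dtype=float))
    log_v = np.log(np.asarray(values, dtype=float))
    slope, _intercept = np.polyfit(log_t, log_v, 1)
    return float(slope)

# Synthetic stand-in for a logged metric; replace with measurements
# from CL, FL, and hybrid runs.
steps = np.array([100, 200, 400, 800, 1600, 3200])
synthetic = 3.0 / np.sqrt(steps)
print(fit_rate_exponent(steps, synthetic))  # close to -0.5 for this synthetic curve
```

Because O(1/sqrt(T)) is an upper bound, a fitted exponent at or below -0.5 is consistent with the claim, while a persistently shallower slope in any of the three modes would count against it.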

Figures

Figures reproduced from arXiv: 2605.20276 by Chen Zhang, Guangxu Zhu, Lei Zhou, Lisheng Wu, Ming Tang, Wei-Bin Kou, Yujiu Yang.

Figure 1: Overview of the mechanism of intermediate supervision and regularization in the proposed OmniISR framework.
Figure 2: Illustration of the three modes of the proposed OmniISR framework.
Figure 3: Illustration of benefits of ISR mechanism, supposing three intermediate points within the model.
Figure 4: Illustration of why the proposed OmniISR framework can reduce client drift in federated setting, taking three clients in this toy example.
Figure 5: Illustration of how OmniISR helps to escape saddle point.
Figure 6: Practical escape time comparison between OmniISR-enabled
Figure 7: Evaluation of the time of escape saddle point for the proposed OmniISR framework.
Figure 8: The impact of the number of intermediate points on OmniISR’s training performance.
Figure 9: The impact of the distance between adjacent intermediate points on OmniISR’s training performance.
Original abstract

The global deployment of edge intelligence operates across heterogeneous legal frameworks. While some regions permit centralized learning (CL) via cloud data aggregation, others enforce strict data localization, necessitating federated learning (FL). This operational dichotomy introduces two incompatible optimization regimes (i.e., unbiased global gradients yet coupled with internal covariate shift in CL versus biased, drift-prone local updates in FL), resulting in that any naive integration of the two lacks rigorous theoretical guarantees. To fill this gap, we propose OmniISR, a unified framework that fuses pure CL, pure FL, and hybrid CL-FL training modes via equipping intermediate supervision and regularization (ISR) signals at multiple hidden layers. Specifically, we propose (i) to use mutual-information (MI) as intermediate supervision to align shifting internal covariate in CL and client-drifting representations in FL, and (ii) to adopt negative-entropy (NE) as intermediate regularizer to penalize overconfident prediction, preserve representational uncertainty, and avoid device-specific collapse. On the theory side, we derive (i) a unified, ISR-agnostic, and non-asymptotic O(1/sqrt(T)) convergence bound that shows the introduced ISR does not violate standard SGD convergence, (ii) a federated drift-bound that quantifies the ISR-reduced client drift, (iii) a gradient-alignment guarantee that ensures non-conflicting CL and FL updates under mild bias, and (iv) an explicit escape-time bound that indicates that CL-FL hybrid mixing enlarges effective stochasticity and accelerates escape from strict saddles. Extensive experiments demonstrate that OmniISR consistently improves model performance in both centralized and federated paradigms, reduces the CL-FL gap by 22.60%, and yields 37/48 paired metric wins across multiple FL algorithms.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes OmniISR, a unified framework that fuses pure centralized learning (CL), pure federated learning (FL), and hybrid CL-FL modes by applying mutual-information (MI) supervision and negative-entropy (NE) regularization signals at multiple hidden layers. It claims four theoretical results—an ISR-agnostic non-asymptotic O(1/sqrt(T)) convergence bound, a federated drift bound quantifying ISR-reduced client drift, a gradient-alignment guarantee under mild bias, and an explicit escape-time bound showing hybrid mixing accelerates saddle escape—while reporting a 22.60% reduction in the CL-FL gap and 37/48 paired metric wins across FL algorithms.

Significance. If the ISR-agnostic bounds hold without hidden layer- or estimator-dependent constants, the work would meaningfully advance unification of CL and FL under heterogeneous privacy constraints by supplying non-asymptotic guarantees that standard SGD analyses do not automatically extend to intermediate signals. The explicit drift and alignment results, together with the escape-time analysis, address practically relevant issues of covariate shift and client drift; the experimental wins provide supporting evidence of practical utility when the theoretical assumptions are satisfied.

major comments (2)
  1. §4 (Convergence Analysis), Theorem 1 (or equivalent ISR-agnostic bound): the claim that inserting MI supervision and NE regularization at hidden layers leaves the O(1/sqrt(T)) rate unchanged requires an explicit lemma bounding the variance of the MI gradient estimators independently of depth, representation dimension, and estimator choice (variational or contrastive). Without such a bound inside the descent lemma or the federated drift term, the non-asymptotic constants become architecture- and dataset-specific, undermining the central “ISR-agnostic” assertion.
  2. §5 (Federated Drift Bound): the drift-reduction claim is load-bearing for the hybrid-mode guarantee, yet the derivation does not appear to quantify how the additional stochastic gradients from multiple MI terms interact with local-update bias; if these terms are treated as fixed rather than jointly optimized, the bound may not remain valid under the same hyper-parameters used in the experiments.
minor comments (2)
  1. [Abstract] The abstract states a 22.60% CL-FL gap reduction and 37/48 wins but does not specify the exact baseline algorithm, metric, or whether error bars and statistical tests accompany the figure; adding these details would improve interpretability.
  2. [§3 (Method)] Notation for the intermediate supervision loss (MI term) and regularization strength should be introduced once with a clear dependence on layer index to avoid ambiguity when the same symbols appear in both CL and FL modes.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We sincerely thank the referee for the constructive and detailed feedback. The comments focus on strengthening the rigor of our theoretical claims, which we address directly below. We clarify the assumptions underlying our ISR-agnostic bounds and commit to revisions that make the analysis more explicit without altering the core contributions.

Point-by-point responses
  1. Referee: §4 (Convergence Analysis), Theorem 1 (or equivalent ISR-agnostic bound): the claim that inserting MI supervision and NE regularization at hidden layers leaves the O(1/sqrt(T)) rate unchanged requires an explicit lemma bounding the variance of the MI gradient estimators independently of depth, representation dimension, and estimator choice (variational or contrastive). Without such a bound inside the descent lemma or the federated drift term, the non-asymptotic constants become architecture- and dataset-specific, undermining the central “ISR-agnostic” assertion.

    Authors: We thank the referee for this precise observation. Theorem 1 establishes the O(1/sqrt(T)) rate by absorbing the ISR contributions into the standard SGD descent lemma under the assumption that the combined gradient (including MI and NE terms) satisfies bounded variance and smoothness conditions equivalent to those for the base loss. The ISR-agnostic property refers to the fact that the convergence rate itself is unaffected by the presence of ISR, with any additional constants folded into the generic O(·) notation. However, we agree that an explicit variance bound for the MI estimators would make the independence from depth and estimator choice fully transparent. In the revision we will insert a supporting lemma (placed immediately before the main descent argument) that bounds the variance of both variational and contrastive MI estimators in terms of batch size, representation Lipschitz constants, and a uniform bound on the MI estimate itself, showing that depth dependence enters only through these mild constants rather than altering the rate. This lemma will also be referenced in the federated drift term. revision: yes

  2. Referee: §5 (Federated Drift Bound): the drift-reduction claim is load-bearing for the hybrid-mode guarantee, yet the derivation does not appear to quantify how the additional stochastic gradients from multiple MI terms interact with local-update bias; if these terms are treated as fixed rather than jointly optimized, the bound may not remain valid under the same hyper-parameters used in the experiments.

    Authors: We appreciate the referee drawing attention to this interaction. The current drift bound in §5 models the MI supervision gradients as stochastic corrections that reduce client drift by aligning intermediate representations; the proof treats them as part of the effective local gradient rather than as fixed additive terms. To address the concern about joint optimization and bias under the experimental hyper-parameters, we will revise the derivation to explicitly decompose the local-update bias into the standard federated term plus an additional term arising from the stochastic MI gradients. Under the same weighting coefficients and step sizes used in the experiments, we will show that the MI-induced bias remains controlled by the same Lipschitz and bounded-gradient assumptions already stated in the paper, yielding a tightened drift bound that continues to hold for the hybrid CL-FL mode. The revised proof will include this decomposition as a separate proposition. revision: yes
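For readers unfamiliar with the local-update bias under discussion, the toy NumPy sketch below shows the basic phenomenon on two scalar quadratic clients: with one local gradient step per round, FedAvg settles at the global minimizer, but with more local steps its fixed point moves away from it. This is a generic illustration of client drift under assumed curvatures and centers; it does not model any ISR term and is not the paper's drift bound. The function name `fedavg_fixed_point` is hypothetical.

```python
import numpy as np

def fedavg_fixed_point(curvatures, centers, local_steps, lr=0.1, rounds=500):
    # Client i minimizes f_i(x) = 0.5 * a_i * (x - c_i)^2 with plain
    # gradient descent; the server averages the local iterates each round.
    a = np.asarray(curvatures, dtype=float)
    c = np.asarray(centers, dtype=float)
    x = 0.0
    for _ in range(rounds):
        local = np.full_like(c, x)          # every client starts from the server model
        for _ in range(local_steps):
            local -= lr * a * (local - c)   # one local gradient step per client
        x = local.mean()                    # FedAvg aggregation with equal weights
    return float(x)

a, c = [1.0, 4.0], [-1.0, 2.0]
# Minimizer of the average objective: sum(a_i * c_i) / sum(a_i).
x_star = float(np.dot(a, c) / np.sum(a))
for k in (1, 5, 10):
    gap = abs(fedavg_fixed_point(a, c, local_steps=k) - x_star)
    print(f"local_steps={k}: distance to global minimizer = {gap:.4f}")
```

The gap grows with the number of local steps because each client contracts toward its own minimizer at a rate set by its own curvature, which is the kind of heterogeneity-induced bias a drift bound has to control.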

Circularity Check

0 steps flagged

No circularity: ISR-agnostic bound derived from standard SGD analysis

full rationale

The paper derives a unified ISR-agnostic O(1/sqrt(T)) convergence bound, federated drift bound, gradient-alignment guarantee, and escape-time bound directly from standard SGD descent lemmas while showing that added MI supervision and NE regularization terms do not alter the rate. No equations reduce the bound to a fitted ISR strength or to a self-citation chain; the agnosticism claim is supported by absorbing extra stochastic terms under mild bias assumptions without layer-dependent constants being redefined by the same experimental hyperparameters. The derivation is therefore self-contained against external SGD benchmarks and does not rely on renaming or smuggling ansatzes via prior self-work.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The framework rests on standard SGD convergence assumptions plus the modeling choice that intermediate-layer MI and NE signals can be inserted without altering the underlying loss landscape in a way that violates the stated bounds.

axioms (1)
  • domain assumption: Standard non-asymptotic SGD convergence assumptions hold when ISR terms are added at hidden layers.
    Invoked to claim the O(1/sqrt(T)) bound remains valid.
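
For orientation, the textbook non-convex SGD result that this kind of axiom usually refers to can be written as follows. The symbols are assumptions for this sketch, not the paper's notation: F is the total objective (task loss plus any ISR terms), assumed L-smooth and bounded below by F^*; g_t is an unbiased stochastic gradient with variance at most \sigma^2; \eta is a constant step size. This is the generic statement, not the paper's Theorem 1.

```latex
% Assumptions (generic, not taken from the paper):
%   F is L-smooth and F(x) \ge F^* for all x,
%   \mathbb{E}[g_t \mid x_t] = \nabla F(x_t), \quad
%   \mathbb{E}\|g_t - \nabla F(x_t)\|^2 \le \sigma^2, \quad
%   x_{t+1} = x_t - \eta g_t, \quad 0 < \eta \le 1/L.
\[
  \frac{1}{T}\sum_{t=0}^{T-1} \mathbb{E}\,\|\nabla F(x_t)\|^2
  \;\le\; \frac{2\,\bigl(F(x_0) - F^*\bigr)}{\eta T} + L\,\eta\,\sigma^2 .
\]
% With \Delta = F(x_0) - F^* and \eta = \sqrt{2\Delta / (L \sigma^2 T)}
% (valid once T \ge 2 \Delta L / \sigma^2, so that \eta \le 1/L):
\[
  \frac{1}{T}\sum_{t=0}^{T-1} \mathbb{E}\,\|\nabla F(x_t)\|^2
  \;\le\; 2\sqrt{\frac{2\,\Delta\,L\,\sigma^2}{T}}
  \;=\; O\!\left(\frac{1}{\sqrt{T}}\right).
\]
```

Read this way, the axiom amounts to asserting that L and \sigma^2 stay finite for the ISR-augmented objective, which is the point the first major referee comment asks the authors to establish for the MI estimators.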

pith-pipeline@v0.9.0 · 5872 in / 1336 out tokens · 29250 ms · 2026-05-21T08:37:06.020091+00:00 · methodology

