pith. machine review for the scientific record.

arxiv: 2605.10748 · v1 · submitted 2026-05-11 · 💻 cs.LG · cs.AI

Recognition: 2 theorem links · Lean Theorem

Provable Sparse Inversion and Token Relabel Enhanced One-shot Federated Learning with ViTs

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 04:39 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords: one-shot federated learning · vision transformers · model inversion · token relabeling · non-IID data · generalization bounds · sparse inversion · data-free learning

The pith

Sparse foreground inversion and token relabeling deliver tighter generalization bounds for one-shot federated ViT learning under non-IID data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that a new framework called FedMITR can solve semantic misalignment in data-free one-shot federated learning by generating synthetic images through selective inversion of only the meaningful foreground regions while stopping inversion of uninformative backgrounds. For Vision Transformer models it further applies a differentiated strategy where high-density patches keep their generated pseudo-labels but low-density patches receive new labels from an ensemble of models. Algorithmic stability analysis shows this pair of changes removes gradient instability caused by background noise and lowers overall gradient variance, which produces a provably tighter bound on how well the learned global model will generalize. A sympathetic reader would care because one-shot federated learning cuts communication to a single round yet usually fails when client data distributions differ sharply; fixing the quality of the synthetic data and the use of every patch could make such efficient distributed training reliable for vision tasks.

Core claim

The FedMITR framework trains the global model by fully exploiting all patches of synthetic images. It pairs sparse model inversion, which selectively inverts semantic foregrounds and halts background inversion, with a token relabel strategy that keeps generated pseudo-labels for high-information-density patches and applies ensemble relabeling to low-density patches. The stability analysis proves this pairing eliminates gradient instability from background noise and reduces gradient variance, guaranteeing a tighter generalization bound.

What carries the argument

Sparse Model Inversion combined with the differentiated Token Relabel strategy inside the FedMITR framework, which isolates foreground semantics during data synthesis and applies ensemble distillation only to uninformative ViT tokens.
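The decoupled objective this machinery implies can be sketched in a few lines. The sketch below is an illustration under stated assumptions, not the paper's implementation: the shapes, the threshold `tau`, the weight `lam`, and the function names are all hypothetical; only the split — cross-entropy against the generated pseudo-label on high-density patches, KL toward ensemble soft labels on low-density patches — follows the abstract.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def fedmitr_style_loss(patch_logits, hard_label, ensemble_probs, density,
                       tau=0.5, lam=1.0):
    """Illustrative decoupled per-patch loss in the spirit of the paper's
    two-term objective (shapes and hyper-parameters are assumptions):
    patch_logits:   (N, C) per-patch class logits from the student ViT
    hard_label:     int, generated pseudo-label of the synthetic image
    ensemble_probs: (N, C) soft labels produced by the client-model ensemble
    density:        (N,) per-patch information-density score in [0, 1]
    """
    high = density >= tau              # patches that keep the pseudo-label
    low = ~high                        # patches routed to ensemble relabeling
    p = softmax(patch_logits)
    # Term 1: cross-entropy on semantic-foreground (high-density) patches
    ce = -np.log(p[high, hard_label] + 1e-12).mean() if high.any() else 0.0
    # Term 2: KL toward the ensemble's soft labels on low-density patches
    kl = 0.0
    if low.any():
        q = ensemble_probs[low]
        kl = (q * (np.log(q + 1e-12) - np.log(p[low] + 1e-12))).sum(axis=1).mean()
    return ce + lam * kl
```

When the ensemble's soft labels agree with the student on the low-density patches, the KL term vanishes and only the foreground cross-entropy remains — which is exactly the decoupling the stability analysis leans on.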

If this is right

  • All patches of the synthetic images can be used for ViT training instead of being discarded when they carry low information.
  • Gradient instability arising from background noise is removed by the sparse inversion step.
  • Gradient variance during distillation is reduced by switching to ensemble relabeling on low-density tokens.
  • The resulting generalization bound is provably tighter than that of prior data-free one-shot methods.
  • Empirical accuracy exceeds existing baselines across multiple non-IID partitions and ViT architectures.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same foreground-focused inversion plus density-aware relabeling pattern could be tested on other transformer families beyond ViTs to check whether the stability benefit generalizes.
  • The stability analysis supplies a concrete recipe for designing regularization terms that penalize background leakage in any data-free federated setting.
  • Measuring per-patch information density at inference time might yield a lightweight way to decide which tokens need relabeling without retraining ensembles.
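The last extension can be prototyped cheaply. One common proxy for per-patch information density — an assumption of this sketch, since the paper may define density differently — is the attention mass the CLS token places on each patch:

```python
import numpy as np

def patch_density_from_cls_attention(attn, eps=1e-12):
    """attn: (heads, N+1, N+1) row-softmaxed attention from a ViT block,
    with the CLS token at index 0. Returns an (N,) score in [0, 1]:
    head-averaged CLS-to-patch attention, rescaled so the densest patch is 1.
    Using CLS attention as the density proxy is an assumption, not the
    paper's stated definition."""
    cls_to_patches = attn[:, 0, 1:].mean(axis=0)      # average over heads
    return cls_to_patches / (cls_to_patches.max() + eps)

def split_tokens(density, tau=0.5):
    """Indices of patches that would keep pseudo-labels vs. be relabeled."""
    high = np.flatnonzero(density >= tau)
    low = np.flatnonzero(density < tau)
    return high, low
```

Because the score comes free with a forward pass, thresholding it at inference time needs no extra training — which is what would make the relabeling decision lightweight.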

Load-bearing premise

Sparse inversion of synthetic images can reliably isolate semantic foregrounds without discarding information needed for accurate ViT predictions, and ensemble relabeling of low-density patches produces labels that remain sufficiently robust under extreme non-IID partitions.

What would settle it

An ablation that disables sparse inversion while keeping the rest of the pipeline fixed and measures whether gradient instability from background noise increases and the empirical generalization gap widens on a standard non-IID benchmark.
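Short of that ablation, the variance half of the mechanism admits a toy check. If the ensemble's soft label is any convex blend of the student's prediction and the hard label — a simplifying assumption for illustration, not the paper's model of the ensemble — the soft error signal is mechanically smaller in norm than the hard one:

```python
import numpy as np

rng = np.random.default_rng(1)
n, c = 1000, 10
logits = rng.normal(size=(n, c))
p = np.exp(logits - logits.max(axis=1, keepdims=True))
p /= p.sum(axis=1, keepdims=True)            # student softmax outputs p(X)
hard = np.eye(c)[rng.integers(0, c, n)]      # one-hot hard labels y
soft = 0.7 * p + 0.3 * hard                  # assumed ensemble blend (hypothetical)

# delta_hard = p - y, delta_soft = p - q; here p - soft = 0.3 * (p - hard)
delta_hard = np.linalg.norm(p - hard, axis=1).mean()
delta_soft = np.linalg.norm(p - soft, axis=1).mean()
assert delta_soft < delta_hard
```

The real claim (Lemma 3 in the excerpted appendix) must hold for the actual ensemble outputs, not this blend; the toy only shows the direction of the effect is automatic once the soft labels sit between prediction and hard target.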

Figures

Figures reproduced from arXiv: 2605.10748 by Li Shen, Qinglun Li, Xiaochun Cao, Xiaolei Hao, Xun Yang, Zhifeng Hao.

Figure 1. t-SNE visualisation of the features. (a)-(c) represent the feature distribution visualizations for the original training images, the synthetic images using …
Figure 2. An illustration of the server training process of FedMITR, which consists of two stages: (1) In the model inversion stage, we invert well-trained local …
Figure 3. Label Distribution per Client under Dirichlet Partition.
Figure 4. Comparison of Accuracy Between the FedAvg Method with 100 …
Figure 5. Test accuracy of the server model of FedMITR using different hyper-parameters (a) …
Figure 6. Visualization of synthetic data and training data.
Original abstract

One-Shot Federated Learning, where a central server learns a global model in a single communication round, has emerged as a promising paradigm. However, under extremely non-IID settings, existing data-free methods often generate low-quality data that suffers from severe semantic misalignment with ground-truth labels. To overcome these issues, we propose a novel Federated Model Inversion and Token Relabel (FedMITR) framework, which trains the global model by fully exploiting all patches of synthetic images. Specifically, FedMITR employs sparse model inversion during data generation, selectively inverting semantic foregrounds while halting the inversion of uninformative backgrounds. To address semantically meaningless tokens that hinder ViT predictions, we implement a differentiated strategy: patches with high information density utilize generated pseudo-labels, while patches with low information density are relabeled via ensemble models for robust distillation. Theoretically, our analysis based on algorithmic stability reveals that Sparse Model Inversion eliminates gradient instability arising from background noise, while Token Relabel effectively reduces gradient variance, collectively guaranteeing a tighter generalization bound. Empirically, extensive experimental results demonstrate that FedMITR substantially outperforms existing baselines under various settings.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The manuscript proposes the FedMITR framework for one-shot federated learning with Vision Transformers. It introduces Sparse Model Inversion to selectively invert semantic foregrounds of synthetic images while halting background inversion, paired with a Differentiated Token Relabel strategy that applies generated pseudo-labels to high-density patches and ensemble relabeling to low-density patches. The paper claims that an algorithmic stability analysis shows these steps eliminate background-induced gradient instability and reduce variance, yielding a tighter generalization bound, and reports substantial empirical gains over baselines under extreme non-IID partitions.

Significance. If the stability bound is rigorously established, the work would offer a principled mechanism for improving synthetic data quality in data-free one-shot FL, with particular relevance to ViT architectures that are sensitive to token-level noise. The explicit framing of the sparse-inversion-plus-relabel combination as a way to exploit all patches is a constructive contribution, but its impact depends on whether the theoretical tightening is independent of the method's own hyperparameters.

major comments (1)
  1. Abstract (theoretical claim): the assertion that Sparse Model Inversion 'eliminates gradient instability arising from background noise' and thereby guarantees a tighter generalization bound assumes that standard algorithmic stability results extend to the non-smooth sparse selection operator. Sparse inversion performs a hard threshold or argmax on patch information density, which is discontinuous and non-Lipschitz; uniform stability bounds for SGD require Lipschitz continuity of the loss in the data and small continuous perturbations. No auxiliary lemma bounding the sensitivity of the selected foreground subset to inversion-model perturbations is referenced, so the claimed elimination of instability and the resulting bound tightness do not follow from the stated premises.
minor comments (1)
  1. Abstract: the single-paragraph presentation interleaves method, theory, and results so densely that the logical flow from sparse inversion to stability improvement is difficult to parse on first reading.
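The major comment can be made concrete with the standard uniform-stability bound it invokes. In the Hardt–Recht–Singer setting (convex case), for a loss that is L-Lipschitz and β-smooth, T steps of SGD with step sizes α_t on n samples satisfy the bound below; the comment's point is that the hard selection step breaks the Lipschitz premise the bound needs. The decomposition of the masked loss shown in the comments is our illustration of the gap, not an equation from the paper.

```latex
% Standard uniform-stability bound for SGD (Hardt, Recht & Singer, 2016),
% stated only to locate where the hard selection step breaks the argument:
\epsilon_{\mathrm{stab}} \;\le\; \frac{2L^2}{n}\sum_{t=1}^{T}\alpha_t,
\qquad\text{requiring}\quad
|\ell(\theta; z) - \ell(\theta'; z)| \le L\,\|\theta - \theta'\|.
% With sparse inversion, the effective loss has the form
%   \ell(\theta; z) = \ell_{\mathrm{CE}}\bigl(f_\theta(M(\theta)\odot X),\, y\bigr),
% where M(\theta) = \mathbb{1}[\,d_\theta(\cdot) \ge \tau\,] is a hard threshold:
% an arbitrarily small parameter perturbation can flip a patch across \tau,
% so no finite L exists without an extra margin or sensitivity lemma.
```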

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the careful and technically precise comment on the stability analysis. We agree that the non-smooth nature of the sparse selection step requires an auxiliary result to rigorously extend standard algorithmic stability bounds, and we will revise the manuscript accordingly.

Point-by-point responses
  1. Referee: Abstract (theoretical claim): the assertion that Sparse Model Inversion 'eliminates gradient instability arising from background noise' and thereby guarantees a tighter generalization bound assumes that standard algorithmic stability results extend to the non-smooth sparse selection operator. Sparse inversion performs a hard threshold or argmax on patch information density, which is discontinuous and non-Lipschitz; uniform stability bounds for SGD require Lipschitz continuity of the loss in the data and small continuous perturbations. No auxiliary lemma bounding the sensitivity of the selected foreground subset to inversion-model perturbations is referenced, so the claimed elimination of instability and the resulting bound tightness do not follow from the stated premises.

    Authors: We thank the referee for highlighting this important technical gap. The sparse inversion indeed applies a hard selection on patch information density, which is discontinuous. In the revised manuscript we will insert an auxiliary lemma (new Lemma 3.2) that bounds the sensitivity of the foreground subset to perturbations in the inversion model. Under the assumption that the information-density scoring function is Lipschitz continuous with respect to model parameters (satisfied by the convolutional or ViT-based inversion networks used), the symmetric difference between selected subsets is controlled by O(perturbation magnitude). This allows the composite loss to remain Lipschitz with a modified constant, so that the standard uniform-stability argument for SGD carries through with an additive term that vanishes as the perturbation goes to zero. Consequently, the background-induced gradient instability is provably suppressed and the generalization bound tightens as claimed. We will also update the abstract and Section 3 to reference the new lemma explicitly. revision: yes
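For readers gauging whether the promised Lemma 3.2 can work, one shape such a sensitivity lemma could take — our reconstruction from the rebuttal's sketch, not the authors' statement — combines the assumed Lipschitz scoring with a margin condition at the threshold:

```latex
% Hypothetical sensitivity lemma, reconstructed from the rebuttal's sketch.
% Assume the density score d_\theta(j) is L_d-Lipschitz in \theta, and that
% scores keep a margin \gamma from the threshold \tau for all but k patches:
%   |d_\theta(j) - \tau| \ge \gamma \quad \text{for } j \notin B,\ |B| \le k.
% Then for \|\theta - \theta'\| < \gamma / L_d, only patches in B can change
% side of the threshold, so the selected sets satisfy
|S(\theta)\,\triangle\,S(\theta')| \;\le\; k,
% and the composite loss changes by at most the contribution of k patches,
% restoring a (modified) Lipschitz constant for the stability argument.
```

Note that Lipschitz continuity of the scoring function alone is not enough, as patches sitting exactly at the threshold can flip under arbitrarily small perturbations; the margin condition (or something equivalent) is what the revised lemma would need to make explicit.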

Circularity Check

0 steps flagged

No circularity in derivation chain

full rationale

The paper's central theoretical claim applies standard algorithmic stability analysis to the effects of Sparse Model Inversion (foreground selection) and Token Relabel (ensemble distillation on low-density patches), asserting that these steps eliminate background-induced gradient instability and reduce variance to yield a tighter generalization bound. No equations are presented in which the stability constants are defined in terms of the target bound itself, nor is the bound obtained by fitting parameters to the same data used for the claim. Nor does the analysis rest on self-citations with overlapping authors and unverified uniqueness results; the derivation therefore remains independent of its own outputs and does not reduce to a tautology or a renaming of inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

Only the abstract is available, so the ledger is necessarily incomplete. The main new elements are the two algorithmic components whose independence from prior literature cannot be verified here.

axioms (1)
  • domain assumption: Algorithmic stability analysis yields a valid generalization bound when applied to the proposed sparse inversion and token relabel training dynamics
    The abstract states that the tighter bound follows from stability analysis of these two mechanisms.
invented entities (2)
  • Sparse Model Inversion no independent evidence
    purpose: Selectively inverts semantic foregrounds of synthetic images while halting background inversion to reduce noise
    Core component of the FedMITR data-generation step introduced in the abstract.
  • Differentiated Token Relabel strategy no independent evidence
    purpose: Applies generated pseudo-labels to high-information-density patches and ensemble relabeling to low-density patches
    Second core component introduced to handle uninformative ViT tokens.

pith-pipeline@v0.9.0 · 5510 in / 1466 out tokens · 43986 ms · 2026-05-12T04:39:33.024723+00:00 · methodology


Reference graph

Works this paper leans on

59 extracted references · 59 canonical work pages · 2 internal anchors

  1. [1]

    Communication-efficient learning of deep networks from decentralized data

    B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. y Arcas, “Communication-efficient learning of deep networks from decentralized data,” in Artificial Intelligence and Statistics. PMLR, 2017, pp. 1273–1282.

  2. [2]

    Federated Learning: Strategies for Improving Communication Efficiency

    J. Konečný, H. B. McMahan, F. X. Yu, P. Richtárik, A. T. Suresh, and D. Bacon, “Federated learning: Strategies for improving communication efficiency,” arXiv preprint arXiv:1610.05492, vol. 8, 2016.

  3. [3]

    Fedrec++: Lossless federated recommendation with explicit feedback

    F. Liang, W. Pan, and Z. Ming, “Fedrec++: Lossless federated recommendation with explicit feedback,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, no. 5, 2021, pp. 4224–4231.

  4. [4]

    Fedct: Federated collaborative transfer for recommendation

    S. Liu, S. Xu, W. Yu, Z. Fu, Y. Zhang, and A. Marian, “Fedct: Federated collaborative transfer for recommendation,” in Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2021, pp. 716–725.

  5. [5]

    Feddg: Federated domain generalization on medical image segmentation via episodic learning in continuous frequency space

    Q. Liu, C. Chen, J. Qin, Q. Dou, and P.-A. Heng, “Feddg: Federated domain generalization on medical image segmentation via episodic learning in continuous frequency space,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 1013–1023.

  6. [6]

    Feddad: Federated domain adaptation for object detection

    P. J. Lu, C.-Y. Jui, and J.-H. Chuang, “Feddad: Federated domain adaptation for object detection,” IEEE Access, 2023.

  7. [7]

    Visual object detection for privacy-preserving federated learning

    J. Zhang, J. Zhou, J. Guo, and X. Sun, “Visual object detection for privacy-preserving federated learning,” IEEE Access, vol. 11, pp. 33324–33335, 2023.

  8. [8]

    A secure and efficient federated learning framework for nlp

    J. Deng, C. Wang, X. Meng, Y. Wang, J. Li, S. Lin, S. Han, F. Miao, S. Rajasekaran, and C. Ding, “A secure and efficient federated learning framework for nlp,” arXiv preprint arXiv:2201.11934, 2022.

  9. [9]

    Federated learning: Challenges, methods, and future directions

    T. Li, A. K. Sahu, A. Talwalkar, and V. Smith, “Federated learning: Challenges, methods, and future directions,” IEEE Signal Processing Magazine, vol. 37, no. 3, pp. 50–60, 2020.

  10. [10]

    Dfedadmm: Dual constraint controlled model inconsistency for decentralized federated learning

    Q. Li, L. Shen, G. Li, Q. Yin, and D. Tao, “Dfedadmm: Dual constraint controlled model inconsistency for decentralized federated learning,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 47, no. 6, pp. 4803–4815, 2025.

  11. [11]

    Advances and open problems in federated learning

    P. Kairouz, H. B. McMahan, B. Avent, A. Bellet, M. Bennis, A. N. Bhagoji, K. Bonawitz, Z. Charles, G. Cormode, R. Cummings et al., “Advances and open problems in federated learning,” Foundations and Trends® in Machine Learning, vol. 14, no. 1–2, pp. 1–210, 2021.

  12. [12]

    Dispfl: Towards communication-efficient personalized federated learning via decentralized sparse training

    R. Dai, L. Shen, F. He, X. Tian, and D. Tao, “Dispfl: Towards communication-efficient personalized federated learning via decentralized sparse training,” in International Conference on Machine Learning. PMLR, 2022, pp. 4587–4604.

  13. [13]

    Man-in-the-middle attacks against machine learning classifiers via malicious generative models

    D. Wang, C. Li, S. Wen, S. Nepal, and Y. Xiang, “Man-in-the-middle attacks against machine learning classifiers via malicious generative models,” IEEE Transactions on Dependable and Secure Computing, vol. 18, no. 5, pp. 2074–2087, 2021.

  14. [14]

    A survey on security and privacy of federated learning

    V. Mothukuri, R. M. Parizi, S. Pouriyeh, Y. Huang, A. Dehghantanha, and G. Srivastava, “A survey on security and privacy of federated learning,” Future Generation Computer Systems, vol. 115, pp. 619–640, 2021.

  15. [15]

    See through gradients: Image batch recovery via gradinversion

    H. Yin, A. Mallya, A. Vahdat, J. M. Alvarez, J. Kautz, and P. Molchanov, “See through gradients: Image batch recovery via gradinversion,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 16337–16346.

  16. [16]

    One-shot federated learning

    N. Guha, A. Talwalkar, and V. Smith, “One-shot federated learning,” arXiv preprint arXiv:1902.11175, 2019.

  17. [17]

    Modeldb: a system for machine learning model management

    M. Vartak, H. Subramanyam, W.-E. Lee, S. Viswanathan, S. Husnoo, S. Madden, and M. Zaharia, “Modeldb: a system for machine learning model management,” in Proceedings of the Workshop on Human-In-the-Loop Data Analytics, 2016, pp. 1–3.

  18. [18]

    Parametric feature transfer: One-shot federated learning with foundation models

    M. Beitollahi, A. Bie, S. Hemati, L. M. Brunswic, X. Li, X. Chen, and G. Zhang, “Parametric feature transfer: One-shot federated learning with foundation models,” arXiv preprint arXiv:2402.01862, 2024.

  19. [19]

    Dense: Data-free one-shot federated learning

    J. Zhang, C. Chen, B. Li, L. Lyu, S. Wu, S. Ding, C. Shen, and C. Wu, “Dense: Data-free one-shot federated learning,” Advances in Neural Information Processing Systems, vol. 35, pp. 21414–21428, 2022.

  20. [20]

    Fine-tuning global model via data-free knowledge distillation for non-iid federated learning

    L. Zhang, L. Shen, L. Ding, D. Tao, and L.-Y. Duan, “Fine-tuning global model via data-free knowledge distillation for non-iid federated learning,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 10174–10183.

  21. [21]

    Distilled one-shot federated learning

    Y. Zhou, G. Pu, X. Ma, X. Li, and D. Wu, “Distilled one-shot federated learning,” arXiv preprint arXiv:2009.07999, 2020.

  22. [22]

    Practical one-shot federated learning for cross-silo setting

    Q. Li, B. He, and D. Song, “Practical one-shot federated learning for cross-silo setting,” in Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence, IJCAI-21, Z.-H. Zhou, Ed. International Joint Conferences on Artificial Intelligence Organization, 2021, pp. 1484–1490, Main Track. [Online]. Available: https://doi.org/10.24...

  23. [23]

    Towards addressing label skews in one-shot federated learning

    Y. Diao, Q. Li, and B. He, “Towards addressing label skews in one-shot federated learning,” in The Eleventh International Conference on Learning Representations, 2023. [Online]. Available: https://openreview.net/forum?id=rzrqh85f4Sc

  25. [25]

    Data-free one-shot federated learning under very high statistical heterogeneity

    C. E. Heinbaugh, E. Luz-Ricca, and H. Shao, “Data-free one-shot federated learning under very high statistical heterogeneity,” in The Eleventh International Conference on Learning Representations, 2023. [Online]. Available: https://openreview.net/forum?id=hb4vM3jspB

  26. [26]

    Exploring one-shot semi-supervised federated learning with pre-trained diffusion models

    M. Yang, S. Su, B. Li, and X. Xue, “Exploring one-shot semi-supervised federated learning with pre-trained diffusion models,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 38, no. 15, 2024, pp. 16325–16333.

  27. [27]

    Model inversion attacks that exploit confidence information and basic countermeasures

    M. Fredrikson, S. Jha, and T. Ristenpart, “Model inversion attacks that exploit confidence information and basic countermeasures,” in Proceedings of the 22nd ACM SIGSAC Conference on Computer and Communications Security, 2015, pp. 1322–1333.

  28. [28]

    Model inversion attacks against collaborative inference

    Z. He, T. Zhang, and R. B. Lee, “Model inversion attacks against collaborative inference,” in Proceedings of the 35th Annual Computer Security Applications Conference, 2019, pp. 148–162.

  29. [29]

    Adversarial neural network inversion via auxiliary knowledge alignment

    Z. Yang, E.-C. Chang, and Z. Liang, “Adversarial neural network inversion via auxiliary knowledge alignment,” arXiv preprint arXiv:1902.08552, 2019.

  30. [30]

    Data-free knowledge distillation via feature exchange and activation region constraint

    S. Yu, J. Chen, H. Han, and S. Jiang, “Data-free knowledge distillation via feature exchange and activation region constraint,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 24266–24275.

  31. [31]

    Deep classifier mimicry without data access

    S. Braun, M. Mundt, and K. Kersting, “Deep classifier mimicry without data access,” in International Conference on Artificial Intelligence and Statistics. PMLR, 2024, pp. 4762–4770.

  32. [32]

    Learning to retain while acquiring: combating distribution-shift in adversarial data-free knowledge distillation

    G. Patel, K. R. Mopuri, and Q. Qiu, “Learning to retain while acquiring: combating distribution-shift in adversarial data-free knowledge distillation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 7786–7794.

  33. [33]

    Dreaming to distill: Data-free knowledge transfer via deepinversion

    H. Yin, P. Molchanov, J. M. Alvarez, Z. Li, A. Mallya, D. Hoiem, N. K. Jha, and J. Kautz, “Dreaming to distill: Data-free knowledge transfer via deepinversion,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 8715–8724.

  34. [34]

    An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

    A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly et al., “An image is worth 16x16 words: Transformers for image recognition at scale,” arXiv preprint arXiv:2010.11929, 2020.

  35. [35]

    Attention is all you need

    A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” Advances in Neural Information Processing Systems, vol. 30, 2017.

  36. [36]

    Training data-efficient image transformers & distillation through attention

    H. Touvron, M. Cord, M. Douze, F. Massa, A. Sablayrolles, and H. Jégou, “Training data-efficient image transformers & distillation through attention,” in International Conference on Machine Learning. PMLR, 2021, pp. 10347–10357.

  37. [37]

    Enhancing one-shot federated learning through data and ensemble co-boosting

    R. Dai, Y. Zhang, A. Li, T. Liu, X. Yang, and B. Han, “Enhancing one-shot federated learning through data and ensemble co-boosting,” arXiv preprint arXiv:2402.15070, 2024.

  38. [38]

    Inverting visual representations with convolutional networks

    A. Dosovitskiy and T. Brox, “Inverting visual representations with convolutional networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 4829–4837.

  39. [39]

    Ensemble distillation for robust model fusion in federated learning

    T. Lin, L. Kong, S. U. Stich, and M. Jaggi, “Ensemble distillation for robust model fusion in federated learning,” Advances in Neural Information Processing Systems, vol. 33, pp. 2351–2363, 2020.

  40. [40]

    All tokens matter: Token labeling for training better vision transformers

    Z.-H. Jiang, Q. Hou, L. Yuan, D. Zhou, Y. Shi, X. Jin, A. Wang, and J. Feng, “All tokens matter: Token labeling for training better vision transformers,” Advances in Neural Information Processing Systems, vol. 34, pp. 18590–18602, 2021.

  41. [41]

    Not all patches are what you need: Expediting vision transformers via token reorganizations

    Y. Liang, C. Ge, Z. Tong, Y. Song, J. Wang, and P. Xie, “Not all patches are what you need: Expediting vision transformers via token reorganizations,” arXiv preprint arXiv:2202.07800, 2022.

  42. [42]

    Stability and generalization

    O. Bousquet and A. Elisseeff, “Stability and generalization,” Journal of Machine Learning Research, vol. 2, no. Mar, pp. 499–526, 2002.

  43. [43]

    Train faster, generalize better: Stability of stochastic gradient descent

    M. Hardt, B. Recht, and Y. Singer, “Train faster, generalize better: Stability of stochastic gradient descent,” in International Conference on Machine Learning. PMLR, 2016, pp. 1225–1234.

  44. [44]

    Foundations of data science

    A. Blum, J. Hopcroft, and R. Kannan, Foundations of Data Science. Cambridge University Press, 2020.

  45. [45]

    A statistical perspective on distillation

    A. K. Menon, A. S. Rawat, S. Reddi, S. Kim, and S. Kumar, “A statistical perspective on distillation,” in International Conference on Machine Learning. PMLR, 2021, pp. 7632–7642.

  46. [46]

    Dreaming to distill: Data-free knowledge transfer via deepinversion

    H. Yin, P. Molchanov, J. M. Alvarez, Z. Li, A. Mallya, D. Hoiem, N. K. Jha, and J. Kautz, “Dreaming to distill: Data-free knowledge transfer via deepinversion,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2020.

  47. [47]

    Learning multiple layers of features from tiny images

    A. Krizhevsky, G. Hinton et al., “Learning multiple layers of features from tiny images,” Master’s thesis, 2009.

  48. [48]

    Deep hashing network for unsupervised domain adaptation

    H. Venkateswara, J. Eusebio, S. Chakraborty, and S. Panchanathan, “Deep hashing network for unsupervised domain adaptation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 5018–5027.

  49. [49]

    Matching networks for one shot learning

    O. Vinyals, C. Blundell, T. Lillicrap, D. Wierstra et al., “Matching networks for one shot learning,” Advances in Neural Information Processing Systems, vol. 29, 2016.

  50. [50]

    Imagenet large scale visual recognition challenge

    O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein et al., “Imagenet large scale visual recognition challenge,” International Journal of Computer Vision, vol. 115, pp. 211–252, 2015.

  51. [51]

    Foundations of machine learning

    M. Mohri, A. Rostamizadeh, and A. Talwalkar, Foundations of Machine Learning. MIT Press, 2018.

    Appendix A excerpt carried by this anchor: “More details about the Experiment. Datasets and partitions. Our experiments are conducted on the following four popular real-world datasets: CIFAR10 [46], CIFAR100 [46], OfficeHome [47] and Mini-ImageNet [48]. CIFAR10 consists of 60,000 color …”

  52. [52]

    Co-Boosting [36] uses the current Ensemble to synthesize higher-quality samples in an adversarial manner

    Internal anchor (excerpt): “… train a generator that considers similarity, stability, and transferability and perform federated distillation on the server side. Co-Boosting [36] uses the current Ensemble to synthesize higher-quality samples in an adversarial manner. These hard samples are then employed to promote the quality of the Ensemble by adjusting the ensembling weights for eac…”

  53. [53]

    Gradient Formulation in Vision Transformers

    Internal anchor (reconstructed from the extracted text): Let the input sequence X ∈ R^(N×D) be partitioned into semantic foreground indices I_h and background noise indices I_l. The Multi-Head Self-Attention (MSA) layer computes the output Z as

        Z = MSA(X) = A X W_V,  where A = Softmax(X W_Q W_K^T X^T / √d).   (25)

    Here A ∈ R^(N×N) is the attention matrix and W_V is the value projection w…

  54. [54]

    Instability of Dense Inversion (DI)

    Internal anchor (reconstructed): In Dense Inversion the objective is the cross-entropy loss against a hard label y, L_DI = ℓ_CE(f(X), y). The error signal is δ_DI = p(X) − y, where p(X) is the softmax probability. The gradient contribution from background noise tokens (j ∈ I_l) follows by substituting into Eq. (26):

        ∇_{W_V}^(noise) L_DI = Σ_{j ∈ I_l} (X_j)^T (A^T)_{:,j} δ_DI …

  55. [55]

    Stability of FedMITR

    Internal anchor (reconstructed): FedMITR decouples the loss function into

        L_Fed = ℓ_CE(f(X_h), y)  [Term 1: Sparsity]  +  λ ℓ_KL(f(X_l), E(X_l))  [Term 2: Relabeling].   (28)

    The appendix proves that FedMITR strictly reduces the gradient norm via two mechanisms: gradient elimination (Sparsity) and gradient variance reduction (Relabeling). Mechanism 1: Sparsity as Gradient Elimination. For Term 1, a binary mask M gives the effective input X̃ = M ⊙ X, with X̃_j = 0 for noise indices j ∈ I_l …

  56. [56]

    Main Theorem and Conclusion

    Internal anchor (reconstructed): Theorem 1 (Lipschitz Constant Reduction). Let L_DI and L_Fed be the Lipschitz constants of the loss gradients with respect to model parameters for Dense Inversion and FedMITR, respectively. Under Assumption 1, Assumption 2, and Lemma 3,

        L_Fed < L_DI.   (33)

  57. [57]

    Analysis of Dense Inversion (L_DI)

    Internal anchor (reconstructed): The noise component is driven by the interaction between the high-energy noise input X_noise and the hard-label error signal δ_hard:

        L_DI^(noise) ∝ sup(‖X_noise‖ · ‖δ_hard‖).

    Under Assumption 2, ‖δ_hard‖ is saturated (large) due to orthogonality, so the product ‖X_noise‖ · ‖δ_hard‖ constitutes a large gradient upper bound.

  58. [58]

    Analysis of FedMITR (L_Fed)

    Internal anchor (reconstructed): The noise component decouples into two terms. Sparsity term: the effective input is masked to zero (X̃ = 0), so the backpropagated gradient norm is ‖X̃^T A^T δ_hard‖ = ‖0 · A^T δ_hard‖ = 0. Relabeling term: the input is X_noise but the error signal is δ_soft, giving

        L_Fed^(relabel) ∝ λ sup(‖X_noise‖ · ‖δ_soft‖).

  59. [59]

    Comparison

    Internal anchor (reconstructed): Combining the components, the upper bounds compare as

        L_DI ≈ L_fg + C · E[‖δ_hard‖]
        L_Fed ≈ L_fg + 0 + λ C · E[‖δ_soft‖],

    where C is the common factor involving input and attention norms. By Lemma 3, the expected norm of the error signal from soft labels is strictly smaller than that from hard labels due to variance reduction: E[‖δ_soft‖] ≪ E[‖δ_hard‖] …