pith. machine review for the scientific record.

arxiv: 2604.06297 · v1 · submitted 2026-04-07 · 💻 cs.CR · cs.LG

Recognition: no theorem link

FedSpy-LLM: Towards Scalable and Generalizable Data Reconstruction Attacks from Gradients on LLMs

Feiyi Wang, Jian Liu, Syed Irfan Ali Meerza

Pith reviewed 2026-05-10 18:59 UTC · model grok-4.3

classification 💻 cs.CR cs.LG
keywords data reconstruction attack · gradient leakage · federated learning · large language models · PEFT · token extraction · privacy attack · subspace structure

The pith

A gradient decomposition strategy extracts tokens from LLM gradients by exploiting their rank deficiency and subspace structure, enabling data reconstruction attacks at larger scales even with parameter-efficient fine-tuning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to demonstrate that reconstruction attacks on gradients shared during federated training of large language models can be extended beyond the small-batch, short-sequence limits of earlier work. It targets the practical case where models are fine-tuned with parameter-efficient methods, which create large null spaces that previously made token recovery difficult. The central move is a decomposition that isolates usable signal components while handling the low-rank properties of the gradients, followed by an alignment step to recover token order. If this holds, it shows that federated learning combined with PEFT does not close the gradient leakage channel for realistic training workloads across encoder, decoder, and encoder-decoder architectures.

Core claim

FedSpy-LLM uses a gradient decomposition that exploits rank deficiency and subspace structure to pull out tokens efficiently while keeping key signal components intact. This directly counters the reconstruction problems caused by the large null space that appears when parameter-efficient fine-tuning is applied. An additional iterative alignment of each token's partial-sequence gradient against the full-sequence gradient recovers the correct ordering, supporting larger batch sizes, longer sequences, and generalization across model families.
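The abstract does not spell out the decomposition itself, but the family of attacks it builds on starts from a simpler observation: for a trainable embedding layer, only the rows indexed by tokens actually present in the batch receive gradient. A minimal PyTorch sketch of that token-set extraction step follows; the toy model, the self-prediction loss, and the norm threshold are illustrative assumptions, not the authors' procedure.

```python
# Toy illustration (not FedSpy-LLM's decomposition): with a trainable embedding,
# the rows of d(loss)/d(embedding.weight) with non-negligible norm reveal
# exactly which vocabulary ids appeared in the private batch.
import torch
import torch.nn as nn

torch.manual_seed(0)
vocab, dim = 1000, 64
emb, head = nn.Embedding(vocab, dim), nn.Linear(dim, vocab)

tokens = torch.randint(0, vocab, (4, 16))            # "private" batch: 4 seqs x 16 tokens
logits = head(emb(tokens))                           # toy forward pass
loss = nn.functional.cross_entropy(logits.view(-1, vocab), tokens.view(-1))
loss.backward()                                      # the gradient a client would share

row_norms = emb.weight.grad.norm(dim=1)              # one norm per vocabulary row
recovered = set(torch.nonzero(row_norms > 1e-8).flatten().tolist())
truth = set(tokens.flatten().tolist())
print(f"recovered {len(recovered)} ids; overlap with the true token set: "
      f"{len(recovered & truth) / len(truth):.2f}")
```

PEFT changes this picture because the embedding is typically frozen and only low-rank adapters produce gradients; that null space is exactly the obstacle the paper's decomposition claims to work around.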

What carries the argument

gradient decomposition strategy that exploits rank deficiency and subspace structure of gradients to enable token extraction

If this is right

  • Reconstruction remains feasible on PEFT gradients despite their large null space.
  • The attack scales to batch sizes and sequence lengths larger than those handled by prior gradient-leakage methods.
  • Token ordering can be recovered accurately through iterative partial-to-full gradient alignment (a greedy variant is sketched after this list).
  • The approach works across encoder-based, decoder-based, and encoder-decoder LLM architectures.
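To make the alignment idea concrete, here is a deliberately small sketch of one greedy variant: grow the sequence one token at a time from the already-extracted token pool, keeping the candidate whose partial-sequence gradient has the highest cosine similarity to the observed full-sequence gradient. The toy position-aware model, the greedy search, and the cosine score are assumptions; the paper's alignment procedure is only described at a high level here.

```python
# Hedged sketch of iterative partial-to-full gradient alignment for ordering.
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
vocab, dim, max_len = 50, 32, 8
emb, pos, head = nn.Embedding(vocab, dim), nn.Embedding(max_len, dim), nn.Linear(dim, vocab)
params = [p for m in (emb, pos, head) for p in m.parameters()]

def grad_vector(seq):
    """Flattened gradient of a toy position-aware loss on `seq` (list of ids)."""
    for p in params:
        p.grad = None
    x = torch.tensor(seq)
    h = emb(x) + pos(torch.arange(len(seq)))          # positions make order matter
    F.cross_entropy(head(h), x).backward()
    return torch.cat([p.grad.flatten() for p in params])

true_seq = [7, 3, 42, 3, 19]                          # private ordering, unknown to the attacker
observed = grad_vector(true_seq)                      # full-sequence gradient the client shared
pool = sorted(true_seq)                               # token multiset recovered in the previous step

ordered, remaining = [], list(pool)
while remaining:
    scores = [F.cosine_similarity(grad_vector(ordered + [t]), observed, dim=0).item()
              for t in remaining]
    ordered.append(remaining.pop(scores.index(max(scores))))
print("reconstructed order:", ordered, "| true order:", true_seq)
```

On a real LLM the partial gradients come from the model's own layers and the search has to cope with batches rather than a single sequence, so this is only the shape of the idea, not evidence that a greedy rule suffices.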

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • If the rank-deficiency property persists across additional PEFT variants, then gradient-sharing defenses may need targeted noise calibrated to that structure rather than generic clipping (see the sketch after this list).
  • The same decomposition lens could be tested on gradients arising from other distributed training regimes outside federated learning.
  • A practical next measurement would be whether the recovered sequences retain enough semantic content to enable downstream inference attacks on private user data.
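As an illustration of the first extension, here is a speculative sketch of what "noise calibrated to that structure" could mean: put the defensive perturbation into the gradient's dominant subspace instead of spreading it isotropically, where most of it falls into the null space. This is Pith's reading, not anything the paper proposes; the synthetic low-rank gradient, the SVD split, and the noise budget are all assumptions.

```python
# Speculative defense sketch: compare isotropic noise with noise targeted at a
# low-rank gradient's informative subspace. Shapes and thresholds are made up.
import torch

torch.manual_seed(0)
G = torch.randn(768, 32) @ torch.randn(32, 3072)      # synthetic rank-32 "gradient"
U, S, _ = torch.linalg.svd(G, full_matrices=False)
k = int((S > 1e-3 * S[0]).sum())                      # numerical rank of G
sigma = 0.01 * S[0].item()                            # shared noise budget

iso = sigma * torch.randn_like(G) / G.numel() ** 0.5                  # generic isotropic noise
coeffs = sigma * torch.randn(k, G.shape[1]) / (k * G.shape[1]) ** 0.5
targeted = U[:, :k] @ coeffs                                          # same energy, aimed at the signal

def fraction_in_signal_subspace(noise):
    """Share of the noise energy that lands inside G's column space."""
    proj = U[:, :k] @ (U[:, :k].T @ noise)
    return (proj.norm() / noise.norm()).item()

print("numerical rank:", k)
print("isotropic noise in signal subspace:", round(fraction_in_signal_subspace(iso), 3))
print("targeted  noise in signal subspace:", round(fraction_in_signal_subspace(targeted), 3))
```

Under the same budget, the targeted perturbation concentrates where the reconstruction signal lives, which is the kind of structure-aware defense the bullet gestures at.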

Load-bearing premise

Gradients produced by PEFT-trained LLMs still contain enough rank deficiency and exploitable subspace structure to support accurate token extraction and ordering even at larger batch sizes and sequence lengths.

What would settle it

The claim would fail if applying the decomposition to gradients from a PEFT-tuned LLM at a batch size of 32 or greater yields token sets whose overlap with the true input is no higher than would be obtained by sampling randomly from the vocabulary; overlap clearly above that baseline at those scales would support it.
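A minimal sketch of how that comparison could be scored; the helper names, the overlap metric, and the uniform-sampling baseline are illustrative assumptions rather than the paper's evaluation protocol.

```python
# Sketch of the proposed check: compare the recovered token set's overlap with
# the true batch against the overlap expected from random vocabulary sampling.
import random

def overlap(recovered: set[int], truth: set[int]) -> float:
    """Fraction of the true token ids that appear in the recovered set."""
    return len(recovered & truth) / max(len(truth), 1)

def random_baseline(vocab_size: int, n_recovered: int, truth: set[int],
                    trials: int = 1000, seed: int = 0) -> float:
    """Mean overlap achieved by drawing the same number of ids uniformly at random."""
    rng = random.Random(seed)
    scores = []
    for _ in range(trials):
        guess = set(rng.sample(range(vocab_size), n_recovered))
        scores.append(overlap(guess, truth))
    return sum(scores) / trials

# If overlap(attack_output, truth) is not clearly above random_baseline(...),
# the reconstruction claim fails for that setting (PEFT, batch size >= 32).
```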

Figures

Figures reproduced from arXiv: 2604.06297 by Feiyi Wang, Jian Liu, Syed Irfan Ali Meerza.

Figure 1: Overview of FedSpy-LLM. FedSpy-LLM enables an adversary to reconstruct client training data by initializing a dummy input and iteratively updating it to match the client's gradient. To reduce the search space, the server projects candidate tokens onto the gradient's column space and recovers the correct sequence by comparing individual token gradients iteratively.
Figure 2: Impact of sequence length on the reconstruction efficiency.
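The Figure 1 caption describes the outer loop of the attack: initialize a dummy input and update it until its gradient matches the one the client shared. A generic gradient-matching loop of that shape is sketched below; the continuous embedding relaxation, the soft dummy labels, the squared-error gradient distance, and the Adam optimizer are assumptions carried over from the DLG line of work, not FedSpy-LLM's actual procedure.

```python
# Generic DLG-style gradient matching, the outer loop the Figure 1 caption
# describes. Whether the matched dummy recovers the private text depends on how
# strongly the layered gradient constrains it; that is the gap the paper's
# subspace decomposition is meant to close.
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
vocab, dim, seq_len = 100, 32, 12
emb, head = nn.Embedding(vocab, dim), nn.Linear(dim, vocab)
params = list(head.parameters())

def soft_ce(logits, target_probs):
    return -(target_probs * logits.log_softmax(-1)).sum(-1).mean()

# Client side: the gradient the server actually observes (targets = inputs, LM-style).
true_ids = torch.randint(0, vocab, (seq_len,))
observed = torch.autograd.grad(
    soft_ce(head(emb(true_ids)), F.one_hot(true_ids, vocab).float()), params)

# Server side: optimize dummy embeddings and soft dummy labels to match that gradient.
dummy_h = torch.randn(seq_len, dim, requires_grad=True)
dummy_y = torch.zeros(seq_len, vocab, requires_grad=True)
opt = torch.optim.Adam([dummy_h, dummy_y], lr=0.05)
for step in range(501):
    opt.zero_grad()
    g = torch.autograd.grad(soft_ce(head(dummy_h), dummy_y.softmax(-1)),
                            params, create_graph=True)
    match = sum(((a - b) ** 2).sum() for a, b in zip(g, observed))
    match.backward()
    opt.step()
    if step % 100 == 0:
        print(f"step {step:3d}  gradient-matching distance {match.item():.6f}")
```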
Original abstract

Given the growing reliance on private data in training Large Language Models (LLMs), Federated Learning (FL) combined with Parameter-Efficient Fine-Tuning (PEFT) has garnered significant attention for enhancing privacy and efficiency. Despite FL's privacy benefits, prior studies have shown that private data can still be extracted from shared gradients. However, these studies, mainly on full-parameter model training, are limited to reconstructing small batches, short input sequences, and specific model architectures, such as encoder-based or decoder-based models. The reconstruction quality becomes even worse when dealing with gradients from PEFT methods. To fully understand the practical attack surface of federated LLMs, this paper proposes FedSpy-LLM, a scalable and generalizable data reconstruction attack designed to reconstruct training data with larger batch sizes and longer sequences while generalizing across diverse model architectures, even when PEFT methods are deployed for training. At the core of FedSpy-LLM is a novel gradient decomposition strategy that exploits the rank deficiency and subspace structure of gradients, enabling efficient token extraction while preserving key signal components at scale. This approach further mitigates the reconstruction challenges introduced by PEFT's substantial null space, ensuring robustness across encoder-based, decoder-based, and encoder-decoder model architectures. Additionally, by iteratively aligning each token's partial-sequence gradient with the full-sequence gradient, FedSpy-LLM ensures accurate token ordering in reconstructed sequences.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes FedSpy-LLM, a data reconstruction attack from shared gradients in federated learning of LLMs trained with PEFT. It introduces a gradient decomposition strategy that exploits rank deficiency and subspace structure to extract tokens efficiently at larger batch sizes and sequence lengths, mitigates PEFT null-space issues, generalizes across encoder/decoder/encoder-decoder architectures, and recovers token ordering via iterative partial-to-full sequence gradient alignment.

Significance. If the decomposition and reconstruction claims hold with the reported scalability, the work would meaningfully extend the attack surface analysis for federated LLM training, highlighting practical privacy risks under PEFT and larger-scale settings that prior gradient-inversion methods could not address. This could inform defense design in FL systems.

major comments (2)
  1. [Gradient decomposition strategy] The central claim rests on the gradient decomposition exploiting rank deficiency and subspace structure to support larger batches and longer sequences (abstract and method description). No analysis, bound, or scaling experiment is provided showing that the effective rank of the (PEFT) gradient matrix remains sufficiently low relative to token count as batch size grows; larger batches add independent directions that can raise numerical rank and shrink the exploitable null space, directly threatening the scalability assertion.
  2. [PEFT handling and experimental validation] The mitigation of PEFT's substantial null space is asserted to preserve key signal components at scale, yet no quantitative measure (e.g., signal-to-noise ratio before/after decomposition, or reconstruction accuracy drop versus full-parameter baselines) is given to substantiate that the decomposition retains sufficient information for accurate token extraction and ordering.
minor comments (1)
  1. [Abstract] The abstract and high-level description omit equations or pseudocode for the decomposition and iterative alignment steps, which would aid clarity even if full details appear later.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and the recommendation for major revision. We address each major comment point by point below, providing clarifications and committing to revisions that strengthen the empirical support for our claims without misrepresenting the current manuscript.

Point-by-point responses
  1. Referee: The central claim rests on the gradient decomposition exploiting rank deficiency and subspace structure to support larger batches and longer sequences (abstract and method description). No analysis, bound, or scaling experiment is provided showing that the effective rank of the (PEFT) gradient matrix remains sufficiently low relative to token count as batch size grows; larger batches add independent directions that can raise numerical rank and shrink the exploitable null space, directly threatening the scalability assertion.

    Authors: We acknowledge that the manuscript does not include a formal theoretical bound on effective rank or an explicit scaling plot of rank versus batch size. The current work relies on empirical results showing successful reconstruction at larger scales than prior methods. In the revised version, we will add a dedicated scaling experiment in Section 4 that measures and plots the numerical rank of the (PEFT) gradient matrix as batch size and sequence length increase, alongside token recovery rates. We will also expand the method section to provide additional intuition on the subspace structure arising from shared token embeddings and attention patterns in LLMs, which empirically keeps the effective rank low even as batches grow, because new samples frequently align with existing directions rather than introducing fully independent ones (a sketch of such a rank measurement follows these responses). revision: yes

  2. Referee: The mitigation of PEFT's substantial null space is asserted to preserve key signal components at scale, yet no quantitative measure (e.g., signal-to-noise ratio before/after decomposition, or reconstruction accuracy drop versus full-parameter baselines) is given to substantiate that the decomposition retains sufficient information for accurate token extraction and ordering.

    Authors: We agree that quantitative metrics would provide stronger validation of the PEFT null-space mitigation. The manuscript currently demonstrates generalization across architectures including PEFT through end-to-end reconstruction success, but lacks intermediate signal-quality measures. In the revision, we will add experiments reporting signal-to-noise ratios of the key gradient components before and after decomposition, as well as side-by-side tables comparing reconstruction accuracy (exact token match rate, sequence-level BLEU, and ordering fidelity) for PEFT versus full-parameter baselines at multiple batch sizes and sequence lengths. These additions will quantify any performance drop and confirm that sufficient signal is retained for token extraction and ordering. revision: yes
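On the first response: the promised rank-versus-batch-size measurement can be sketched on a toy LoRA-style setup. The layer sizes, the LoRA rank, and the singular-value threshold are assumptions; the point is only the shape of the measurement, contrasting a trainable-head gradient, whose numerical rank grows with the batch until it saturates, with a LoRA adapter gradient, whose rank is capped by the adapter.

```python
# Sketch of the promised measurement: numerical rank of shared gradients as the
# batch grows, for a trainable output head versus a LoRA-style adapter.
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
dim, vocab, lora_r = 64, 500, 8
emb = nn.Embedding(vocab, dim)
base = nn.Linear(dim, dim, bias=False)
lora_A, lora_B = nn.Linear(dim, lora_r, bias=False), nn.Linear(lora_r, dim, bias=False)
head = nn.Linear(dim, vocab)
base.weight.requires_grad_(False)                     # frozen base weight, as in PEFT

def numerical_rank(M, tol=1e-4):
    s = torch.linalg.svdvals(M)
    return int((s > tol * s[0]).sum())

def gradient_ranks(batch_size, seq_len=16):
    for p in (emb.weight, lora_A.weight, lora_B.weight, head.weight, head.bias):
        p.grad = None
    x = torch.randint(0, vocab, (batch_size, seq_len))
    h = base(emb(x)) + lora_B(lora_A(emb(x)))         # LoRA-augmented layer
    F.cross_entropy(head(h).view(-1, vocab), x.view(-1)).backward()
    return numerical_rank(head.weight.grad), numerical_rank(lora_A.weight.grad)

for b in (1, 2, 4, 8, 16, 32):
    full_rank, adapter_rank = gradient_ranks(b)
    print(f"batch {b:2d}: head-gradient rank {full_rank:3d} | LoRA-A gradient rank {adapter_rank}")
```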
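On the second response: the proposed accuracy metrics can be pinned down with small reference implementations. The definitions below (a positional exact-match rate and an LCS-based ordering-fidelity score) are illustrative stand-ins; the paper may define its metrics differently, and sequence-level BLEU would come from a standard library rather than hand-rolled code.

```python
# Illustrative reconstruction metrics: positional match rate and an
# ordering-fidelity score based on the longest common subsequence.
def token_match_rate(recovered: list[int], truth: list[int]) -> float:
    """Fraction of positions where the recovered token equals the true token."""
    n = min(len(recovered), len(truth))
    hits = sum(r == t for r, t in zip(recovered[:n], truth[:n]))
    return hits / max(len(truth), 1)

def ordering_fidelity(recovered: list[int], truth: list[int]) -> float:
    """Longest common subsequence length, normalized by the true sequence length."""
    m, n = len(recovered), len(truth)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            dp[i + 1][j + 1] = (dp[i][j] + 1 if recovered[i] == truth[j]
                                else max(dp[i][j + 1], dp[i + 1][j]))
    return dp[m][n] / max(n, 1)

print(token_match_rate([7, 3, 42, 19], [7, 3, 42, 3, 19]))   # 0.6
print(ordering_fidelity([7, 3, 42, 19], [7, 3, 42, 3, 19]))  # 0.8
```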

Circularity Check

0 steps flagged

No significant circularity detected in derivation chain

Full rationale

The abstract and summary present FedSpy-LLM as a novel gradient decomposition strategy exploiting rank deficiency and subspace structure, with no equations, derivations, or self-citations shown. No load-bearing step reduces by construction to fitted inputs, self-definitions, or author-prior ansatzes. The method is described as new without referencing prior fitted parameters or uniqueness theorems from the same authors. This is the common case of a self-contained proposed technique; the central claim does not collapse to its own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only; no free parameters, axioms, or invented entities are specified or needed to evaluate from the given text.

pith-pipeline@v0.9.0 · 5556 in / 1037 out tokens · 29745 ms · 2026-05-10T18:59:05.800698+00:00 · methodology

