arxiv: 2404.14204 · v5 · pith:OHB7YQ6Cnew · submitted 2024-04-22 · 💻 cs.NI

TrimCaching: Parameter-sharing Edge Caching for AI Model Downloading

Pith reviewed 2026-05-24 02:31 UTC · model grok-4.3

classification 💻 cs.NI
keywords: edge caching, AI model downloading, parameter sharing, cache hit ratio, wireless networks, model placement, submodular optimization

The pith

Sharing parameter blocks across AI models lets edge networks cache more models and raise hit ratios.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops TrimCaching, which places AI models on edge servers by treating shared parameter blocks as reusable storage units rather than caching each model whole. It formulates placement as maximizing the cache hit ratio while trading off storage efficiency against download latency in multi-edge wireless networks. The authors show that the general problem is submodular maximization under submodular constraints, for which no polynomial-time algorithm with a constant-factor approximation guarantee exists. They then supply a (1-ε)/2-approximation algorithm for the practical case of a small fixed number of shared blocks and a greedy heuristic for the general case. Simulations show that the resulting placements achieve higher hit ratios than conventional content caching that ignores parameter overlap.

Core claim

TrimCaching exploits the fact that many AI models share parameter blocks containing reusable knowledge; modeling this overlap turns model placement into a submodular maximization problem whose solution, via a special-case polynomial algorithm or a greedy method, improves cache hit ratio over non-sharing baselines.

What carries the argument

The parameter-sharing model placement formulation that treats shared blocks as common storage items to maximize hit ratio under latency and storage constraints.
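
To make the storage side of this formulation concrete, here is a minimal Python sketch of how a server's footprint could be computed when each distinct block is stored once. The block names, sizes, and the three-model library are hypothetical toy values, not taken from the paper.

```python
from typing import Dict, FrozenSet, Iterable, Set

# Hypothetical toy library: each model is a set of parameter-block IDs.
# Block names and sizes (in MB) are invented for illustration.
BLOCK_SIZE_MB: Dict[str, float] = {
    "backbone": 90.0,  # frozen layers shared by every fine-tuned variant
    "head_a": 8.0,
    "head_b": 8.0,
    "head_c": 10.0,
}
MODEL_BLOCKS: Dict[str, FrozenSet[str]] = {
    "model_a": frozenset({"backbone", "head_a"}),
    "model_b": frozenset({"backbone", "head_b"}),
    "model_c": frozenset({"backbone", "head_c"}),
}


def shared_footprint(models: Iterable[str]) -> float:
    """Storage when each distinct block is stored once (parameter sharing)."""
    blocks: Set[str] = set()
    for name in models:
        blocks |= MODEL_BLOCKS[name]
    return sum(BLOCK_SIZE_MB[b] for b in blocks)


def independent_footprint(models: Iterable[str]) -> float:
    """Storage when every model is cached whole, ignoring overlap."""
    return sum(BLOCK_SIZE_MB[b] for name in models for b in MODEL_BLOCKS[name])


cached = ["model_a", "model_b", "model_c"]
print(shared_footprint(cached))       # 116.0
print(independent_footprint(cached))  # 296.0
```

In this toy library the second and third models each add only their own head to the footprint, which is the diminishing-returns behaviour that makes the storage constraint submodular rather than a plain knapsack sum.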

If this is right

  • Edge servers can store a larger effective catalog of models within the same memory budget.
  • Download latency for users requesting models with shared blocks drops because fewer unique blocks need transmission.
  • The approximation algorithms give network operators a concrete way to compute placements without solving the NP-hard general problem.
  • The framework extends to any set of models whose parameter overlap can be quantified in advance.
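
The first and third points can be illustrated with a small Python sketch of a marginal-gain greedy placement. This is a generic greedy heuristic under stated assumptions, not necessarily the algorithm in the paper: the latency constraint is abstracted into a hypothetical per-request set of servers that can deliver the model in time, and all names, sizes, capacities, and request counts are invented.

```python
from typing import Dict, FrozenSet, List, Set, Tuple

# Hypothetical toy inputs (block sizes and capacities in arbitrary units).
BLOCK_SIZE: Dict[str, int] = {"base": 6, "h1": 1, "h2": 1, "solo": 5}
MODEL_BLOCKS: Dict[str, FrozenSet[str]] = {
    "m1": frozenset({"base", "h1"}),
    "m2": frozenset({"base", "h2"}),
    "m3": frozenset({"solo"}),
}
CAPACITY: Dict[str, int] = {"edge0": 8, "edge1": 7}
# (request count, model, servers assumed able to deliver it within the latency budget)
REQUESTS: List[Tuple[int, str, FrozenSet[str]]] = [
    (4, "m1", frozenset({"edge0"})),
    (3, "m2", frozenset({"edge0", "edge1"})),
    (3, "m3", frozenset({"edge1"})),
]


def footprint(models: Set[str], share: bool) -> int:
    """Storage used by a set of models, with or without block sharing."""
    if not share:
        return sum(BLOCK_SIZE[b] for m in models for b in MODEL_BLOCKS[m])
    blocks: Set[str] = set()
    for m in models:
        blocks |= MODEL_BLOCKS[m]
    return sum(BLOCK_SIZE[b] for b in blocks)


def hits(placement: Dict[str, Set[str]]) -> int:
    """Number of requests served by at least one eligible server."""
    return sum(n for n, model, edges in REQUESTS
               if any(model in placement[e] for e in edges))


def hit_ratio(placement: Dict[str, Set[str]]) -> float:
    return hits(placement) / sum(n for n, _, _ in REQUESTS)


def greedy_placement(share: bool) -> Dict[str, Set[str]]:
    """Repeatedly add the feasible (server, model) pair with the largest hit gain."""
    placement: Dict[str, Set[str]] = {e: set() for e in CAPACITY}
    while True:
        base, best, best_gain = hits(placement), None, 0
        for edge in CAPACITY:
            for model in MODEL_BLOCKS:
                trial = placement[edge] | {model}
                if model in placement[edge] or footprint(trial, share) > CAPACITY[edge]:
                    continue
                gain = hits({**placement, edge: trial}) - base
                if gain > best_gain:
                    best, best_gain = (edge, model), gain
        if best is None:  # no feasible pair improves the hit count
            return placement
        placement[best[0]].add(best[1])


print(hit_ratio(greedy_placement(share=True)))   # 1.0
print(hit_ratio(greedy_placement(share=False)))  # 0.7
```

With sharing, `edge0` fits both `m1` and `m2` because the `base` block is counted once. Without sharing, the same capacity holds only one of them, so the toy hit ratio drops from 1.0 to 0.7.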

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If parameter sharing turns out to be dynamic rather than fixed, the placement decisions would need periodic recomputation as new models arrive.
  • The same sharing idea could apply to caching of other composite objects such as container images or dataset shards that contain overlapping files.
  • Operators might combine TrimCaching with popularity prediction to decide which shared blocks to pre-position on which edges.

Load-bearing premise

A wide range of AI models share a significant proportion of parameter blocks that can be treated as reusable across models.

What would settle it

Running the placement algorithm on a trace of real AI models where the measured shared-block overlap is below the level assumed in the special-case analysis and observing no hit-ratio gain over independent-model caching.

Figures

Figures reproduced from arXiv: 2404.14204 by Fangming Liu, Guanqiao Qu, Jian Li, Kaibin Huang, Qian Chen, Xianhao Chen, Zheng Lin.

Figure 1: Inference accuracy vs. the number of frozen bottom layers in fine-tuned …
Figure 3: The TrimCaching framework in a multi-edge scenario.
Figure 4: An example of the special case with a small fixed number of shared …
Figure 5: The process of updating $T(e_N, \dot{w}_N)$ for edge server $m$, where the first row is the index of the number of cache hits and the first column is the model index in $I_N$.
Figure 6: The process of determining $\hat{X}_{m,N}$, where the first row and the first column are the same as those in …
Figure 7: Cache hit ratio in the special case, where a small fixed number of shared parameter blocks is considered, using the ResNet-based model library. …
Figure 8: Cache hit ratio in the special case using the GPT-2-based model library.
Figure 9: Cache hit ratio in the general case using the ResNet-based model library.
Figure 10: Cache hit ratio and average running time of different algorithms.
Figure 11: Impact of user mobility on the cache hit ratio over time.
Figure 13: Cache hit ratio comparisons with online placement strategies using …

Original abstract

Next-generation mobile networks are expected to facilitate fast AI model downloading to end users. By caching models on edge servers, mobile networks can deliver models to end users with low latency, resulting in a paradigm of edge model caching. In this paper, we develop a novel model placement framework, called parameter-sharing model caching (TrimCaching). TrimCaching exploits the key observation that a wide range of AI models, such as convolutional neural networks or large language models, can share a significant proportion of parameter blocks containing reusable knowledge, thereby improving storage efficiency. To this end, we formulate a parameter-sharing model placement problem to maximize the cache hit ratio in multi-edge wireless networks by balancing the fundamental tradeoff between storage efficiency and service latency. We show that the formulated problem is a submodular maximization problem with submodular constraints, for which no polynomial-time approximation algorithm exists. To tackle this challenge, we study an important special case, where a small fixed number of parameter blocks are shared across models, which often holds in practice. In such a case, a polynomial-time algorithm with a $\left(1-\epsilon\right)/2$-approximation guarantee is developed. Subsequently, we address the original problem for the general case by developing a greedy algorithm. Simulation results demonstrate that the proposed TrimCaching framework significantly improves the cache hit ratio compared with state-of-the-art content caching without exploiting shared parameters in AI models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper proposes TrimCaching, a parameter-sharing edge caching framework for AI model downloading in multi-edge wireless networks. It formulates the model placement problem as maximizing cache hit ratio by exploiting shared parameter blocks across models (e.g., CNNs or LLMs), shows the problem is submodular maximization under submodular constraints (no poly-time approximation exists in general), develops a polynomial-time (1-ε)/2-approximation algorithm for the special case of a small fixed number of shared blocks, proposes a greedy algorithm for the general case, and reports via simulations that TrimCaching significantly improves cache hit ratio over state-of-the-art content caching methods that ignore parameter sharing.

Significance. If the algorithmic guarantees and simulation improvements hold, the work addresses a timely problem in edge computing for large AI models by trading off storage efficiency against latency through parameter reuse. The submodular formulation, the special-case approximation guarantee, and the practical observation about fixed shared blocks are strengths that could influence caching designs in 5G/6G networks.

major comments (3)
  1. [Abstract] The central simulation claim (significant cache hit ratio improvement) is load-bearing for the paper's contribution, yet the abstract provides no details on simulation setup, number of trials, error bars, specific network parameters, or how submodularity was verified; this prevents assessment of whether the reported gains are robust or reproducible.
  2. [Abstract] The key modeling assumption that 'a wide range of AI models... share a significant proportion of parameter blocks... often holds with a small fixed number of shared blocks in practice' is stated without quantification or reference to concrete model families (e.g., specific CNN or LLM parameter overlap statistics); this assumption directly enables the special-case algorithm and must be supported for the (1-ε)/2 guarantee to be practically relevant.
  3. [Abstract] The claim that the formulated problem is 'a submodular maximization problem with submodular constraints, for which no polynomial-time approximation algorithm exists' is presented without a proof sketch or reference to the specific submodular constraint functions; verification of both submodularity and the inapproximability result is required to justify moving to the special-case and greedy algorithms.
minor comments (1)
  1. [Abstract] The approximation ratio is written as $(1-\epsilon)/2$; clarify whether $\epsilon$ is a user-specified parameter or derived from the number of shared blocks.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major comment below and will revise the abstract to better support the claims while respecting length constraints.

  1. Referee: [Abstract] The central simulation claim (significant cache hit ratio improvement) is load-bearing for the paper's contribution, yet the abstract provides no details on simulation setup, number of trials, error bars, specific network parameters, or how submodularity was verified; this prevents assessment of whether the reported gains are robust or reproducible.

    Authors: We agree that the abstract would benefit from additional details on the simulation results. In the revision, we will incorporate key elements such as the number of Monte Carlo trials, mention of error bars, and primary network parameters (e.g., number of edge servers and model sizes). Submodularity is established theoretically (see Section III); we will add a reference to this section. The complete simulation methodology remains in Section V.
    Revision: yes

  2. Referee: [Abstract] The key modeling assumption that 'a wide range of AI models... share a significant proportion of parameter blocks... often holds with a small fixed number of shared blocks in practice' is stated without quantification or reference to concrete model families (e.g., specific CNN or LLM parameter overlap statistics); this assumption directly enables the special-case algorithm and must be supported for the (1-ε)/2 guarantee to be practically relevant.

    Authors: We acknowledge that the assumption requires stronger support. We will revise the abstract (or move supporting text to the introduction) to include quantitative examples drawn from the literature on CNNs (e.g., ResNet variants) and LLMs (e.g., BERT/GPT fine-tuning), citing typical shared-block overlap statistics of 30-60%. This will directly justify the practical relevance of the special-case algorithm.
    Revision: yes

  3. Referee: [Abstract] The claim that the formulated problem is 'a submodular maximization problem with submodular constraints, for which no polynomial-time approximation algorithm exists' is presented without a proof sketch or reference to the specific submodular constraint functions; verification of both submodularity and the inapproximability result is required to justify moving to the special-case and greedy algorithms.

    Authors: The abstract summarizes results whose full proofs appear in Section III, where we prove submodularity of both the objective (cache-hit ratio) and the per-edge storage constraints, and invoke the known inapproximability of submodular maximization under submodular knapsack constraints. We will revise the abstract to add an explicit reference to Section III and briefly identify the constraint functions.
    Revision: partial
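
For readers who want the objects in this exchange spelled out, the following LaTeX sketch writes one plausible form of the placement problem and the diminishing-returns definition of submodularity. All symbols are hypothetical notation introduced here for illustration; the paper's own notation and exact constraint set may differ.

```latex
% Hypothetical notation, introduced here for illustration only.
% X_m \subseteq \mathcal{I}: models cached on edge server m;  X = (X_1, \dots, X_M).
% \mathcal{B}_i: parameter blocks of model i;  s_j: size of block j;  Q_m: storage of server m.
% p_{k,i}: probability that user k requests model i.
% \mathcal{M}_{k,i}: servers assumed able to deliver model i to user k within the latency budget.

\begin{align}
  \max_{X} \quad & U(X) = \sum_{k}\sum_{i} p_{k,i}\,
      \mathbb{1}\!\left[\exists\, m \in \mathcal{M}_{k,i} : i \in X_m\right] \\
  \text{s.t.} \quad & g_m(X_m) = \sum_{j \,\in\, \bigcup_{i \in X_m} \mathcal{B}_i} s_j \;\le\; Q_m,
      \qquad \forall m .
\end{align}

% Submodularity (diminishing returns) of a set function f:
% for all A \subseteq B and every element e \notin B,
\begin{equation}
  f(A \cup \{e\}) - f(A) \;\ge\; f(B \cup \{e\}) - f(B).
\end{equation}
```

Under this notation, $g_m$ is submodular because adding a model to a larger cached set can contribute no more new block bytes than adding it to a smaller one, and $U$ is a weighted coverage function, which is also submodular.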

Circularity Check

0 steps flagged

No significant circularity in derivation chain

The paper formulates a parameter-sharing model placement problem as submodular maximization with submodular constraints, develops a polynomial-time approximation for the special case of fixed shared blocks and a greedy algorithm for the general case, then validates via simulation. These steps rest on standard submodular optimization properties and the independent observation about shared AI model parameters; no equation or result reduces by construction to a fitted input, self-citation, or renamed known pattern. Simulations provide external empirical evidence rather than a forced prediction. The derivation is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests primarily on the domain assumption that AI models share reusable parameter blocks in practice, presented as an observation without independent evidence or proof in the abstract.

axioms (1)
  • Domain assumption: A wide range of AI models share a significant proportion of parameter blocks containing reusable knowledge, often with a small fixed number shared across models.
    Stated in the abstract as the key observation enabling the framework.

Reference graph

Works this paper leans on

74 extracted references · 74 canonical work pages

  1. [1]

    TrimCaching: Parameter- sharing AI model caching in wireless edge networks,

    G. Qu, Z. Lin, F. Liu, X. Chen, and K. Huang, “TrimCaching: Parameter- sharing AI model caching in wireless edge networks,” in Proc. IEEE Int. Conf. Distrib. Comput. Syst. (ICDCS) , Jersey City, NJ, USA, Jul. 2024, pp. 36–46

  2. [2]

    D²-JSCC: Digital deep joint source-channel coding for semantic communications,

    J. Huang, K. Yuan, C. Huang, and K. Huang, “D²-JSCC: Digital deep joint source-channel coding for semantic communications,” IEEE J. Sel. Areas Commun., vol. 43, no. 4, pp. 1246–1261, Apr. 2025

  3. [3]

    Semantic sleuth: Identifying ponzi contracts via large language models,

    C. Wu, J. Chen, Z. Wang, R. Liang, and R. Du, “Semantic sleuth: Identifying ponzi contracts via large language models,” in Proc. 39th IEEE/ACM Int. Conf. Autom. Softw. Eng. , ser. ASE ’24, Oct. 2024, p. 582–593

  4. [4]

    Toward full-scene domain generalization in multi-agent collaborative bird’s eye view segmentation for connected and autonomous driving,

    S. Hu, Z. Fang, Y . Deng, X. Chen, Y . Fang, and S. Kwong, “Toward full-scene domain generalization in multi-agent collaborative bird’s eye view segmentation for connected and autonomous driving,” IEEE Trans. Intell. Transp. Syst. , vol. 26, no. 2, pp. 1783–1796, Feb. 2025

  5. [5]

    Fedhome: Cloud-edge based personalized federated learning for in-home health monitoring,

    Q. Wu, X. Chen, Z. Zhou, and J. Zhang, “Fedhome: Cloud-edge based personalized federated learning for in-home health monitoring,” IEEE Trans. Mobile Comput. , vol. 21, no. 8, pp. 2818–2832, Aug. 2022

  6. [6]

    Federated learning for smart healthcare: A survey,

    D. C. Nguyen, Q.-V . Pham, P. N. Pathirana, M. Ding, A. Seneviratne, Z. Lin, O. Dobre, and W.-J. Hwang, “Federated learning for smart healthcare: A survey,” ACM Comput. Surv. , vol. 55, no. 3, pp. 1–37, Feb. 2022

  7. [7]

    Privacy risks in reinforcement learning for household robots,

    M. Li, W. Ding, and D. Zhao, “Privacy risks in reinforcement learning for household robots,” in 2024 IEEE International Conference on Robotics and Automation (ICRA) . IEEE, 2024, pp. 5148–5154

  8. [8]

    FedMeld: A model-dispersal feder- ated learning framework for space-ground integrated networks,

    Q. Chen, X. Chen, and K. Huang, “FedMeld: A model-dispersal feder- ated learning framework for space-ground integrated networks,” arXiv preprint arXiv:2412.17231, 2024

  9. [9]

    EchoHand: High accuracy and presentation attack resistant hand authentication on commodity mobile devices,

    C. Wu, J. Chen, K. He, Z. Zhao, R. Du, and C. Zhang, “EchoHand: High accuracy and presentation attack resistant hand authentication on commodity mobile devices,” in Proc. 2022 ACM SIGSAC Conf. Comput. Commun. Secur., ser. CCS ’22, Nov. 2022, p. 2931–2945

  10. [10]

    2021, version 18.2.0

    3GPP, “3rd generation partnership project; Technical specification group services and system aspects; Study on traffic characteristics and perfor- mance requirements for AI/ML model transfer in 5GS; (Release 18),” 3rd Generation Partnership Project (3GPP), Technical Specification (TS) 22.874, Dec. 2021, version 18.2.0

  11. [11]

    Notable site recognition using deep learning on mobile and crowd-sourced imagery,

    J. Tan, A. Noulas, D. S ´aez, and R. Schifanella, “Notable site recognition using deep learning on mobile and crowd-sourced imagery,” in Proc. 2020 21st IEEE Int. Conf. Mobile Data Manage. (MDM) , Versailles, France, Aug. 2020, pp. 137–147

  12. [12]

    Sense4FL: Vehicular crowdsensing enhanced federated learning for autonomous driving,

    Y . Ma, S. Hu, Z. Fang, Y . Ji, Y . Deng, and Y . Fang, “Sense4FL: Vehicular crowdsensing enhanced federated learning for autonomous driving,” arXiv preprint arXiv:2503.17697 , 2025

  13. [13]

    Space-ground fluid AI for 6G edge intelligence,

    Q. Chen, Z. Wang, X. Chen, J. Wen, D. Zhou, S. Ji, M. Sheng, and K. Huang, “Space-ground fluid AI for 6G edge intelligence,” arXiv preprint arXiv:2411.15845, 2024

  14. [14]

    A joint learning and communications framework for federated learning over wireless networks,

    M. Chen, Z. Yang, W. Saad, C. Yin, H. V . Poor, and S. Cui, “A joint learning and communications framework for federated learning over wireless networks,” IEEE Trans. Wireless Commun. , vol. 20, no. 1, pp. 269–283, Jan. 2021

  15. [15]

    Byzantine-robust federated learning via cosine similarity aggregation,

    T. Zhu, Z. Guo, C. Yao, J. Tan, S. Dou, W. Wang, and Z. Han, “Byzantine-robust federated learning via cosine similarity aggregation,” Comput. Netw., vol. 254, p. 110730, Dec. 2024

  16. [16]

    Ultra- low-latency edge inference for distributed sensing,

    Z. Wang, A. E. Kalør, Y . Zhou, P. Popovski, and K. Huang, “Ultra- low-latency edge inference for distributed sensing,” arXiv preprint arXiv:2407.13360, 2024

  17. [17]

    Priori- tized information bottleneck theoretic framework with distributed online learning for edge video analytics,

    Z. Fang, S. Hu, J. Wang, Y . Deng, X. Chen, and Y . Fang, “Priori- tized information bottleneck theoretic framework with distributed online learning for edge video analytics,” IEEE/ACM Trans. Netw., pp. 1–17, early access 2025

  18. [18]

    Gemel: Model merging for memory-efficient, real-time video analytics at the edge,

    A. Padmanabhan, N. Agarwal, A. Iyer, G. Ananthanarayanan, Y . Shu, N. Karianakis, G. H. Xu, and R. Netravali, “Gemel: Model merging for memory-efficient, real-time video analytics at the edge,” in Proc. USENIX Symp. Netw. Syst. Des. Implement. (NSDI) , Boston, MA, USA, Apr. 2023, pp. 973–994

  19. [19]

    Pushing large language models to the 6G edge: Vision, challenges, and opportunities,

    Z. Lin, G. Qu, Q. Chen, X. Chen, Z. Chen, and K. Huang, “Pushing large language models to the 6G edge: Vision, challenges, and opportunities,” arXiv preprint arXiv:2309.16739 , 2023

  20. [20]

    Transfer learning & fine-tuning,

    TenserFlow, “Transfer learning & fine-tuning,” 2023. [Online]. Available: https://www.tensorflow.org/guide/keras/transfer learning# introduction

  21. [21]

    A comprehensive survey on transfer learning,

    F. Zhuang, Z. Qi, K. Duan, D. Xi, Y . Zhu, H. Zhu, H. Xiong, and Q. He, “A comprehensive survey on transfer learning,” Proc. IEEE, vol. 109, no. 1, pp. 43–76, Jan. 2020

  22. [22]

    Spottune: Transfer learning through adaptive fine-tuning,

    Y . Guo, H. Shi, A. Kumar, K. Grauman, T. Rosing, and R. Feris, “Spottune: Transfer learning through adaptive fine-tuning,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR) , Long Beach, CA, USA, Jun. 2019

  23. [23]

    PACP: Priority-aware collaborative perception for connected and autonomous vehicles,

    Z. Fang, S. Hu, H. An, Y . Zhang, J. Wang, H. Cao, X. Chen, and Y . Fang, “PACP: Priority-aware collaborative perception for connected and autonomous vehicles,” IEEE Trans. Mobile Comput. , pp. 15 003– 15 018, Dec. 2024

  24. [24]

    LoRA: Low-rank adaptation of large language models,

    E. J. Hu, yelong shen, P. Wallis, Z. Allen-Zhu, Y . Li, S. Wang, L. Wang, and W. Chen, “LoRA: Low-rank adaptation of large language models,” in Proc. Int. Conf. Learn. Represent. (ICLR) , Apr. 2022

  25. [25]

    Deep residual learning for image recognition,

    K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Las Vegas, NV , USA, Jun. 2016, pp. 770–778

  26. [26]

    Learning multiple layers of features from tiny images,

    A. Krizhevsky, G. Hinton et al. , “Learning multiple layers of features from tiny images,” Apr. 2009

  27. [27]

    The E2E dataset: New challenges for end-to-end generation,

    J. Novikova, O. Du ˇsek, and V . Rieser, “The E2E dataset: New challenges for end-to-end generation,” in Proc. Annu. SIGdial Meeting Disc. Dialogue, Saarbr ¨ucken, Germany, Aug. 2017, pp. 201–206

  28. [28]

    Cre- ating training corpora for NLG micro-planners,

    C. Gardent, A. Shimorina, S. Narayan, and L. Perez-Beltrachini, “Cre- ating training corpora for NLG micro-planners,” in Proc. Annu. Meet. Assoc. Comput. Linguist. (ACL), Vancouver, Canada, Jul. 2017, pp. 179– 188

  29. [29]

    Dart: Open-domain structured data record to text generation,

    L. Nan, D. Radev, R. Zhang, A. Rau, A. Sivaprasad, C. Hsieh, X. Tang, A. Vyas, N. Verma, P. Krishna, Y . Liu, N. Irwanto, J. Pan, F. Rahman, A. Zaidi, M. Mutuma, Y . Tarabar, A. Gupta, T. Yu, Y . C. Tan, X. V . Lin, C. Xiong, R. Socher, and N. F. Rajani, “Dart: Open-domain structured data record to text generation,” arXiv preprint arXiv:2007.02871, 2021

  30. [30]

    Femtocaching: Wireless content delivery through distributed caching helpers,

    K. Shanmugam, N. Golrezaei, A. G. Dimakis, A. F. Molisch, and G. Caire, “Femtocaching: Wireless content delivery through distributed caching helpers,” IEEE Trans. Inf. Theory , vol. 59, no. 12, pp. 8402– 8413, Dec. 2013

  31. [31]

    On the complexity of optimal content placement in hierarchical caching networks,

    K. Poularakis and L. Tassiulas, “On the complexity of optimal content placement in hierarchical caching networks,” IEEE Trans. Commun. , vol. 64, no. 5, pp. 2092–2103, Mar. 2016

  32. [32]

    On the complexity of optimal re- quest routing and content caching in heterogeneous cache networks,

    M. Dehghan, B. Jiang, A. Seetharam, T. He, T. Salonidis, J. Kurose, D. Towsley, and R. Sitaraman, “On the complexity of optimal re- quest routing and content caching in heterogeneous cache networks,” IEEE/ACM Trans. Netw., vol. 25, no. 3, pp. 1635–1648, Jun. 2017. 15

  33. [33]

    Edge-caching wireless networks: Performance analysis and optimization,

    T. X. Vu, S. Chatzinotas, and B. Ottersten, “Edge-caching wireless networks: Performance analysis and optimization,” IEEE Trans. Wireless Commun., vol. 17, no. 4, pp. 2827–2839, Apr. 2018

  34. [34]

    Caching at the wireless edge: Design aspects, challenges, and future directions,

    D. Liu, B. Chen, C. Yang, and A. F. Molisch, “Caching at the wireless edge: Design aspects, challenges, and future directions,” IEEE Commun. Mag., vol. 54, no. 9, pp. 22–28, Sep. 2016

  35. [35]

    On energy-efficient edge caching in heterogeneous networks,

    F. Gabry, V . Bioglio, and I. Land, “On energy-efficient edge caching in heterogeneous networks,” IEEE J. Sel. Areas Commun. , vol. 34, no. 12, pp. 3288–3298, Dec. 2016

  36. [36]

    Delay-minimized edge caching in heterogeneous vehicular networks: A matching-based approach,

    H. Wu, J. Chen, W. Xu, N. Cheng, W. Shi, L. Wang, and X. Shen, “Delay-minimized edge caching in heterogeneous vehicular networks: A matching-based approach,” IEEE Trans. Wireless Commun. , vol. 19, no. 10, pp. 6409–6424, Oct. 2020

  37. [37]

    Latency minimization for content delivery networks with wireless edge caching,

    T. X. Vu, L. Lei, S. Vuppala, A. Kalantari, S. Chatzinotas, and B. Otter- sten, “Latency minimization for content delivery networks with wireless edge caching,” in Proc. IEEE Int. Conf.Commun. (ICC) , Kansas City, MO, USA, May 2018, pp. 1–6

  38. [38]

    PartialLoading: User scheduling and bandwidth allocation for parameter-sharing edge inference,

    G. Qu, Q. Chen, X. Chen, K. Huang, and Y . Fang, “PartialLoading: User scheduling and bandwidth allocation for parameter-sharing edge inference,” arXiv preprint arXiv:2503.22982 , 2025

  39. [39]

    Edge intel- ligence: Architectures, challenges, and applications,

    D. Xu, T. Li, Y . Li, X. Su, S. Tarkoma, and P. Hui, “Edge intel- ligence: Architectures, challenges, and applications,” arXiv preprint arXiv:2003.12172, 2020

  40. [40]

    Cached model-as-a-resource: Provisioning large language model agents for edge intelligence in space-air-ground integrated networks,

    M. Xu, D. Niyato, H. Zhang, J. Kang, Z. Xiong, S. Mao, and Z. Han, “Cached model-as-a-resource: Provisioning large language model agents for edge intelligence in space-air-ground integrated networks,” arXiv preprint arXiv:2403.05826, 2024

  41. [41]

    Pipelining split learning in multi-hop edge networks,

    W. Wei, Z. Lin, T. Li, X. Li, and X. Chen, “Pipelining split learning in multi-hop edge networks,” arXiv preprint arXiv:2505.04368 , 2025

  42. [42]

    Split learning in 6G edge networks,

    Z. Lin, G. Qu, X. Chen, and K. Huang, “Split learning in 6G edge networks,” IEEE Wireless Commun., vol. 31, no. 4, pp. 170–176, Aug. 2024

  43. [43]

    QoS-aware placement of deep learning services on the edge with multiple service implemen- tations,

    N. Hudson, H. Khamfroush, and D. E. Lucani, “QoS-aware placement of deep learning services on the edge with multiple service implemen- tations,” in Proc. Int. Conf. on Comput. Commun. and Netw. (ICCCN) , Athens, Greece, Jul. 2021, pp. 1–8

  44. [44]

    In-situ model downloading to realize versatile edge AI in 6G mobile networks,

    K. Huang, H. Wu, Z. Liu, and X. Qi, “In-situ model downloading to realize versatile edge AI in 6G mobile networks,” IEEE Commun. Mag., vol. 30, no. 3, pp. 96–102, 2023

  45. [45]

    Mobile edge intelligence for large language models: A contemporary survey,

    G. Qu, Q. Chen, W. Wei, Z. Lin, X. Chen, and K. Huang, “Mobile edge intelligence for large language models: A contemporary survey,” IEEE Commun. Surveys Tuts., pp. 1–42, early access 2025

  46. [46]

    Efficient multiuser AI downloading via reusable knowledge broadcasting,

    H. Wu, Q. Zeng, and K. Huang, “Efficient multiuser AI downloading via reusable knowledge broadcasting,” IEEE Trans. Wireless Commun., vol. 23, no. 8, pp. 10 459–10 472, Aug. 2024

  47. [47]

    Collaborative edge caching in context-aware device-to-device networks,

    X. Zhao, P. Yuan, H. li, and S. Tang, “Collaborative edge caching in context-aware device-to-device networks,” IEEE Trans. Veh. Technol. , vol. 67, no. 10, pp. 9583–9596, Oct. 2018

  48. [48]

    Soft cache hits: Improving performance through recommendation and delivery of related content,

    P. Sermpezis, T. Giannakas, T. Spyropoulos, and L. Vigneri, “Soft cache hits: Improving performance through recommendation and delivery of related content,” IEEE J. Sel. Areas Commun. , vol. 36, no. 6, pp. 1300– 1313, Jun. 2018

  49. [49]

    Multimedia caching strategies for hetero- geneous application and server environments,

    A. Dan and D. Sitaram, “Multimedia caching strategies for hetero- geneous application and server environments,” Multimed. Tools Appl. , vol. 4, pp. 279–312, May 1997

  50. [50]

    Jointly optimizing content caching and recommendations in small cell net- works,

    L. E. Chatzieleftheriou, M. Karaliopoulos, and I. Koutsopoulos, “Jointly optimizing content caching and recommendations in small cell net- works,” IEEE Trans. Mobile Comput. , vol. 18, no. 1, pp. 125–138, Jan. 2019

  51. [51]

    A survey of caching mechanisms in information-centric networking,

    M. Zhang, H. Luo, and H. Zhang, “A survey of caching mechanisms in information-centric networking,” IEEE Commun. Surveys Tuts., vol. 17, no. 3, pp. 1473–1499, 3rd Quart. 2015

  52. [52]

    Fujishige, Submodular functions and optimization

    S. Fujishige, Submodular functions and optimization . New York, NY , USA: Elsevier, 2005

  53. [53]

    Lov ´asz, Mathematical Programming The State of the Art: Bonn

    L. Lov ´asz, Mathematical Programming The State of the Art: Bonn

  54. [54]

    Submodular functions and convexity, pp

    Berlin, Heidelberg: Springer, 1983, ch. Submodular functions and convexity, pp. 235–257

  55. [55]

    Autotune: Automatically tuning convolutional neural networks for improved transfer learning,

    S. S. Basha, S. K. Vinakota, V . Pulabaigari, S. Mukherjee, and S. R. Dubey, “Autotune: Automatically tuning convolutional neural networks for improved transfer learning,” Neural Netw., vol. 133, pp. 112–122, Jan. 2021

  56. [56]

    A comprehensive survey on transfer learning,

    F. Zhuang, Z. Qi, K. Duan, D. Xi, Y . Zhu, H. Zhu, H. Xiong, and Q. He, “A comprehensive survey on transfer learning,” Proc. IEEE, vol. 109, no. 1, pp. 43–76, Jan. 2021

  57. [57]

    Models and pre-trained weights

    PyTorch, “Models and pre-trained weights.” [Online]. Available: https://docs.pytorch.org/vision/main/models.html

  58. [58]

    Introducing Apple’s on-device and server foundation models,

    Apple, “Introducing Apple’s on-device and server foundation models,”

  59. [59]

    Available: https://machinelearning.apple.com/research/ introducing-apple-foundation-models

    [Online]. Available: https://machinelearning.apple.com/research/ introducing-apple-foundation-models

  60. [60]

    Serving customized AI models at scale with LoRA,

    IBM, “Serving customized AI models at scale with LoRA,” 2024. [Online]. Available: https://research.ibm.com/blog/LoRAs-explained

  61. [61]

    Submodular optimization with submodular cover and submodular knapsack constraints,

    R. K. Iyer and J. A. Bilmes, “Submodular optimization with submodular cover and submodular knapsack constraints,” in Proc. Adv. Neural Inform. Process. Syst. (NeurIPS) , Stateline, NV , USA, Dec. 2013, pp. 1–9

  62. [62]

    Fast semidifferential-based submodular function optimization,

    R. Iyer, S. Jegelka, and J. Bilmes, “Fast semidifferential-based submodular function optimization,” in Proc. Int. Conf. Mach. Learn. (ICML), Atlanta, USA, Jun. 2013, pp. 855–863

  63. [63]

    3rd generation partnership project; Technical specification group radio access network; NR; Base station (BS) radio transmission and reception; (Release 18),

    3GPP, “3rd generation partnership project; Technical specification group radio access network; NR; Base station (BS) radio transmission and reception; (Release 18),” 3rd Generation Partnership Project (3GPP), Technical Specification (TS) 38.104, Dec. 2024, version 18.8.0

  64. [64]

    Mobile backhaul: An overview,

    GSMA, “Mobile backhaul: An overview,” 2019. [Online]. Available: https://www.gsma.com/futurenetworks/wiki/mobile-backhaul-an-overview/

  65. [65]

    Ultra-dense 5G small cell deployment for fiber and wireless backhaul-aware infrastructures,

    A. L. Rezaabad, H. Beyranvand, J. A. Salehi, and M. Maier, “Ultra-dense 5G small cell deployment for fiber and wireless backhaul-aware infrastructures,” IEEE Trans. Veh. Technol., vol. 67, no. 12, pp. 12231–12243, Dec. 2018

  66. [66]

    Spectral efficiency analysis of cell-free massive MIMO systems with zero-forcing detector,

    P. Liu, K. Luo, D. Chen, and T. Jiang, “Spectral efficiency analysis of cell-free massive MIMO systems with zero-forcing detector,” IEEE Trans. Wireless Commun., vol. 19, no. 2, pp. 795–807, Feb. 2020

  67. [67]

    Relative frequency as a determinant of phonetic change,

    G. K. Zipf, “Relative frequency as a determinant of phonetic change,” Harvard Studies in Classical Philology, vol. 40, pp. 1–95, 1929

  68. [68]

    Social and spatial proactive caching for mobile data offloading,

    E. Baştuğ, M. Bennis, and M. Debbah, “Social and spatial proactive caching for mobile data offloading,” in Proc. IEEE Int. Conf. Commun. Workshops (ICC Wkshps), Sydney, NSW, Australia, Jun. 2014, pp. 581–586

  69. [69]

    Learning-aided content placement in caching-enabled fog computing systems using Thompson sampling,

    J. Zhu, X. Huang, and Z. Shao, “Learning-aided content placement in caching-enabled fog computing systems using Thompson sampling,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP), Barcelona, Spain, May 2020, pp. 5060–5064

  70. [70]

    A survey of ICN content naming and in-network caching in 5G and beyond networks,

    O. Serhane, K. Yahyaoui, B. Nour, and H. Moungla, “A survey of ICN content naming and in-network caching in 5G and beyond networks,” IEEE Internet Things J., vol. 8, no. 6, pp. 4081–4104, Mar. 2021

  71. [71]

    On caching and routing in information-centric networks,

    A. Seetharam, “On caching and routing in information-centric networks,” IEEE Commun. Mag., vol. 56, no. 3, pp. 204–209, Mar. 2018

  72. [72]

    Approximability issues for unconstrained and constrained maximization of half-product related functions,

    H. Kellerer, R. Sarto Basso, and V. A. Strusevich, “Approximability issues for unconstrained and constrained maximization of half-product related functions,” Theor. Comput. Sci., vol. 659, pp. 64–71, Jan. 2017

  73. [73]

    Approximation algorithms for the multiple knapsack problem with assignment restrictions,

    M. Dawande, J. Kalagnanam, P. Keskinocak, F. S. Salman, and R. Ravi, “Approximation algorithms for the multiple knapsack problem with assignment restrictions,” J. Comb. Optim., vol. 4, pp. 171–186, 2000

  74. [74]

    A polynomial time approximation scheme for the multiple knapsack problem,

    C. Chekuri and S. Khanna, “A polynomial time approximation scheme for the multiple knapsack problem,” SIAM J. Comput., vol. 35, no. 3, pp. 713–728, 2005.

APPENDIX A PROOF OF PROPOSITION 1

We begin by introducing a few statements. For any feasible X, let $\eta(X) = \{x_{m,i} \mid x_{m,i} = 1,\ x_{m,i} \in X\}$ denote the set of model caching decisions with $x_{m,i} = 1$, IX =...