pith. machine review for the scientific record.

arxiv: 2604.16583 · v1 · submitted 2026-04-17 · 💻 cs.LG · cs.AI


POLAR: Online Learning for LoRA Adapter Caching and Routing in Edge LLM Serving


Pith reviewed 2026-05-10 08:47 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords adapter cache · adapters · polar · routing · learning · lora · online

The pith

POLAR formulates joint LoRA adapter caching and routing as a two-timescale contextual bandit, achieving sublinear regret bounds and outperforming non-adaptive baselines in experiments with real adapters.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Edge devices running large language models often use many small LoRA adapters to customize responses for different tasks, but only a few can stay loaded in fast memory at once. Loading an adapter from storage adds delay to each request it serves. POLAR splits decisions across two timescales: a slow one that picks which adapters to keep in memory, and a fast one that routes each request to an adapter whose usefulness depends on the request's context. It models this as a contextual bandit problem in which the router learns from feedback on every request while the cache controller updates only at epoch boundaries. A fixed-epoch version uses constant periods between cache updates for simple worst-case guarantees; the improved version, POLAR+, doubles the epoch length after each update and adds forced exploration for stronger guarantees under stochastic assumptions. Tests with 15 real adapters on a 7B model, using measured GPU paging latencies, showed the adaptive approach substantially outperformed fixed resident sets and simple caching rules.
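To make the two timescales concrete, here is a minimal sketch of the control loop, assuming a LinUCB router over d-dimensional contexts and a paging penalty for non-resident adapters. The exploration width ALPHA, the PAGING_COST value, the synthetic contexts and rewards, and the frequency-based cache refresh are all illustrative placeholders; the paper's cache-optimization step and constants differ (its Figure 11 notes that purely frequency-based policies underperform).

```python
# Minimal sketch of a two-timescale cache-and-route loop in the spirit of
# POLAR (illustrative only; the paper's exact update rules differ).
import numpy as np

N, K, d, T = 15, 4, 16, 10_000   # adapters, cache slots, context dim, horizon
ALPHA = 1.0                       # LinUCB exploration width (assumed)
PAGING_COST = 0.3                 # penalty for routing to a non-resident adapter (assumed)

# Per-adapter LinUCB state: A_i = I + sum_t x_t x_t^T, b_i = sum_t r_t x_t.
A = [np.eye(d) for _ in range(N)]
b = [np.zeros(d) for _ in range(N)]
cache = set(range(K))             # initial resident set
counts = np.zeros(N)              # usage stats for the placeholder cache rule
epoch_end = 1

def ucb_score(i, x, resident):
    """Optimistic utility estimate, discounted by paging cost if not resident."""
    A_inv = np.linalg.inv(A[i])
    theta_hat = A_inv @ b[i]
    bonus = ALPHA * np.sqrt(x @ A_inv @ x)
    return theta_hat @ x + bonus - (0.0 if resident else PAGING_COST)

for t in range(1, T + 1):
    x = np.random.randn(d) / np.sqrt(d)        # stand-in for a request context
    # Fast timescale: cache-aware routing of this request.
    i = max(range(N), key=lambda j: ucb_score(j, x, j in cache))
    r = float(x.sum() > 0)                     # stand-in for observed utility feedback
    A[i] += np.outer(x, x)
    b[i] += r * x
    counts[i] += 1
    # Slow timescale: refresh the resident set only at epoch boundaries.
    if t == epoch_end:
        cache = set(int(j) for j in np.argsort(counts)[-K:])
        epoch_end *= 2                          # POLAR+-style epoch doubling
```

The key coupling the paper emphasizes is visible even here: the cache determines the cost of exploring a non-resident adapter, and the router's choices determine which adapters accumulate the statistics the cache controller acts on.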

Core claim

POLAR+ achieves sublinear regret of order Õ(d√(NT) + √(KT)) under stochastic regularity and cacheability conditions, where N is the adapter count, K the cache size, d the context dimension, and T the horizon. The routing term matches standard contextual-bandit rates up to logarithmic factors, and experiments show adaptive cache control substantially outperforms non-adaptive baselines.
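One way to read the bound, inferred from the abstract's phrasing rather than stated by the paper, is as a two-part decomposition of cumulative regret:

```latex
% A plausible reading of the POLAR+ bound (requires amsmath): cumulative
% regret splits into a routing term at the contextual-bandit rate and a
% caching term; the paper's formal benchmark and decomposition may differ.
\[
R(T) \;=\;
\underbrace{\widetilde{\mathcal{O}}\!\left(d\sqrt{NT}\right)}_{\text{learning to route over } N \text{ adapters}}
\;+\;
\underbrace{\widetilde{\mathcal{O}}\!\left(\sqrt{KT}\right)}_{\text{tracking the best size-}K\text{ resident set}}
\]
```

Under this reading, the claim that "the memory hierarchy does not fundamentally slow routing learning" amounts to the first term carrying no extra factor from cache misses beyond logarithms.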

Load-bearing premise

The stochastic regularity and cacheability conditions on adapter utilities and contexts hold in practice, allowing the epoch-doubling and forced exploration to deliver the stated regret bounds without excessive exploration cost.

Figures

Figures reproduced from arXiv: 2604.16583 by Jian Li, Shaoang Li.

Figure 1. Overview of POLAR+. The cache controller (slow timescale) updates the resident set C_ℓ at each epoch boundary; the LinUCB router (fast timescale) selects an adapter for each request x_t. The library M holds N adapters on SSD; the GPU keeps a resident set of size K that evolves C_1 → C_2 → C_3 → C_4 ≈ C† as epoch lengths double (green "+" marks newly admitted adapters). Request x_3 hits a cached adapter (hot pat…
Figure 5. Operational regret decomposition vs. α.
Figure 8. Cumulative cache updates. [Plot: Regret(T) (500–2000) vs. cache size K (1–7) for POLAR+, POLAR, and Static Cache.]
Figure 11. Cache learning diagnostic. Left: Jaccard(C_ℓ(t), C*_T) over t. Right: quality loss of using C_ℓ(t) on the next 2,000 contexts. POLAR+ converges rapidly to a high-quality resident set; frequency-based policies do not. … eventually reaches a comparable Jaccard overlap, but only near t ≈ 3 × 10^4, and its steady-state quality loss remains around 5.7, roughly twice that of POLAR+. LRU and LFU, which react to re…
Original abstract

Edge deployment of large language models (LLMs) increasingly relies on libraries of lightweight LoRA adapters, yet GPU/DRAM can keep only a small resident subset at a time. Serving a request through a non-resident adapter requires paging its weights from storage, incurring measurable latency. This creates a two-timescale online control problem: on a slow timescale, the system selects which adapters remain resident in fast memory, while on a fast timescale it routes each request to an adapter whose context-dependent utility is unknown a priori. The two decisions are tightly coupled: the cache determines the cost of exploration, and the router determines which adapters receive informative feedback. We formulate this joint caching-and-routing problem as a two-timescale contextual bandit and propose POLAR (Paging and Online Learning for Adapter Routing). POLAR pairs a cache-aware LinUCB router with an epoch-based cache controller. We study two variants. A fixed-epoch version provides a robust baseline with worst-case regret guarantees under arbitrary contexts. An epoch-doubling version, POLAR+, adds forced exploration and improved cache optimization to achieve $\widetilde{\mathcal{O}}(d\sqrt{NT}+\sqrt{KT})$ sublinear regret under stochastic regularity and cacheability conditions, where $N$ is the adapter count, $K$ the cache size, $d$ the context dimension, and $T$ the horizon. The routing term matches the standard contextual-bandit rate up to logarithmic factors, showing that the memory hierarchy does not fundamentally slow routing learning. Experiments using 15 real LoRA adapters for Qwen2.5-7B together with measured GPU paging latencies show that adaptive cache control substantially outperforms non-adaptive baselines and exhibits scaling trends consistent with the theory.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces POLAR for joint online caching and routing of LoRA adapters in edge LLM serving. It formulates the problem as a two-timescale contextual bandit, pairs a cache-aware LinUCB router with an epoch-based cache controller, and analyzes two variants: a fixed-epoch version with worst-case guarantees and POLAR+ with epoch-doubling that achieves sublinear regret of order Õ(d√(NT) + √(KT)) under stochastic regularity and cacheability conditions. Experiments with 15 real LoRA adapters on Qwen2.5-7B and measured GPU paging latencies show adaptive caching outperforming non-adaptive baselines.

Significance. If the regret analysis holds and the stochastic assumptions are realistic for LLM workloads, the work is significant for demonstrating that memory-hierarchy constraints need not degrade contextual-bandit routing rates beyond logarithmic factors. The explicit coupling of cache decisions to exploration cost and the use of measured paging latencies provide a concrete bridge between online learning theory and systems practice in multi-adapter edge serving.

major comments (2)
  1. [Abstract] The sublinear regret claim for POLAR+ rests on 'stochastic regularity and cacheability conditions' that are load-bearing for the bound but are only named, not defined or justified, in the abstract. The main text must supply their precise mathematical statement (e.g., lower bounds on adapter utility gaps or context-distribution regularity) together with a discussion of when they plausibly hold for real request traces; otherwise the practical scope of the Õ(d√(NT) + √(KT)) result remains unclear.
  2. [Theory section (regret analysis)] The two-timescale analysis (presumably §4): while the abstract states that the routing term matches standard LinUCB rates up to logarithmic factors, the interaction between the slow cache controller and the fast router must be shown not to introduce additional linear or super-logarithmic terms in the regret decomposition. A concrete walk-through of how the epoch-doubling schedule and forced exploration interact with the cache state is required to confirm the claimed separation; a schematic of such a schedule is sketched after the minor comments below.
minor comments (2)
  1. [Notation and preliminaries] Ensure that all symbols (N, K, d, T, and the precise definition of the utility function) are introduced with a single consistent notation block early in the paper rather than scattered across the abstract and later sections.
  2. [Experiments] The experimental claims would be strengthened by including a table or figure that reports both the observed latency overheads and the empirical regret curves side-by-side with the theoretical scaling predictions.
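As flagged in major comment 2, the doubling schedule is where theory and cost meet. A small arithmetic sketch, under an assumed logarithmic per-epoch forced-exploration budget (the budget rule is an assumption for illustration, not the paper's), shows why doubling keeps both cache updates and forced pulls sublinear:

```python
# With doubling epochs, a horizon T sees only O(log T) cache updates; a
# slowly growing per-epoch forced-exploration budget keeps total forced
# pulls a vanishing fraction of T. Budget rule below is assumed.
import math

T, N = 100_000, 15
t, epoch_len, updates, forced_total = 0, 1, 0, 0
while t < T:
    budget = math.ceil(math.log2(epoch_len + 1))   # forced pulls per adapter (assumed)
    forced_total += budget * N
    updates += 1                                   # one cache refresh per epoch boundary
    t += epoch_len
    epoch_len *= 2

print(f"cache updates: {updates} (≈ log2(T) = {math.log2(T):.1f})")
print(f"forced exploration pulls: {forced_total} of T = {T} rounds "
      f"({100 * forced_total / T:.1f}%)")
```

With these assumed numbers, a horizon of 10^5 rounds sees only 17 cache refreshes and a few percent of rounds spent on forced exploration; the referee's question is whether the paper's actual budget achieves this without inflating the Õ(√(KT)) term.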

Circularity Check

0 steps flagged

No significant circularity; regret bounds extend standard LinUCB analysis independently.

full rationale

The paper formulates the joint caching-routing problem as a two-timescale contextual bandit and derives the POLAR+ regret bound of order Õ(d√(NT) + √(KT)) by extending the standard LinUCB analysis with epoch-doubling, forced exploration, and cache terms under explicit stochastic regularity and cacheability assumptions. The routing component matches known contextual-bandit rates up to logs, and the cache controller is analyzed separately without reducing the main result to fitted parameters or self-referential definitions. No load-bearing self-citations, ansatzes smuggled via prior work, or renamings of known results appear in the derivation chain; the analysis is self-contained against external benchmarks for contextual bandits.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claims rest on standard contextual bandit assumptions plus additional stochastic regularity and cacheability conditions on adapter utilities; no free parameters or invented entities are explicitly introduced in the abstract.

axioms (2)
  • domain assumption Adapter utilities are stochastic with regularity conditions allowing sublinear regret for LinUCB-style algorithms
    Invoked to obtain the stated regret bound for the router component
  • domain assumption Cacheability conditions hold so that epoch-based updates can control paging costs without excessive regret
    Required for the cache controller analysis and the combined bound

pith-pipeline@v0.9.0 · 5611 in / 1353 out tokens · 32425 ms · 2026-05-10T08:47:15.991099+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

31 extracted references · 8 canonical work pages · 1 internal anchor

  1. [1]

    Yasin Abbasi-Yadkori, Dávid Pál, and Csaba Szepesvári. 2011. Improved Algorithms for Linear Stochastic Bandits. In Advances in Neural Information Processing Systems 24. 2312–2320

  2. [2]

    Shipra Agrawal and Navin Goyal. 2013. Thompson Sampling for Contextual Bandits with Linear Payoffs. In Proceedings of the 30th International Conference on Machine Learning. 127–135

  3. [3]

    Allan Borodin and Ran El-Yaniv. 2005. Online Computation and Competitive Analysis. Cambridge University Press

  4. [4]

    Lequn Chen, Zihao Ye, Yongji Wu, Danyang Zhuo, Luis Ceze, and Arvind Krishnamurthy. 2024. Punica: Multi-Tenant LoRA Serving. In Proceedings of the Seventh Annual Conference on Machine Learning and Systems

  5. [5]

    Wei Chu, Lihong Li, Lev Reyzin, and Robert E. Schapire. 2011. Contextual Bandits with Linear Payoff Functions. In Proceedings of the 14th International Conference on Artificial Intelligence and Statistics (AISTATS). 208–214

  6. [6]–[7]

    Weibo Chu, Xiaoyan Zhang, Xinming Jia, John CS Lui, and Zhiyong Wang. 2024. Online optimal service caching for multi-access edge computing: A constrained multi-armed bandit optimization approach. Computer Networks 246 (2024), 110395

  8. [8]

    Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. 2022. LoRA: Low-Rank Adaptation of Large Language Models. In International Conference on Learning Representations

  9. [9]

    Shaoang Li and Jian Li. 2026. Near-Optimal Online Deployment and Routing for Streaming LLMs. In The Fourteenth International Conference on Learning Representations

  10. [10]

    Suyi Li, Hanfeng Lu, Tianyuan Wu, Minchen Yu, Qizhen Weng, Xusheng Chen, Yizhou Shan, Binhang Yuan, and Wei Wang. 2024. CaraServe: CPU-assisted and rank-aware LoRA serving for generative LLM inference. arXiv preprint arXiv:2401.11240 (2024)

  11. [11]

    Yang Li. 2025. LLM Bandit: Cost-Efficient LLM Generation via Preference-Conditioned Dynamic Routing. arXiv preprint arXiv:2502.02743 (2025)

  12. [12]

    Xutong Liu, Baran Atalar, XiangXiang Dai, Jinhang Zuo, Siwei Wang, John C.S. Lui, Wei Chen, and Carlee Joe-Wong. 2026. Semantic Caching for Low-Cost LLM Serving: From Offline Learning to Online Adaptation. In IEEE International Conference on Computer Communications (INFOCOM)

  13. [13]–[14]

    Yinan Ni, Xiao Yang, Yuqi Tang, Zhimin Qiu, Chen Wang, and Tingzhou Yuan. Predictive-LoRA: A proactive and fragmentation-aware serverless inference system for LLMs. In Proceedings of the 6th International Conference on Computer Science and Management Technology. 1267–1273

  15. [15]

    Georgios S Paschos, Apostolos Destounis, and George Iosifidis. 2020. Online convex optimization for caching networks. IEEE/ACM Transactions on Networking 28, 2 (2020), 625–638

  16. [16]

    Manhin Poon, XiangXiang Dai, Xutong Liu, Fang Kong, John CS Lui, and Jinhang Zuo. 2026. Online multi-LLM selection via contextual bandits under unstructured context evolution. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 40. 24855–24863

  17. [17]

    Guocong Quan, Atilla Eryilmaz, and Ness B Shroff. 2024. Minimizing edge caching service costs through regret-optimal online learning. IEEE/ACM Transactions on Networking 32, 5 (2024), 4349–4364

  18. [18]

    Zheyu Shen, Yexiao He, Ziyao Wang, Yuning Zhang, Guoheng Sun, Wanghao Ye, and Ang Li. 2025. EdgeLoRA: An Efficient Multi-Tenant LLM Serving System on Edge Devices. In Proceedings of the 23rd Annual International Conference on Mobile Systems, Applications and Services. 138–153

  19. [19]

    Ying Sheng, Shiyi Cao, Dacheng Li, Coleman Hooper, Nicholas Lee, Shuo Yang, Christopher Chou, Banghua Zhu, Lianmin Zheng, Kurt Keutzer, et al. 2023. S-LoRA: Serving thousands of concurrent LoRA adapters. arXiv preprint arXiv:2311.03285 (2023)

  20. [20]

    Daniel D Sleator and Robert E Tarjan. 1985. Amortized efficiency of list update and paging rules. Commun. ACM 28, 2 (1985), 202–208

  21. [21]

    Yifan Sui, Hao Wang, Hanfei Yu, Yitao Hu, and Jianxun Li. 2025. ServerlessLoRA: Minimizing Latency and Cost in Serverless Inference for LoRA-Based LLMs. arXiv preprint arXiv:2505.14468 (2025)

  22. [22]

    Chunlin Tian, Xinpeng Qin, Kahou Tam, Li Li, Zijian Wang, Yuanzhe Zhao, Minglei Zhang, and Chengzhong Xu. 2025. CLONE: Customizing LLMs for Efficient Latency-Aware Inference at the Edge. In 2025 USENIX Annual Technical Conference. 563–585

  23. [23]

    Joel A Tropp. 2012. User-friendly tail bounds for sums of random matrices. Foundations of Computational Mathematics 12, 4 (2012), 389–434

  24. [24]

    Wang Wei, Tiankai Yang, Hongjie Chen, Yue Zhao, Franck Dernoncourt, Ryan A Rossi, and Hoda Eldardiry. 2025. Learning to Route LLMs from Bandit Feedback: One Policy, Many Trade-offs. arXiv preprint arXiv:2510.07429 (2025)

  25. [25]

    Bingyang Wu, Ruidong Zhu, Zili Zhang, Peng Sun, Xuanzhe Liu, and Xin Jin. 2024. dLoRA: Dynamically orchestrating requests and adapters for LoRA LLM serving. In 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI). 911–927

  26. [26]

    Minrui Xu, Dusit Niyato, and Christopher G Brinton. 2026. Serving long-context LLMs at the mobile edge: Test-time reinforcement learning-based model caching and inference offloading. IEEE Transactions on Networking (2026)

  27. [27]

    An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, Huan Lin, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jingren Zhou, Junyang Lin, Kai Dang, Keming Lu, Keqin Bao, Kexin Yang, Le Yu, Mei Li, Mingfeng Xue, Pei Zhang, Qin Zhu, Rui Men, Runji Lin, Tianhao Li, Tingyu X...

  28. [28]–[29]

    Hang Zhang, Jiuchen Shi, Yixiao Wang, Quan Chen, Yizhou Shan, and Minyi Guo. 2025. Improving the serving performance of multi-LoRA large language models via efficient LoRA and KV cache management. arXiv preprint arXiv:2505.03756 (2025)

  30. [30]

    Xianzhi Zhang, Yue Xu, Yinlin Zhu, Di Wu, Yipeng Zhou, Miao Hu, and Guocong Quan. 2026. Adapter-augmented bandits for online multi-constrained multi-modal inference scheduling. arXiv preprint arXiv:2603.06403 (2026)

  31. [31]

    Banghua Zhu, Ying Sheng, Lianmin Zheng, Clark Barrett, Michael Jordan, and Jiantao Jiao. 2023. Towards Optimal Caching and Model Selection for Large Model Inference. In Thirty-seventh Conference on Neural Information Processing Systems