pith. machine review for the scientific record.

arxiv: 2604.15357 · v1 · submitted 2026-04-11 · 💻 cs.AR · cs.AI · cs.DC

Recognition: no theorem link

Taming Asynchronous CPU-GPU Coupling for Frequency-aware Latency Estimation on Mobile Edge

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 15:44 UTC · model grok-4.3

classification 💻 cs.AR · cs.AI · cs.DC
keywords latency estimation · mobile edge computing · CPU-GPU coupling · DVFS · model inference · asynchronous parallelism · profiling reduction · deadline-aware scheduling

The pith

A layer-wise model of CPU-GPU waits predicts full inference latency at every frequency pair from only a few profiled samples.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Mobile devices change CPU and GPU speeds continuously to manage power and heat, which makes it impossible to rely on a single latency number for any given model. Exhaustive measurement across all frequency pairs is feasible for small networks but becomes days of work for small language models with varying context lengths. The paper demonstrates that measuring each layer's parallel overlap and then summing the idle gaps created when the CPU and GPU wait on each other produces accurate predictions for the entire model. Because the model is built from the bottom up, the same equations apply to both conventional neural nets and language models without retraining. The resulting estimates are accurate enough to drive a deadline-aware frequency controller that meets timing targets while using less energy than earlier methods.
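The bottom-up accounting described above can be sketched in a few lines. This is a minimal illustration, not the paper's actual equations: the per-layer quantities (CPU launch time, GPU execution time, their overlap, and the inter-layer idle "bubble") and their combination are hypothetical stand-ins.

```python
# Illustrative sketch: predict full-model latency by summing, per layer,
# the non-overlapped CPU + GPU work plus the idle bubble that appears
# when one processor waits on the other. All names and numbers are
# invented for illustration.

def layer_latency(cpu_time, gpu_time, overlap):
    """Effective wall time of one layer: total work minus the overlapped part."""
    return cpu_time + gpu_time - overlap

def model_latency(layers):
    """Aggregate per-layer latencies plus the idle gaps between them."""
    total = 0.0
    for layer in layers:
        total += layer_latency(layer["cpu"], layer["gpu"], layer["overlap"])
        total += layer.get("bubble", 0.0)  # idle wait caused by asynchrony
    return total

# Toy two-layer model (times in ms).
layers = [
    {"cpu": 0.8, "gpu": 2.0, "overlap": 0.6, "bubble": 0.1},
    {"cpu": 0.5, "gpu": 1.5, "overlap": 0.4, "bubble": 0.0},
]
print(round(model_latency(layers), 3))  # → 3.9
```

In the paper these per-layer terms are functions of the CPU and GPU frequency pair, which is what lets one set of per-layer measurements extrapolate across frequencies.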

Core claim

FLAME uses layer-wise modeling that quantifies the overlapping parallelism and then aggregates dynamic pipeline bubbles caused by asynchronous processor interactions when extending to the full model. This bottom-up approach ensures generalizability across diverse models from DNNs to SLMs, and its precise modeling allows for profiling a sparse subset of samples, cutting DNN profiling from hours to minutes and SLM profiling from days to mere minutes, while maintaining small estimation errors across frequencies.

What carries the argument

layer-wise modeling that quantifies overlapping parallelism and aggregates dynamic pipeline bubbles from asynchronous CPU kernel launches and GPU execution

If this is right

  • Profiling effort for DNNs falls from hours to minutes while keeping estimation error small across frequencies.
  • Profiling effort for SLMs falls from days to minutes while keeping estimation error small across frequencies.
  • A deadline-aware DVFS policy built on these estimates meets latency targets more reliably and at lower power than prior approaches.
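The third bullet's controller can be sketched as a simple search over frequency pairs: among those whose predicted latency meets the deadline, run at the lowest-power one. The `predict_latency` and power tables below are hypothetical stand-ins, not the paper's policy.

```python
# Illustrative deadline-aware DVFS sketch: pick the lowest-power
# (CPU, GPU) frequency pair whose *predicted* latency meets the deadline.
# Latency/power tables are invented for illustration.

def pick_frequencies(pairs, predict_latency, power, deadline_ms):
    feasible = [p for p in pairs if predict_latency(p) <= deadline_ms]
    if not feasible:
        return max(pairs, key=power)  # nothing meets it: run flat out
    return min(feasible, key=power)

# Toy tables (GHz pairs): latency falls and power rises with frequency.
latency = {(1.0, 0.6): 40.0, (1.5, 0.9): 28.0, (2.0, 1.3): 20.0}
watts   = {(1.0, 0.6): 2.0,  (1.5, 0.9): 3.5,  (2.0, 1.3): 6.0}

best = pick_frequencies(list(latency), latency.get, watts.get, deadline_ms=30.0)
print(best)  # → (1.5, 0.9): meets 30 ms at 3.5 W instead of 6 W
```

The value of an accurate estimator is exactly here: an optimistic prediction picks an infeasible pair and misses the deadline, while a pessimistic one wastes power.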

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same bottom-up accounting of processor waits could be used to estimate energy consumption rather than latency in the same hardware setting.
  • The approach suggests that frequency dependence arises mainly from coordination gaps rather than from simple per-processor scaling, so similar modeling may apply to other heterogeneous pairs such as CPU-plus-NPU.
  • Because only sparse samples are needed, the technique opens the possibility of on-device recalibration when a new model is downloaded or when hardware ages.
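The energy extension in the first bullet is speculative, but its shape is easy to see: charge active power during each processor's busy time and idle power during the bubbles. All power figures below are invented for illustration.

```python
# Hypothetical extension of the bubble accounting from latency to energy:
# active power while each processor works, idle power during the bubble.
# Power values are illustrative, not measured.

def layer_energy(cpu_s, gpu_s, bubble_s, p_cpu_w, p_gpu_w, p_idle_w):
    """Energy in joules for one layer under the toy accounting."""
    return cpu_s * p_cpu_w + gpu_s * p_gpu_w + bubble_s * p_idle_w

e = layer_energy(cpu_s=0.002, gpu_s=0.005, bubble_s=0.001,
                 p_cpu_w=3.0, p_gpu_w=8.0, p_idle_w=1.0)
print(round(e * 1000, 3), "mJ")  # → 47.0 mJ
```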

Load-bearing premise

The same layer-wise equations for parallelism and bubbles will produce small errors on new models and hardware even when only a sparse set of frequency points is measured.

What would settle it

Run the sparse-profiling procedure on a new model or device, then compare the predicted latencies against exhaustive measurements at many frequency pairs; large errors at multiple points would show the generalizability does not hold.
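That settling experiment reduces to a single comparison: predicted versus exhaustively measured latency at many frequency pairs, summarized as mean absolute percentage error (MAPE). The dictionaries below are illustrative stand-ins for real measurements.

```python
# Sketch of the settling experiment: compare predictions against an
# exhaustive measurement sweep and report MAPE. Data is invented.

def mape(measured, predicted):
    """Mean absolute percentage error over all measured frequency pairs."""
    errs = [abs(predicted[p] - measured[p]) / measured[p] for p in measured]
    return 100.0 * sum(errs) / len(errs)

measured  = {(1.0, 0.6): 40.0, (1.5, 0.9): 28.0, (2.0, 1.3): 20.0}
predicted = {(1.0, 0.6): 41.0, (1.5, 0.9): 27.0, (2.0, 1.3): 21.0}
print(round(mape(measured, predicted), 2))  # → 3.69
```

Consistently large errors at several pairs, on a model or device not used to build the equations, would falsify the generalizability premise.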

Figures

Figures reproduced from arXiv: 2604.15357 by Jiesong Chen, Jun You, Zhenjiang Li, Zhidan Liu.

Figure 1. Dynamic timing factor ∆ℓ(fc, fg) exists when a mobile edge device processes any model layer ℓ due to asynchronous interaction between its CPU and GPU. This asynchrony leads to (a) overlapping and (b) idle waiting between the CPU and GPU execution times.
Figure 3. Latency estimation error of existing methods for (a) …
Figure 5. CDF of estimating independent (a) CPU and (b) GPU …
Figure 7. Estimation error of (a) each layer in ResNet50 and (b) …
Figure 9. Estimated (Est) model-wise latency of FLAME com…
Figure 10. Deadline-aware DVFS vs. other strategies.
Figure 11. Overall inference latency estimation performance on (a) DNN models and (b) SLM models.
Figure 15. PPW of governors at different deadline rates on (a) GPT2-large, (b) Qwen2-1.5B and (c) Qwen2-7B.
Figure 17. Impact of sampling interval of (a) CPU and (b) GPU frequency, and (c) context length of GPT2.
Figure 20. Varying deadlines with (a) ResNet50 and (b) GPT2-…
Figure 21. Online adaptation for (a) ResNet50 and (b) GPT2-…
read the original abstract

Precise estimation of model inference latency is crucial for time-critical mobile edge applications, enabling devices to calculate latency margins against deadlines and trade them for enhanced model performance or resource savings. However, the ubiquity of Dynamic Voltage and Frequency Scaling (DVFS) renders traditional static profiling invalid in real-world deployments, as inference latency fluctuates with varying processor (CPU and GPU) frequencies. While extensive profiling across frequency combinations is theoretically possible, it is prohibitively expensive, particularly for emerging Small Language Models (SLMs), where variable context lengths explode the profiling up to days. We observe that simple analytic scaling fails to predict these fluctuations due to the complex asynchronous coupling between CPU (kernel launching) and GPU (execution). In this paper, we introduce FLAME to accurately estimate inference latency across frequency combinations. It features a novel layer-wise modeling that quantifies the overlapping parallelism and then aggregates dynamic pipeline bubbles caused by asynchronous processor interactions when extending to the full model. This bottom-up approach ensures generalizability across diverse models from DNNs to SLMs, and its precise modeling allows for profiling a sparse subset of samples, cutting DNN profiling from hours to minutes and SLM profiling from days to mere minutes, while maintaining small estimation errors across frequencies. We further showcase FLAME's utility in a deadline-aware DVFS, outperforming the state-of-the-art approach in both power efficiency and latency guarantees.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 3 minor

Summary. The paper introduces FLAME, a bottom-up layer-wise modeling approach for estimating inference latency of DNNs and SLMs on mobile edge devices under varying CPU/GPU frequencies. It quantifies overlapping parallelism per layer and aggregates dynamic pipeline bubbles arising from asynchronous CPU (kernel launch) and GPU (execution) interactions to predict full-model latency. The method supports sparse frequency sampling for profiling (reducing DNN profiling from hours to minutes and SLM profiling from days to minutes) while claiming small errors across frequencies, and demonstrates utility in deadline-aware DVFS that outperforms SOTA in power efficiency and latency guarantees.

Significance. If the modeling and empirical results hold, the work addresses a practical barrier in mobile edge AI: accurate latency prediction under DVFS without exhaustive profiling. The layer-wise decomposition and explicit handling of asynchrony provide a generalizable alternative to analytic scaling or black-box fits, with direct applicability to time-critical applications. The reported reduction in profiling cost and the DVFS case study are concrete strengths if supported by reproducible experiments across model classes.

major comments (2)
  1. [§4 (Modeling) or §5 (Extension to full model)] The central claim of small estimation errors and generalizability to SLMs rests on the layer-wise quantification of overlapping parallelism and bubble aggregation; without explicit equations or pseudocode for the aggregation step (likely in §4 or §5), it is difficult to verify that the pipeline-bubble model is not implicitly fitted to the evaluated frequencies rather than derived bottom-up.
  2. [§6 (Experiments) or §7 (DVFS evaluation)] The table or figure reporting cross-frequency errors (e.g., MAPE or 95th-percentile latency error) for the sparse-sampling regime must be checked against the exhaustive baseline; if the sparse subset is chosen post hoc rather than via a fixed, model-independent rule, the claimed reduction from hours or days to minutes risks being non-reproducible for new models.
minor comments (3)
  1. [Abstract] Abstract states 'small estimation errors' and 'outperforming SOTA' without any numeric values or baselines; moving at least one key quantitative result (e.g., average error or energy saving) into the abstract would improve clarity.
  2. [§3 or §4] Notation for CPU-GPU overlap and bubble duration should be defined once with consistent symbols; currently the description mixes 'overlapping parallelism' and 'pipeline bubbles' without a single equation linking them.
  3. [§6] Ensure all evaluated models, context lengths for SLMs, frequency ranges, and hardware platforms are listed in a single table for reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which help improve the clarity and reproducibility of our work. We address each major comment below and have prepared revisions to strengthen the manuscript.

read point-by-point responses
  1. Referee: [§4 (Modeling) or §5 (Extension to full model)] The central claim of small estimation errors and generalizability to SLMs rests on the layer-wise quantification of overlapping parallelism and bubble aggregation; without explicit equations or pseudocode for the aggregation step (likely in §4 or §5), it is difficult to verify that the pipeline-bubble model is not implicitly fitted to the evaluated frequencies rather than derived bottom-up.

    Authors: We appreciate the referee's emphasis on verifiability. Section 4 already presents the layer-wise equations quantifying overlapping parallelism between CPU kernel launches and GPU executions. Section 5 describes the bottom-up aggregation of dynamic pipeline bubbles arising from asynchronous CPU-GPU interactions across layers. To make the derivation fully transparent and address the concern about potential implicit fitting, we will add explicit aggregation equations and pseudocode in the revised manuscript. These additions will demonstrate that the model is analytically constructed from per-layer measurements rather than tuned to the evaluated frequency points. revision: yes

  2. Referee: [§6 (Experiments) or §7 (DVFS evaluation)] Table or figure reporting cross-frequency errors (e.g., MAPE or 95th-percentile latency error) for the sparse-sampling regime must be checked against the exhaustive baseline; if the sparse subset is chosen post-hoc rather than via a fixed, model-independent rule, the claimed reduction from hours/days to minutes risks being non-reproducible for new models.

    Authors: We agree that a clear, reproducible selection rule is necessary. The sparse subset follows a fixed, model-independent rule: frequencies are sampled at uniform intervals (every 200 MHz for both CPU and GPU, yielding a sparse grid independent of any model-specific characteristics). We will insert a new table in Section 6 that directly compares MAPE and 95th-percentile latency errors between the sparse-sampling regime and the exhaustive baseline for all DNNs and SLMs. This addition will confirm both the accuracy and the substantial reduction in profiling time while ensuring the method can be applied to new models without post-hoc adjustments. revision: yes
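The fixed rule the rebuttal describes (uniform 200 MHz steps for both processors, independent of the model) can be stated in a few lines; the frequency ranges below are illustrative, not the evaluated hardware's actual limits.

```python
# Sketch of the model-independent sparse-sampling rule from the rebuttal:
# uniform 200 MHz steps over each processor's range, then the cross
# product of the two grids. Ranges are invented for illustration.

def sparse_grid(f_min_mhz, f_max_mhz, step_mhz=200):
    """Frequencies at fixed intervals, inclusive of both endpoints if aligned."""
    return list(range(f_min_mhz, f_max_mhz + 1, step_mhz))

cpu_points = sparse_grid(600, 2200)   # 600, 800, ..., 2200 MHz
gpu_points = sparse_grid(300, 1300)   # 300, 500, ..., 1300 MHz
pairs = [(c, g) for c in cpu_points for g in gpu_points]
print(len(cpu_points), len(gpu_points), len(pairs))  # → 9 6 54
```

Because the grid depends only on the hardware's frequency range, the same rule applies unchanged to any new model, which is the reproducibility property the referee asked for.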

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper presents a bottom-up layer-wise modeling approach that quantifies overlapping parallelism between CPU and GPU and aggregates frequency-dependent pipeline bubbles arising from their asynchronous coupling. This construction is described as independent and directly enabling generalizability across models and sparse sampling, with no equations, definitions, or steps that reduce predictions to fitted inputs by construction, no self-citation load-bearing premises, and no renaming of known results or smuggled ansatzes. The derivation chain remains self-contained against the stated modeling principles.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Review performed on abstract only; no equations, parameters, or explicit assumptions are stated in the provided text, so the ledger remains empty. The modeling is presented as a novel bottom-up construction but its internal axioms and free parameters cannot be audited.

pith-pipeline@v0.9.0 · 5555 in / 1255 out tokens · 51818 ms · 2026-05-10T15:44:42.719032+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

59 extracted references · 2 canonical work pages
