pith. sign in

arxiv: 2502.15761 · v3 · pith:6ZXG25DUnew · submitted 2025-02-13 · 💻 cs.DC · cs.AI· cs.GR· cs.HC

AIvaluateXR: An Evaluation Framework for on-Device AI in XR with Benchmarking Results

Pith reviewed 2026-05-23 03:16 UTC · model grok-4.3

classification 💻 cs.DC cs.AIcs.GRcs.HC
keywords on-device LLMsXR devicesbenchmarking frameworkPareto optimalityperformance evaluationreal-time applicationsdevice-model pairs
0
0 comments X

The pith

AIvaluateXR applies 3D Pareto optimality to rank optimal LLM and XR device pairs from quality and speed objectives.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents AIvaluateXR, a framework for benchmarking large language models running directly on extended reality devices. It deploys 17 models on four platforms and records performance consistency, processing speed, memory usage, and battery consumption across string lengths, batch sizes, and thread counts. The central method uses 3D Pareto Optimality theory to identify the best model-device pairs that trade off quality against speed for real-time use. The work also contrasts on-device inference efficiency with client-server and cloud setups and checks accuracy on two interactive tasks. The findings aim to guide optimization of LLM deployments in XR.

Core claim

We propose a unified evaluation method based on the 3D Pareto Optimality theory to select the optimal device-model pairs from quality and speed objectives. The framework measures four metrics for each of the 68 model-device pairs and analyzes trade-offs for real-time XR applications, while also comparing on-device performance to client-server and cloud setups and evaluating accuracy on interactive tasks.

What carries the argument

3D Pareto Optimality theory applied to the four metrics of performance consistency, processing speed, memory usage, and battery consumption to rank model-device pairs.

If this is right

  • Optimal device-model pairs can be selected for real-time XR applications from the 68 combinations.
  • Efficiency of on-device LLMs can be directly compared against client-server and cloud-based setups.
  • Accuracy of the models can be assessed on two interactive tasks using the same evaluation setup.
  • The method can serve as standard groundwork for further research on LLM deployment on XR devices.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same four-metric approach could be applied to evaluate non-LLM AI models on XR hardware.
  • Hardware vendors might use the ranked pairs to target improvements in battery or memory for AI workloads.
  • Application developers could adopt the framework to decide when on-device inference meets their latency needs versus offloading.

Load-bearing premise

The four measured metrics are assumed to be sufficient to determine suitability for real-time XR applications.

What would settle it

A test showing that a pair ranked highest by the 3D Pareto method delivers unacceptable latency or instability during live interactive XR use outside the measured conditions.

Figures

Figures reproduced from arXiv: 2502.15761 by Alexandre Kouyoumdjian, Dawar Khan, Donggang Jia, Ivan Viola, Omar Mena, Xinyu Liu.

Figure 1
Figure 1. Figure 1: Overview of our research process. The rectangular boxes rep￾resent the procedures, while the ovals indicate the outcomes. We deployed 17 LLMs on four XR devices. First, we identified the key factors affecting performance. Based on these factors, we defined our evaluation criteria and conducted the experiments. Finally, we applied 3D Pareto analysis, which provided the Pareto front to highlight the best-per… view at source ↗
Figure 2
Figure 2. Figure 2: Model Quality Analysis: The models were evaluated based on six [PITH_FULL_IMAGE:figures/full_fig_p005_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Consistency results of the four devices over time: PP (top) and TG (bottom) speeds in tokens per second across 20 sorted runs (X-axis: run number, Y-axis: speed in t/s). 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 Run 100 150 200 250 300 350 400 450 Performance Speed (t/s) Apple Vision Pro (GPU)- PP Results 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 Run 10 15 20 25 30 Apple Vision Pro (GPU)-… view at source ↗
Figure 4
Figure 4. Figure 4: Consistency results of the AVP on GPU. Left: PP results. Right: TG results. Each model was tested 20 times, and the results are plotted in sorted order. consumption, while the other three devices did not show signifi￾cant variations with changing model sizes. Given this observation, instead of testing all 17 models, we conducted battery tests on selected models. Specifically, we chose m1 as the sole model … view at source ↗
Figure 5
Figure 5. Figure 5: Processing speed of the four devices in the PP test (top) and TG test (bottom) with varying string lengths: 64, 128, 256, 512, and 1024. The x-axis represents the model, while the y-axis represents the processing speed in t/s. For m1, a different scale is used due to its higher values. 5.1 Model Quality Analysis As shown in [PITH_FULL_IMAGE:figures/full_fig_p007_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Batch Test results: Batch size (X-axis) vs. processing speed (Y-axis). 1 4 8 16 32 Threads 0 5 10 15 20 25 30 Processing Speed (t/s) Magic Leap 2 1 4 8 16 32 Threads 0 5 10 15 20 25 30 Meta Quest 3 1 4 8 16 32 Threads 0 5 10 15 20 25 30 Vivo X100 Pro 1 4 8 16 32 Threads 0 5 10 15 20 25 30 Apple Vision Pro (CPU) 1 4 8 16 32 Threads 70 90 110 130 150 170 190 210 230 Apple Vision Pro (GPU) m2 m3 m4 m5 m6 m7 m… view at source ↗
Figure 7
Figure 7. Figure 7: Thread Test (TT) results: The X-axis represents the thread count, while the Y-axis shows the speed in tokens per second. Threads 4 and 8 deliver the fastest results. Note: AVP (CPU) fails for thread counts of 16 and 32. significantly lower processing speeds, with higher variability and instability, especially in TG. Model size also impacts processing speed, as smaller models are faster. Here, m1 is the fas… view at source ↗
Figure 8
Figure 8. Figure 8: Error counts for Meta Quest 3 (left) and AVP (right) in PP (top) and TG (bottom). Magic Leap 2, with zero errors, and Vivo X100 Pro, with a total of four errors, one in m1 (PP-64), two in m10 (PP-64, PP￾128), and one in m11 (TG-1024) are excluded. overhead. AVP (GPU), with its strong computational resources, does not show a significant performance drop with varying batch sizes, although there are some fluc… view at source ↗
Figure 9
Figure 9. Figure 9: Memory consumption for each model-device pair. The results represent the mean values across batch sizes 128, 256, 512, and 1024. 512, and 1024. As expected, memory consumption varies across models, with most showing consistent usage across different de￾vices. Generally, within each model series, memory consumption increases with model size. However, minor exceptions exist, such as m7 on AVP (GPU), which co… view at source ↗
Figure 11
Figure 11. Figure 11: Pareto fronts for various device-model pairs, with different col￾ors representing distinct device types. Performance is evaluated based on processing speed, memory consumption, and battery consumption. Stability is assessed using the CV and error count values of PP and TG. Quality is measured through evaluation results on six benchmark datasets. 5.7 Pareto Front Results Since we use both CPU and GPU for i… view at source ↗
read the original abstract

The deployment of large language models (LLMs) on extended reality (XR) devices has great potential to advance the field of human-AI interaction. In the case of direct, on-device model inference, selecting the appropriate model and device for specific tasks remains challenging. In this paper, we present AIvaluateXR, a comprehensive evaluation framework for benchmarking LLMs running on XR devices. To demonstrate the framework, we deploy 17 selected LLMs across four XR platforms: Magic Leap 2, Meta Quest 3, Vivo X100s Pro, and Apple Vision Pro, and conduct an extensive evaluation. Our experimental setup measures four key metrics: performance consistency, processing speed, memory usage, and battery consumption. For each of the 68 model-device pairs, we assess performance under varying string lengths, batch sizes, and thread counts, analyzing the trade-offs for real-time XR applications. We propose a unified evaluation method based on the 3D Pareto Optimality theory to select the optimal device-model pairs from quality and speed objectives. Additionally, we compare the efficiency of on-device LLMs with client-server and cloud-based setups, and evaluate their accuracy on two interactive tasks. We believe our findings offer valuable insight to guide future optimization efforts for LLM deployment on XR devices. Our evaluation method can be used as standard groundwork for further research and development in this emerging field. The source code and supplementary materials are available at: www.nanovis.org/AIvaluateXR.html

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The paper presents AIvaluateXR, a benchmarking framework for on-device LLMs in XR. It evaluates 17 models across four platforms (Magic Leap 2, Meta Quest 3, Vivo X100s Pro, Apple Vision Pro) using four metrics—performance consistency, processing speed, memory usage, and battery consumption—under varying string lengths, batch sizes, and thread counts. The authors propose a 3D Pareto Optimality method to select optimal device-model pairs from quality and speed objectives, compare on-device inference to client-server and cloud setups, and report accuracy on two interactive tasks. Source code is made available.

Significance. If the 3D Pareto selection method and benchmarking results are robust, the work supplies a practical, open-source framework for identifying suitable on-device LLM configurations for real-time XR, addressing a gap in human-AI interaction tooling. The explicit comparison to non-on-device baselines and the release of code and materials add reproducibility value.

major comments (1)
  1. [Abstract] Abstract: The central claim of a 'unified evaluation method based on the 3D Pareto Optimality theory to select the optimal device-model pairs from quality and speed objectives' is load-bearing, yet the four reported metrics are exclusively efficiency-oriented while accuracy on the two interactive tasks is described separately. If the Pareto surface is constructed only from the efficiency quartet, the quality objective is not demonstrably represented in the three-dimensional optimality criterion.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the detailed and constructive review. We address the single major comment below.

read point-by-point responses
  1. Referee: [Abstract] Abstract: The central claim of a 'unified evaluation method based on the 3D Pareto Optimality theory to select the optimal device-model pairs from quality and speed objectives' is load-bearing, yet the four reported metrics are exclusively efficiency-oriented while accuracy on the two interactive tasks is described separately. If the Pareto surface is constructed only from the efficiency quartet, the quality objective is not demonstrably represented in the three-dimensional optimality criterion.

    Authors: We agree that the abstract phrasing is imprecise and creates the impression that accuracy (task-level quality) is integrated into the 3D Pareto criterion. In the manuscript the Pareto surface is in fact constructed from three efficiency metrics (processing speed, memory usage, battery consumption), with performance consistency used only as a post-filter for stable runs; the two interactive-task accuracy results are reported separately and are not part of the optimality selection. We will revise the abstract (and the corresponding sentence in Section 4) to state that the method selects pairs according to efficiency objectives, while accuracy is evaluated independently. The revision will be incorporated in the next version. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical benchmarking applies standard Pareto to measured data

full rationale

The paper is an empirical benchmarking study that measures four efficiency metrics across 68 model-device pairs under varying conditions and then applies the established 3D Pareto Optimality concept (a known external mathematical framework) to select pairs based on the observed trade-offs. No equations, fitted parameters, or derivations are presented that reduce any claimed result to its own inputs by construction. Accuracy on interactive tasks is evaluated separately and not folded into any self-referential optimality surface. No self-citations are load-bearing for the central method, and the work remains self-contained against external benchmarks without renaming known results or smuggling ansatzes. This matches the default non-circular outcome for straightforward empirical papers.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The paper is an empirical benchmarking study. No free parameters are fitted to produce the central result. No new entities are postulated. The main assumption is that the chosen metrics adequately represent real-time XR suitability.

axioms (1)
  • domain assumption The four key metrics (performance consistency, processing speed, memory usage, battery consumption) sufficiently capture trade-offs for real-time XR applications.
    This premise underpins the experimental design and the claim that the Pareto method selects optimal pairs.

pith-pipeline@v0.9.0 · 5822 in / 1294 out tokens · 23882 ms · 2026-05-23T03:16:29.437363+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. ClickAIXR: On-Device Multimodal Vision-Language Interaction with Real-World Objects in Extended Reality

    cs.CV 2026-04 unverdicted novelty 5.0

    ClickAIXR combines controller-based object selection in XR with on-device VLM inference to enable private, precise multimodal queries about real objects.

Reference graph

Works this paper leans on

61 extracted references · 61 canonical work pages · cited by 1 Pith paper · 5 internal anchors

  1. [1]

    When XR and AI meet - a scoping review on extended reality and artificial intelligence,

    T. Hirzle, F. M ¨uller, F. Draxler, M. Schmitz, P. Knierim, and K. Hornbæk, “When XR and AI meet - a scoping review on extended reality and artificial intelligence,” in Proc. of the 2023 CHI Conference on Human Factors in Computing Systems, ser. CHI ’23. New York, NY , USA: Association for Computing Machinery, 2023. [Online]. Available: https://doi.org/10...

  2. [2]

    A review on edge large language models: Design, execution, and applications,

    Y . Zheng, Y . Chen, B. Qian, X. Shi, Y . Shu, and J. Chen, “A review on edge large language models: Design, execution, and applications,” ACM Comput. Surv. , vol. 57, no. 8, Mar. 2025. [Online]. Available: https://doi.org/10.1145/3719664

  3. [3]

    LLM integration in extended reality: A comprehensive review of current trends, challenges, and future perspectives,

    Y . Tang, J. Situ, A. Y . Cui, M. Wu, and Y . Huang, “LLM integration in extended reality: A comprehensive review of current trends, challenges, and future perspectives,” in Proceedings of the 2025 CHI Conference on Human Factors in Computing Systems , ser. CHI ’25. New York, NY , USA: Association for Computing Machinery, 2025. [Online]. Available: https:...

  4. [4]

    Augmented reality and artificial intelligence in industry: Trends, tools, and future challenges,

    J. S. Devagiri, S. Paheding, Q. Niyaz, X. Yang, and S. Smith, “Augmented reality and artificial intelligence in industry: Trends, tools, and future challenges,” Expert Systems with Applications , vol. 207, p. 118002, 2022. [Online]. Available: https://www.sciencedirect.com/ science/article/pii/S0957417422012246

  5. [5]

    Embedding large language models into extended reality: Opportunities and challenges for inclusion, engagement, and privacy,

    E. Bozkir, S. ¨Ozdel, K. H. C. Lau, M. Wang, H. Gao, and E. Kasneci, “Embedding large language models into extended reality: Opportunities and challenges for inclusion, engagement, and privacy,” in ACM Conference on Conversational User Interfaces , ser. CUI ’24. New York, NY , USA: ACM, 2024. [Online]. Available: https://doi.org/10.1145/3640794.3665563

  6. [6]

    Everyday AR through AI-in-the-loop,

    R. Suzuki, M. Gonzalez-Franco, M. Sra, and D. Lindlbauer, “Everyday AR through AI-in-the-loop,” in Proceedings of the Extended Abstracts of the CHI Conference on Human Factors in Computing Systems , ser. CHI EA ’25. New York, NY , USA: Association for Computing Machinery,

  7. [7]

    Available: https://doi.org/10.1145/3706599.3706741

    [Online]. Available: https://doi.org/10.1145/3706599.3706741

  8. [8]

    A brief overview of ChatGPT: The history, status quo and potential future development,

    T. Wu, S. He, J. Liu, S. Sun, K. Liu, Q.-L. Han, and Y . Tang, “A brief overview of ChatGPT: The history, status quo and potential future development,” IEEE/CAA Journal of Automatica Sinica , vol. 10, no. 5, pp. 1122–1136, 2023

  9. [9]

    A comparison of grammatical error correction models in english writ- ing,

    K. Eker, M. K. Pehlivano ˘glu, A. G. Eker, M. A. Syakura, and N. Duru, “A comparison of grammatical error correction models in english writ- ing,” in 2023 8th International Conference on Computer Science and Engineering (UBMK), 2023, pp. 218–223

  10. [10]

    The comparison of two automated feedback approaches based on automated analysis of the online asynchronous interaction: a case of massive online teacher training,

    I. Adeshola and A. P. Adepoju, “The opportunities and challenges of ChatGPT in education,” Interactive Learning Environments , vol. 0, no. 0, pp. 1–14, 2023. [Online]. Available: https: //doi.org/10.1080/10494820.2023.2253858

  11. [11]

    Investigating code generation performance of ChatGPT with crowd- sourcing social data,

    Y . Feng, S. Vanam, M. Cherukupally, W. Zheng, M. Qiu, and H. Chen, “Investigating code generation performance of ChatGPT with crowd- sourcing social data,” in 2023 IEEE 47th Annual Computers, Software, and Applications Conference (COMPSAC), 2023, pp. 876–885

  12. [12]

    Open3DSG: Open-vocabulary 3D scene graphs from point clouds with queryable objects and open-set relationships,

    S. Koch, N. Vaskevicius, M. Colosi, P. Hermosilla, and T. Ropinski, “Open3DSG: Open-vocabulary 3D scene graphs from point clouds with queryable objects and open-set relationships,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024, pp. 14 183–14 193

  13. [13]

    VOICE: Visual oracle for interac- tion, conversation, and explanation,

    D. Jia, A. Irger, L. Besanc ¸on, O. Strnad, D. Luo, J. Bj ¨orklund, A. Kouy- oumdjian, A. Ynnerman, and I. Viola, “VOICE: Visual oracle for interac- tion, conversation, and explanation,” IEEE Transactions on Visualization and Computer Graphics, pp. 1–18, 2025

  14. [14]

    Chat modeling: Natural language-based procedural modeling of biological structures without training,

    D. Jia, Y . Wang, and I. Viola, “Chat modeling: Natural language-based procedural modeling of biological structures without training,” 2024

  15. [15]

    XaiR: An XR platform that integrates large language models with the physical world,

    S. Srinidhi, E. Lu, and A. Rowe, “XaiR: An XR platform that integrates large language models with the physical world,” in IEEE Int. Symposium on Mixed and Augmented Reality (ISMAR) , 2024, pp. 759–767

  16. [16]

    Mo- bileLLM: Optimizing sub-billion parameter language models for on- device use cases,

    Z. Liu, C. Zhao, F. Iandola, C. Lai, Y . Tian, I. Fedorov, Y . Xiong, E. Chang, Y . Shi, R. Krishnamoorthi, L. Lai, and V . Chandra, “Mo- bileLLM: Optimizing sub-billion parameter language models for on- device use cases,” in Proceedings of ICML, 2024, pp. 32 431–32 454

  17. [17]

    Immersive tailoring of embodied agents using large language models,

    A. Bellucci, G. Jacucci, K. D. Trung, P. K. Das, S. V . Smirnov, I. Ahmed, and J.-L. Lugrin, “Immersive tailoring of embodied agents using large language models,” in2025 IEEE Conference Virtual Reality and 3D User Interfaces (VR), 2025, pp. 392–400

  18. [18]

    LLMER: Crafting interactive extended reality worlds with json data generated by large language models,

    J. Chen, X. Wu, T. Lan, and B. Li, “LLMER: Crafting interactive extended reality worlds with json data generated by large language models,” IEEE Transactions on Visualization and Computer Graphics , vol. 31, no. 5, p. 2715–2724, Mar. 2025. [Online]. Available: https://doi.org/10.1109/TVCG.2025.3549549

  19. [19]

    Next generation XR systems—large language models meet augmented and virtual reality,

    M. Z. Afzal, S. A. Ali, D. Stricker, P. Eisert, A. Hilsmann, D. Perez- Marcos, M. Bianchi, S. Crottaz-Herbette, R. De Ioris, E. Mangina, M. Sanguineti, A. Salaberria, O. L. de Lacalle, A. Garc ´ıa-Pablos, and M. Cuadros, “Next generation XR systems—large language models meet augmented and virtual reality,” IEEE Computer Graphics and Applications, vol. 45,...

  20. [20]

    Teachers’ vocal expressions and student engagement in asynchronous video learning,

    H. Gao, Y . Xie, and E. K. and, “PerVRML: ChatGPT-driven personalized VR environments for machine learning education,”International Journal of Human–Computer Interaction, vol. 0, no. 0, pp. 1–15, 2025. [Online]. Available: https://doi.org/10.1080/10447318.2025.2504188

  21. [21]

    Building llm-based AI agents in social virtual reality,

    H. Wan, J. Zhang, A. A. Suria, B. Yao, D. Wang, Y . Coady, and M. Prpa, “Building llm-based AI agents in social virtual reality,” in Extended Abstracts of the CHI Conference on Human Factors in Computing Systems , ser. CHI EA ’24. New York, NY , USA: Association for Computing Machinery, 2024. [Online]. Available: https://doi.org/10.1145/3613905.3651026

  22. [22]

    Exploring large language model- driven agents for environment-aware spatial interactions and conversa- tions in virtual reality role-play scenarios,

    Z. Li, H. Zhang, C. Peng, and R. Peiris, “Exploring large language model- driven agents for environment-aware spatial interactions and conversa- tions in virtual reality role-play scenarios,” in 2025 IEEE Conference Virtual Reality and 3D User Interfaces (VR), 2025, pp. 1–11

  23. [23]

    Generative role-play communication training in virtual reality for autistic individuals: A study on job coach experiences in vocational training programs,

    Z. Li, P. P. Babar, and R. L. Peiris, “Generative role-play communication training in virtual reality for autistic individuals: A study on job coach experiences in vocational training programs,” in Proceedings of the 2025 CHI Conference on Human Factors in Computing Systems, ser. CHI ’25. New York, NY , USA: Association for Computing Machinery, 2025. [Onl...

  24. [24]

    Interactive augmented reality storytelling guided by scene semantics,

    C. Li, W. Li, H. Huang, and L.-F. Yu, “Interactive augmented reality storytelling guided by scene semantics,” ACM Trans. Graph., vol. 41, no. 4, Jul. 2022. [Online]. Available: https: //doi.org/10.1145/3528223.3530061

  25. [25]

    Object-driven narrative in AR: A scenario-metaphor framework with VLM integration,

    Y . Sun, H. Guan, leith Kin Yep Chan, and Y . H. Kuo, “Object-driven narrative in AR: A scenario-metaphor framework with VLM integration,”

  26. [26]

    Available: https://arxiv.org/abs/2504.13119

    [Online]. Available: https://arxiv.org/abs/2504.13119

  27. [27]

    A large model’s ability to identify 3d objects as a function of viewing angle,

    J. Rubinstein, F. Ferraro, C. Matuszek, and D. Engel, “A large model’s ability to identify 3d objects as a function of viewing angle,” in 2024 IEEE International Conference on Artificial Intelligence and eXtended and Virtual Reality (AIxVR), 2024, pp. 281–288

  28. [28]

    LLMR: Real-time prompting of interactive worlds using large language models,

    F. De La Torre, C. M. Fang, H. Huang, A. Banburski-Fahey, J. Amores Fernandez, and J. Lanier, “LLMR: Real-time prompting of interactive worlds using large language models,” in Proc. of the CHI Conference on Human Factors in Computing Systems, 2024, pp. 1–22

  29. [29]

    MagicItem: Dynamic behavior design of virtual objects with large language models in a commercial metaverse platform,

    R. Kurai, T. Hiraki, Y . Hiroi, Y . Hirao, M. Perusqu ´ıa-Hern´andez, H. Uchiyama, and K. Kiyokawa, “MagicItem: Dynamic behavior design of virtual objects with large language models in a commercial metaverse platform,” IEEE Access, vol. 13, pp. 19 132–19 143, 2025

  30. [30]

    Odoragent: Generate odor sequences for movies based on large language model,

    Y . Zhang, P. Gao, F. Kang, J. Li, J. Liu, Q. Lu, and Y . Xu, “Odoragent: Generate odor sequences for movies based on large language model,” in 2024 IEEE Conference Virtual Reality and 3D User Interfaces (VR) . IEEE, 2024, pp. 105–114

  31. [31]

    Text2vrscene: Exploring the framework of automated text-driven generation system for vr experi- ence,

    Z. Yin, Y . Wang, T. Papatheodorou, and P. Hui, “Text2vrscene: Exploring the framework of automated text-driven generation system for vr experi- ence,” in 2024 IEEE Conference Virtual Reality and 3D User Interfaces (VR). IEEE, 2024, pp. 701–711

  32. [32]

    Supporting text entry in virtual reality with large language models,

    L. Chen, Y . Cai, R. Wang, S. Ding, Y . Tang, P. Hansen, and L. Sun, “Supporting text entry in virtual reality with large language models,” in 2024 IEEE Conference Virtual Reality and 3D User Interfaces (VR) . IEEE, 2024, pp. 524–534

  33. [33]

    Dreamcodevr: Towards democratizing behavior design in virtual reality with speech-driven programming,

    D. Giunchi, N. Numan, E. Gatti, and A. Steed, “Dreamcodevr: Towards democratizing behavior design in virtual reality with speech-driven programming,” in 2024 IEEE Conference Virtual Reality and 3D User Interfaces (VR). IEEE, 2024, pp. 579–589

  34. [34]

    Building llm-based ai agents in social virtual reality,

    H. Wan, J. Zhang, A. A. Suria, B. Yao, D. Wang, Y . Coady, and M. Prpa, “Building llm-based ai agents in social virtual reality,” in Extended Abstracts of the CHI Conference on Human Factors in Computing Systems, 2024, pp. 1–7

  35. [35]

    A web application with multi-input capabilities for AI-driven 3D object generation, designed for XR applications,

    M. S. Kawka, A. Nikopoulos, P. Bikiris, I. Kalisperakis, and C. Sten- toumis, “A web application with multi-input capabilities for AI-driven 3D object generation, designed for XR applications,” in 2025 IEEE Conference on Virtual Reality and 3D User Interfaces Abstracts and Workshops (VRW), 2025, pp. 447–454

  36. [36]

    Character animation pipeline based on latent diffusion and large language models,

    A. Clocchiatti, N. Fumer `o, and A. M. Soccini, “Character animation pipeline based on latent diffusion and large language models,” in 2024 IEEE International Conference on Artificial Intelligence and eXtended and Virtual Reality (AIxVR), 2024, pp. 398–405

  37. [37]

    Instant3D: Instant text-to-3d generation,

    M. Li, P. Zhou, J.-W. Liu, J. Keppo, M. Lin, S. Yan, and X. Xu, “Instant3D: Instant text-to-3d generation,” Int. J. Comput. Vision , JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2021 14 vol. 132, no. 10, p. 4456–4472, May 2024. [Online]. Available: https://doi.org/10.1007/s11263-024-02097-5

  38. [38]

    Dream mesh: A speech- to-3D model generative pipeline in mixed reality,

    S. C.-C. Weng, Y .-M. Chiou, and E. Y .-L. Do, “Dream mesh: A speech- to-3D model generative pipeline in mixed reality,” in 2024 IEEE Inter- national Conference on Artificial Intelligence and eXtended and Virtual Reality (AIxVR), 2024, pp. 345–349

  39. [39]

    In: 2023 IEEE/CVF International Conference on Computer Vision (ICCV)

    R. Liu, R. Wu, B. Van Hoorick, P. Tokmakov, S. Zakharov, and C. V ondrick, “ Zero-1-to-3: Zero-shot One Image to 3D Object ,” in 2023 IEEE/CVF International Conference on Computer Vision (ICCV) . Los Alamitos, CA, USA: IEEE Computer Society, Oct. 2023, pp. 9264–9275. [Online]. Available: https://doi.ieeecomputersociety.org/10. 1109/ICCV51070.2023.00853

  40. [40]

    MDD: masked deconstructed diffusion for 3D human motion generation from text,

    J. Chen, F. Liu, and Y . Wang, “MDD: masked deconstructed diffusion for 3D human motion generation from text,” in 2025 IEEE International Conference on Artificial Intelligence and eXtended and Virtual Reality (AIxVR), 2025, pp. 61–72

  41. [41]

    Generative ai for context- aware 3D object creation using vision-language models in augmented reality,

    M. Behravan, K. Matkovi ´c, and D. Gra ˇcanin, “Generative ai for context- aware 3D object creation using vision-language models in augmented reality,” in2025 IEEE International Conference on Artificial Intelligence and eXtended and Virtual Reality (AIxVR), 2025, pp. 73–81

  42. [42]

    Mobile edge intelligence for large language models: A contemporary survey,

    G. Qu, Q. Chen, W. Wei, Z. Lin, X. Chen, and K. Huang, “Mobile edge intelligence for large language models: A contemporary survey,” arXiv preprint arXiv:2407.18921, 2024

  43. [43]

    Opti- mize weight rounding via signed gradient descent for the quantization of llms,

    W. Cheng, W. Zhang, H. Shen, Y . Cai, X. He, K. Lv, and Y . Liu, “Opti- mize weight rounding via signed gradient descent for the quantization of llms,” arXiv preprint arXiv:2309.05516, 2023

  44. [44]

    Llm-pruner: On the structural pruning of large language models,

    X. Ma, G. Fang, and X. Wang, “Llm-pruner: On the structural pruning of large language models,” Advances in neural information processing systems, vol. 36, pp. 21 702–21 720, 2023

  45. [45]

    Minillm: Knowledge distillation of large language models,

    Y . Gu, L. Dong, F. Wei, and M. Huang, “Minillm: Knowledge distillation of large language models,” in The Twelfth International Conference on Learning Representations, 2024

  46. [46]

    KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache

    Z. Liu, J. Yuan, H. Jin, S. Zhong, Z. Xu, V . Braverman, B. Chen, and X. Hu, “Kivi: A tuning-free asymmetric 2bit quantization for kv cache,” arXiv preprint arXiv:2402.02750, 2024

  47. [47]

    Mobilellm: Optimizing sub-billion parameter language models for on-device use cases,

    Z. Liu, C. Zhao, F. Iandola, C. Lai, Y . Tian, I. Fedorov, Y . Xiong, E. Chang, Y . Shi, R. Krishnamoorthi et al. , “Mobilellm: Optimizing sub-billion parameter language models for on-device use cases,” arXiv preprint arXiv:2402.14905, 2024

  48. [48]

    Introducing llama: A foundational, 65-billion-parameter large language model,

    AI-Meta, “Introducing llama: A foundational, 65-billion-parameter large language model,” Meta AI, 2023

  49. [49]

    Hugging Face: The AI community building the future. www. huggingface.co,

    H. Face, “Hugging Face: The AI community building the future. www. huggingface.co,” 2024, (Accessed: Sep. 10, 2024)

  50. [50]

    HellaSwag: Can a Machine Really Finish Your Sentence?

    R. Zellers, A. Holtzman, Y . Bisk, A. Farhadi, and Y . Choi, “Hellaswag: Can a machine really finish your sentence?” 2019. [Online]. Available: https://arxiv.org/abs/1905.07830

  51. [51]

    Measuring Massive Multitask Language Understanding

    D. Hendrycks, C. Burns, S. Basart, A. Zou, M. Mazeika, D. Song, and J. Steinhardt, “Measuring massive multitask language understanding,” arXiv preprint arXiv:2009.03300, 2020

  52. [52]

    Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge

    P. Clark, I. Cowhey, O. Etzioni, T. Khot, A. Sabharwal, C. Schoenick, and O. Tafjord, “Think you have solved question answering? try arc, the ai2 reasoning challenge,” 2018. [Online]. Available: https://arxiv.org/abs/1803.05457

  53. [53]

    Truthfulqa: Measuring how models mimic human falsehoods,

    S. Lin, J. Hilton, and O. Evans, “Truthfulqa: Measuring how models mimic human falsehoods,” 2021

  54. [54]

    Winogrande: an adversarial winograd schema challenge at scale

    K. Sakaguchi, R. L. Bras, C. Bhagavatula, and Y . Choi, “Winogrande: an adversarial winograd schema challenge at scale,” Commun. ACM, vol. 64, no. 9, p. 99–106, Aug. 2021. [Online]. Available: https://doi.org/10.1145/3474381

  55. [55]

    Pointer Sentinel Mixture Models

    S. Merity, C. Xiong, J. Bradbury, and R. Socher, “Pointer sentinel mixture models,” arXiv preprint arXiv:1609.07843, 2016

  56. [56]

    A survey of recent developments in multiobjective optimization,

    A. Chinchuluun and P. M. Pardalos, “A survey of recent developments in multiobjective optimization,” Annals of Operations Research , vol. 154, no. 1, pp. 29–50, 2007

  57. [57]

    Approaches for multi-objective optimization in the ecodesign of electric systems,

    S. Brisset and F. Gillon, “Approaches for multi-objective optimization in the ecodesign of electric systems,” Eco-friendly innovation in electricity transmission and distribution networks, pp. 83–97, 2015

  58. [58]

    Design- space exploration of pareto-optimal architectures for deep learning with dvfs,

    G. Santoro, M. R. Casu, V . Peluso, A. Calimera, and M. Alioto, “Design- space exploration of pareto-optimal architectures for deep learning with dvfs,” in 2018 IEEE International Symposium on Circuits and Systems (ISCAS). IEEE, 2018, pp. 1–5

  59. [59]

    Profiling apple silicon performance for ml training,

    D. Feng, Z. Xu, R. Wang, and F. X. Lin, “Profiling apple silicon performance for ml training,” 2025. [Online]. Available: https://arxiv.org/abs/2501.14925

  60. [60]

    Augmenting a large language model with a combination of text and visual data for conversational visualization of global geospatial data,

    O. Mena, A. Kouyoumdjian, L. Besanc ¸on, M. Gleicher, I. Viola, and A. Ynnerman, “Augmenting a large language model with a combination of text and visual data for conversational visualization of global geospatial data,” 2025. [Online]. Available: https://arxiv.org/abs/2501. 09521

  61. [61]

    Amdahl’s law,

    J. L. Gustafson, “Amdahl’s law,” inEncyclopedia of Parallel Computing. Springer, 2011, pp. 53–60. Dawar Khan received his Ph.D. from the NLPR, Institute of Automation, University of Chinese Academy of Sciences, Beijing, China, in 2018. He served as an Assistant Professor at the Nara Institute of Science and Technology (NAIST), Japan, from 2018 to 2020, an...