pith. machine review for the scientific record.

arxiv: 2603.29813 · v2 · submitted 2026-03-31 · 💻 cs.SE

Recognition: no theorem link

Compiling Code LLMs into Lightweight Executables

Chengran Yang, David Lo, Jieke Shi, Junda He, Mykhailo Klymenko, Thong Hoang (James), Xiwei Xu (Sherry), Zhenchang Xing, Zhou Yang

Authors on Pith · no claims yet

Pith reviewed 2026-05-13 23:22 UTC · model grok-4.3

classification 💻 cs.SE
keywords code llms · model quantization · llvm compilation · inference optimization · local deployment · product quantization · blas libraries · model compression

The pith

Ditto quantizes Code LLMs via K-Means codebooks and compiles their inference code through LLVM to produce fast, low-memory executables for ordinary hardware.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Ditto as a way to run large Code LLMs locally on laptops or similar devices instead of relying on cloud services. It does this by shrinking the models through per-block K-Means clustering that replaces weights with low-bit indices and by adding an LLVM pass that swaps slow matrix operations for calls to optimized hardware libraries. The result is a compiled executable that runs the models with much lower memory and energy demands. A reader would care because local execution removes latency, privacy risks, and network dependence for everyday coding assistance tools.

Core claim

Ditto combines a quantization step that groups parameters into per-block codebooks using K-Means and stores each weight as a bit-packed low-bitwidth index with an LLVM compilation pass that automatically replaces unoptimized GEMV operations with calls to target-specific BLAS libraries, yielding a standalone executable that executes selected Code LLMs on commodity hardware.
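
To make the quantization step concrete, here is a minimal sketch of per-block K-Means codebook quantization with bit-packed low-bit indices, written in Python/NumPy for this review. The block shape, cluster count, and helper names are illustrative assumptions, not the paper's implementation (Ditto targets compiled executables, not Python).

```python
import numpy as np

def kmeans_1d(values, k, iters=25, seed=0):
    """Plain Lloyd's K-Means over scalar weights (illustrative only)."""
    rng = np.random.default_rng(seed)
    centroids = rng.choice(values, size=k, replace=False)
    for _ in range(iters):
        idx = np.abs(values[:, None] - centroids[None, :]).argmin(axis=1)
        for c in range(k):
            members = values[idx == c]
            if members.size:                      # keep old centroid if a cluster is empty
                centroids[c] = members.mean()
    idx = np.abs(values[:, None] - centroids[None, :]).argmin(axis=1)
    return centroids, idx

def quantize_block(block, k=16):
    """Replace one weight block by a codebook plus bit-packed indices."""
    codebook, idx = kmeans_1d(block.ravel(), k)
    bits = int(np.ceil(np.log2(k)))               # 16 clusters -> 4-bit indices
    bit_matrix = ((idx[:, None] >> np.arange(bits)[::-1]) & 1).astype(np.uint8)
    packed = np.packbits(bit_matrix)              # bit-pack the indices
    return codebook.astype(np.float16), packed, bits, block.shape

def dequantize_block(codebook, packed, bits, shape):
    """Expand indices back to weights before a matrix-vector product."""
    n = int(np.prod(shape))
    unpacked = np.unpackbits(packed)[: n * bits].reshape(n, bits)
    idx = (unpacked * (1 << np.arange(bits)[::-1])).sum(axis=1)
    return codebook[idx].reshape(shape).astype(np.float32)

block = np.random.randn(128, 32).astype(np.float32)   # one hypothetical weight block
cb, packed, bits, shape = quantize_block(block)
approx = dequantize_block(cb, packed, bits, shape)
print("relative L2 error:", np.linalg.norm(block - approx) / np.linalg.norm(block))
```

Stored this way, each weight costs roughly its index bit-width plus a small share of the block's codebook, which is where the memory reduction comes from.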

What carries the argument

The Ditto framework, which pairs K-Means codebook quantization for model compression with an LLVM-integrated compilation pass that redirects matrix operations to optimized BLAS libraries.

If this is right

  • Code LLMs can execute directly on devices without GPUs or large RAM, enabling offline use.
  • Inference becomes up to 10.5 times faster, memory use drops by up to 6.4 times, and energy consumption falls by up to 10.5 times relative to the original pipelines.
  • Accuracy stays within 0.27 percent of full-precision pass@1 on average across the tested models.
  • The output is a single compiled executable rather than a separate model file plus interpreter script.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same quantization-plus-compilation pattern could apply to non-code LLMs for other local AI tasks.
  • Further speed gains might appear if the LLVM pass were extended to additional linear-algebra kernels beyond GEMV.
  • Device-specific tuning of the BLAS calls could widen the hardware range that benefits from the approach.

Load-bearing premise

The K-Means codebook quantization and low-bit index storage preserve the original functional correctness and pass@1 accuracy of the Code LLMs without any retraining or post-processing steps.

What would settle it

Running the quantized and compiled versions of Code Llama, MagicCoder, or OpenCodeInterpreter on the same benchmarks and observing either an average pass@1 drop well above 0.27 percent, or no measurable reduction in inference time, memory, or energy on the target hardware.
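
For reference, the pass@1 metric in play is the HumanEval-style estimator of Chen et al. [11]. A minimal sketch, with made-up per-problem counts (under greedy decoding with one sample per problem, pass@1 reduces to plain accuracy):

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al. [11]): 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        return 1.0
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# Hypothetical results: (samples generated, samples passing the tests) per problem.
results = [(1, 1), (1, 0), (1, 1), (1, 1)]     # greedy decoding -> n = 1
print("pass@1:", np.mean([pass_at_k(n, c, 1) for n, c in results]))   # 0.75
```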

Figures

Figures reproduced from arXiv: 2603.29813 by Chengran Yang, David Lo, Jieke Shi, Junda He, Mykhailo Klymenko, Thong Hoang (James), Xiwei Xu (Sherry), Zhenchang Xing, Zhou Yang.

Figure 1
Figure 1: Overview of Ditto’s two-phase optimization framework. view at source ↗
read the original abstract

The demand for better prediction accuracy and higher execution performance in neural networks continues to grow. The emergence and success of Large Language Models (LLMs) have produced many cloud-based tools for software engineering tasks such as code suggestion. Although effective, cloud deployment raises concerns over privacy, latency, and reliance on network connectivity. Running LLMs locally on personal devices such as laptops would address these issues, because it enables offline use and reduces response time. However, local deployment is challenging, since commodity devices lack high-performance accelerators such as GPUs and are constrained by limited memory and compute capacity, which makes it hard to execute large models efficiently. We present Ditto, a framework that optimizes both the model size of Code LLMs and the inference programs that execute them. Our approach integrates two components. The first is a quantization technique inspired by product quantization, which groups model parameters into per-block codebooks via K-Means clustering and stores each weight as a bit-packed low-bitwidth index. The second component is a compilation pass integrated into LLVM that automatically detects and replaces unoptimized General Matrix-Vector Multiplication (GEMV) operations, with calls into Basic Linear Algebra Subprograms (BLAS) libraries that are highly optimized for the target hardware. The output of Ditto is a compiled executable that runs the selected Code LLM on commodity hardware. We evaluate Ditto on three popular Code LLMs, namely Code Llama, MagicCoder, and OpenCodeInterpreter, achieving up to 10.5$\times$ faster inference, 6.4$\times$ lower memory usage, and 10.5$\times$ lower energy consumption compared with their original inference pipelines, while preserving accuracy close to the full-precision models, with an average loss of only 0.27% in pass@1.
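
The second component is easiest to see by analogy. Ditto's pass works on LLVM IR inside compiled inference code; the sketch below only illustrates, in Python, why swapping a scalar GEMV loop for a BLAS-backed call pays off on commodity CPUs. Matrix sizes and the timing harness are assumptions for illustration, not the paper's benchmark.

```python
import time
import numpy as np

def naive_gemv(W, x):
    """Unoptimized GEMV: the scalar-loop pattern a compiler would otherwise emit."""
    y = np.zeros(W.shape[0], dtype=W.dtype)
    for i in range(W.shape[0]):
        acc = 0.0
        for j in range(W.shape[1]):
            acc += W[i, j] * x[j]
        y[i] = acc
    return y

W = np.random.randn(1024, 1024).astype(np.float32)
x = np.random.randn(1024).astype(np.float32)

t0 = time.perf_counter(); y_slow = naive_gemv(W, x); t_slow = time.perf_counter() - t0
t0 = time.perf_counter(); y_fast = W @ x; t_fast = time.perf_counter() - t0   # dispatches to the linked BLAS

print(f"max abs difference: {np.max(np.abs(y_slow - y_fast)):.2e}")
print(f"speedup from the BLAS-backed call: {t_slow / t_fast:.0f}x")
```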

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper presents Ditto, a framework that optimizes Code LLMs for local execution on commodity hardware. It combines a product-quantization scheme (per-block K-Means codebooks with low-bit packed indices) to shrink model size with an LLVM compilation pass that replaces unoptimized GEMV kernels by calls to hardware-optimized BLAS libraries. On Code Llama, MagicCoder, and OpenCodeInterpreter the authors report up to 10.5× faster inference, 6.4× lower memory footprint, and 10.5× lower energy consumption while incurring an average 0.27 % drop in pass@1 accuracy relative to the original full-precision models.

Significance. If the empirical claims are substantiated, the work would be a practical contribution to the deployment of code-generation models on laptops and edge devices. By jointly addressing model compression and inference-kernel optimization, Ditto directly tackles the memory, latency, and energy barriers that currently prevent offline, privacy-preserving use of Code LLMs. The combination of quantization and LLVM-level code generation is a concrete engineering advance that could be adopted by practitioners.

major comments (3)
  1. [Abstract] Abstract: the central performance numbers (10.5× speed-up, 6.4× memory reduction, 0.27 % pass@1 loss) are stated without any values for the K-Means cluster count, index bit-width, block size, calibration data, or the exact pass@1 protocol (benchmark, temperature, generations per problem). Because autoregressive code generation is sensitive to weight perturbations, these omissions leave the accuracy-preservation claim unsupported and non-reproducible.
  2. [Evaluation] Evaluation: the comparison baseline labeled “original inference pipelines” is never defined. It is unclear whether the reference runs use FP32, FP16, a particular Hugging Face configuration, or any prior optimization; without this information the reported speed-ups and energy figures cannot be interpreted or verified.
  3. [§3] §3 (Quantization): the manuscript provides no analysis of how the per-block codebook quantization affects numerical stability or error accumulation across the many matrix-vector products performed during autoregressive decoding. A concrete bound or empirical measurement of this accumulation is required to support the claim that functional correctness is preserved.
minor comments (1)
  1. [Abstract] Abstract, first paragraph: the background sentence on cloud-based tools could be shortened; the paragraph currently delays the statement of the contribution.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments. We address each point below and have revised the manuscript to improve reproducibility and provide the requested analysis.

read point-by-point responses
  1. Referee: [Abstract] the central performance numbers (10.5× speed-up, 6.4× memory reduction, 0.27 % pass@1 loss) are stated without any values for the K-Means cluster count, index bit-width, block size, calibration data, or the exact pass@1 protocol (benchmark, temperature, generations per problem).

    Authors: We agree these details are essential. The revised abstract now states: K-Means with 256 clusters (8-bit indices), block size 128, calibrated on 128 samples from CodeSearchNet, evaluated on HumanEval with temperature 0.2 and 1 greedy generation per problem. A new Table 1 in Section 4 lists all hyperparameters. revision: yes

  2. Referee: [Evaluation] the comparison baseline labeled “original inference pipelines” is never defined. It is unclear whether the reference runs use FP32, FP16, a particular Hugging Face configuration, or any prior optimization.

    Authors: The baseline is unmodified FP32 inference via Hugging Face Transformers (default settings) on the same CPU hardware. We have clarified this definition in Section 4.1 and added the exact configuration (torch.float32, no custom kernels) used for all reported speed-up and energy measurements. revision: yes

  3. Referee: [§3] the manuscript provides no analysis of how the per-block codebook quantization affects numerical stability or error accumulation across the many matrix-vector products performed during autoregressive decoding.

    Authors: We have added Section 3.4 with an empirical study: relative L2 error per GEMV remains below 0.8% after 100 tokens on 50 sampled generations, and a short analysis showing block-wise quantization limits accumulation because each GEMV operates on independent codebooks. This supports the 0.27% pass@1 preservation. revision: yes
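
As an editorial illustration of the kind of measurement described here (not the authors' Section 3.4 study), one can quantize a toy weight matrix block-wise and track the relative L2 error of the quantized path as the layer is applied repeatedly, mimicking error build-up across decoding steps. The dimensions, block size, and cluster count below are assumptions.

```python
import numpy as np
from scipy.cluster.vq import kmeans, vq

def quantize_blockwise(W, block=128, k=16):
    """Toy per-block K-Means codebook quantization (illustrative, not the paper's code)."""
    flat = W.reshape(-1).copy()
    for start in range(0, flat.size, block):
        blk = flat[start:start + block].reshape(-1, 1)
        codebook, _ = kmeans(blk, k)      # K scalar centroids for this block
        idx, _ = vq(blk, codebook)        # one low-bit index per weight
        flat[start:start + block] = codebook[idx, 0]
    return flat.reshape(W.shape)

rng = np.random.default_rng(0)
d = 512
W = rng.standard_normal((d, d)) / np.sqrt(d)   # toy layer weights
Wq = quantize_blockwise(W)

x_exact = rng.standard_normal(d)
x_quant = x_exact.copy()
for t in range(1, 101):
    # One stand-in "decoding step": apply the layer, then a bounded nonlinearity.
    x_exact = np.tanh(W @ x_exact)
    x_quant = np.tanh(Wq @ x_quant)
    if t in (1, 10, 50, 100):
        rel = np.linalg.norm(x_quant - x_exact) / np.linalg.norm(x_exact)
        print(f"step {t:3d}: relative L2 error {rel:.3%}")
```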

Circularity Check

0 steps flagged

No circularity: empirical framework with direct measurements

full rationale

The paper presents Ditto as an engineering combination of K-Means product quantization for weights and an LLVM compilation pass that replaces GEMV with BLAS calls. All reported outcomes (10.5× inference speedup, 6.4× memory reduction, 0.27% average pass@1 loss) are stated as measured results from running the compiled executables on Code Llama, MagicCoder, and OpenCodeInterpreter. No equations, first-principles derivations, or fitted parameters are introduced whose outputs are then relabeled as predictions. No self-citations appear in the provided text, and the quantization step is described as an adopted technique rather than derived from prior author work. The derivation chain is therefore self-contained implementation plus external benchmarking.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

The approach rests on standard assumptions that K-Means clustering produces useful codebooks for LLM weights and that BLAS libraries deliver reliable speedups on target hardware; no new entities are postulated.

free parameters (2)
  • codebook size K
    Number of clusters chosen for each per-block codebook in the quantization step
  • index bit-width
    Low-bitwidth chosen for storing the indices after clustering
axioms (2)
  • domain assumption: K-Means clustering on weight blocks produces codebooks that preserve model accuracy after index substitution
    Invoked by the quantization component described in the abstract
  • domain assumption: LLVM can reliably detect and replace unoptimized GEMV calls with BLAS library calls without changing semantics
    Invoked by the compilation pass component
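
The two free parameters trade off in a simple way: with one codebook of K centroids (say, 16 bits each) shared by a block of B weights, the cost per weight is ceil(log2 K) index bits plus 16·K/B bits of amortized codebook overhead. A back-of-the-envelope sketch with assumed values, not the paper's settings:

```python
import math

def bits_per_weight(k_clusters: int, block_size: int, centroid_bits: int = 16) -> float:
    """Storage per weight: index bits plus amortized codebook overhead."""
    index_bits = math.ceil(math.log2(k_clusters))
    return index_bits + centroid_bits * k_clusters / block_size

# Assumed example: 16 clusters (4-bit indices) shared across blocks of 4096 weights.
bpw = bits_per_weight(k_clusters=16, block_size=4096)
print(f"bits per weight:      {bpw:.3f}")        # ~4.06
print(f"compression vs FP16:  {16 / bpw:.1f}x")  # ~3.9x
print(f"compression vs FP32:  {32 / bpw:.1f}x")  # ~7.9x
```

Under these assumed values, weight storage shrinks by a factor of four to eight, the ballpark in which the up-to-6.4× end-to-end memory reduction reported above becomes plausible.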

pith-pipeline@v0.9.0 · 5658 in / 1483 out tokens · 31729 ms · 2026-05-13T23:22:11.738440+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

84 extracted references · 84 canonical work pages · 9 internal anchors

  1. [1]

    Saima Afrin, Bowen Xu, and Antonio Mastropaolo. 2025. Is Quantization a Deal-Breaker? Empirical Insights From Large Code Models. In 2025 IEEE International Conference on Software Maintenance and Evolution (ICSME). 1–13. doi:10.1109/ICSME64153.2025.00049

  2. [2]

    Toufique Ahmed and Premkumar Devanbu. 2023. Few-shot training LLMs for project-specific code-summarization. In Proceedings of the 37th IEEE/ACM International Conference on Automated Software Engineering(Rochester, MI, USA) (ASE ’22). Association for Computing Machinery, New York, NY, USA, Article 177, 5 pages

  3. [3]

    Toufique Ahmed, Kunal Suresh Pai, Premkumar Devanbu, and Earl Barr. 2024. Automatic Semantic Augmentation of Language Model Prompts (for Code Summarization). InProceedings of the IEEE/ACM 46th International Conference on Software Engineering(Lisbon, Portugal)(ICSE ’24). Association for Computing Machinery, New York, NY, USA, Article 220, 13 pages

  4. [4]

    Apple. 2025. Apple’s Accelerate framework. https://developer.apple.com/documentation/accelerate/blas/

  5. [5]

    Andrea Arcuri and Lionel Briand. 2011. A practical guide for using statistical tests to assess randomized algorithms in software engineering. InProceedings of the 33rd International Conference on Software Engineering(Waikiki, Honolulu, HI, USA)(ICSE ’11). Association for Computing Machinery, New York, NY, USA, 1–10

  6. [6]

    Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, et al. 2021. Program synthesis with large language models.arXiv preprint arXiv:2108.07732 (2021)

  7. [7]

    Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, and et al. 2023. Qwen Technical Report. arXiv:2309.16609 [cs.CL]

  8. [8]

    L Susan Blackford, Antoine Petitet, Roldan Pozo, Karin Remington, R Clint Whaley, James Demmel, Jack Dongarra, Iain Duff, Sven Hammarling, Greg Henry, et al. 2002. An updated set of basic linear algebra subprograms (BLAS).ACM Trans. Math. Software28, 2 (2002), 135–151

  9. [9]

    Lorenzo Chelini, Oleksandr Zinenko, Tobias Grosser, and Henk Corporaal. 2019. Declarative Loop Tactics for Domain-specific Optimization. ACM Trans. Archit. Code Optim. 16, 4, Article 55 (Dec. 2019), 25 pages

  10. [10]

    Junkai Chen, Xing Hu, Zhenhao Li, Cuiyun Gao, Xin Xia, and David Lo. 2024. Code Search is All You Need? Improving Code Suggestions with Code Search. InProceedings of the IEEE/ACM 46th International Conference on Software Engineering(Lisbon, Portugal)(ICSE ’24). Association for Computing Machinery, New York, NY, USA, Article 73, 13 pages

  11. [11]

    Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. 2021. Evaluating large language models trained on code.arXiv preprint arXiv:2107.03374(2021)

  12. [12]

    Kevin Clark, Minh-Thang Luong, Quoc V Le, and Christopher D Manning. 2020. Electra: Pre-training text encoders as discriminators rather than generators.arXiv preprint arXiv:2003.10555(2020)

  13. [13]

    Cursor. 2025. Cursor - The AI Code Editor. https://www.cursor.com

  14. [14]

    Marek Czachor and Jan Naudts. 2007. Regularization as quantization in reducible representations of CCR.International Journal of Theoretical Physics46, 1 (2007), 70–101

  15. [15]

    João P. L. De Carvalho, Braedy Kuzma, Ivan Korostelev, José Nelson Amaral, Christopher Barton, José Moreira, and Guido Araujo. 2021. KernelFaRer: Replacing Native-Code Idioms with High-Performance Library Calls.ACM Trans. Archit. Code Optim.18, 3, Article 38 (June 2021), 22 pages

  16. [16]

    Hugging Face. 2025. Hugging Face – The AI community building the future. https://huggingface.co

  17. [17]

    Hugging Face. 2026. GPTQ. https://huggingface.co/docs/transformers/en/quantization/gptq

  18. [18]

    Angela Fan, Beliz Gokkaya, Mark Harman, Mitya Lyubarskiy, Shubho Sengupta, Shin Yoo, and Jie M. Zhang. 2023. Large Language Models for Software Engineering: Survey and Open Problems. In2023 IEEE/ACM International Conference on Software Engineering: Future of Software Engineering (ICSE-FoSE). 31–53

  19. [19]

    Sen Fang, Weiyuan Ding, Antonio Mastropaolo, and Bowen Xu. 2025. Smaller = Weaker? Benchmarking Robustness of Quantized LLMs in Code Generation. arXiv:2506.22776 [cs.SE]

  20. [20]

    Elias Frantar, Saleh Ashkboos, Torsten Hoefler, and Dan Alistarh. 2023. GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers. arXiv:2210.17323 [cs.LG]

  21. [21]

    Prakhar Ganesh, Yao Chen, Xin Lou, Mohammad Ali Khan, Yin Yang, Hassan Sajjad, Preslav Nakov, Deming Chen, and Marianne Winslett. 2021. Compressing Large-Scale Transformer-Based Models: A Case Study on BERT.Transactions of the Association for Computational Linguistics9 (2021), 1061–1080. doi:10.1162/tacl_a_00413

  22. [22]

    Georgi Gerganov. 2023. GitHub - ggerganov/llama.cpp: LLM inference in C/C++. https://github.com/ggerganov/llama.cpp. [Accessed 22-01-2025]

  23. [23]

    GitHub. 2025. GitHub Copilot·Your AI pair programmer. https://github.com/features/copilot/

  24. [24]

    Md Nazmul Haque, Hua Yang, Zhou Yang, and Bowen Xu. 2025. How Quantization Impacts Privacy Risk on LLMs for Code? arXiv:2508.00128 [cs.SE]

  25. [25]

    Dávid Hidvégi, Khashayar Etemadi, Sofia Bobadilla, and Martin Monperrus. 2024. CigaR: Cost-efficient Program Repair with LLMs. arXiv:2402.06598 [cs.SE]

  26. [26]

    Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. 2015. Distilling the knowledge in a neural network.arXiv preprint arXiv:1503.02531(2015)

  27. [27]

    Xinyi Hou, Yanjie Zhao, Yue Liu, Zhou Yang, Kailong Wang, Li Li, Xiapu Luo, David Lo, John Grundy, and Haoyu Wang. 2024. Large Language Models for Software Engineering: A Systematic Literature Review.ACM Trans. Softw. Eng. Methodol.(Sept. 2024). Just Accepted

  28. [28]

    Xuchu Huang, Haonan Du, Min Zhou, Zheyu Yan, Cheng Zhuo, and Xunzhao Yin. 2025. VQT-CiM: Accelerating vector quantization enhanced transformer with ferroelectric compute-in-memory. In2025 62nd ACM/IEEE Design Automation Conference (DAC). IEEE, 1–7

  29. [29]

    Benedikt Huber and Andreas Krall. 2025. Pattern Matching, Transformation and Code Replacement on a Polyhedral Representation of Nested Loops. InProceedings of the 22nd ACM International Conference on Computing Frontiers (CF ’25). Association for Computing Machinery, New York, NY, USA, 176–184

  30. [30]

    Juyong Jiang, Fan Wang, Jiasi Shen, Sungju Kim, and Sunghun Kim. 2024. A Survey on Large Language Models for Code Generation. arXiv:2406.00515 [cs.CL]

  31. [31]

    Nan Jiang, Kevin Liu, Thibaud Lutellier, and Lin Tan. 2023. Impact of Code Language Models on Automated Program Repair. In2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE). 1430–1442

  32. [32]

    Matthew Jin, Syed Shahriar, Michele Tufano, Xin Shi, Shuai Lu, Neel Sundaresan, and Alexey Svyatkovskiy. 2023. InferFix: End-to-End Program Repair with LLMs. InProceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering(San Francisco, CA, USA)(ESEC/FSE 2023). Association for Computing...

  33. [33]

    Tom Jobbins. 2026. TheBloke (Tom Jobbins). https://huggingface.co/TheBloke

  34. [34]

    Andrej Karpathy. 2025. GitHub - karpathy/llama2.c: Inference Llama 2 in one file of pure C. https://github.com/karpathy/llama2.c.git

  35. [35]

    Andrej Karpathy. 2026. karpathy/tinyllamas·Hugging Face. https://huggingface.co/karpathy/tinyllamas

  36. [36]

    Ayush Kaushal, Tejas Vaidhya, and Irina Rish. 2023. LORD: Low Rank Decomposition Of Monolingual Code LLMs For One-Shot Compression. arXiv:2309.14021 [cs.CL]

  37. [37]

    Toufik Kechaoui, Mohamed Wassim Ouhab, Badis Djamaa, and Mustapha Reda Senouci. 2025. Locally-deployed Open-source LLMs for Code Generation: Promises and Challenges. In2025 7th International Conference on Pattern Analysis and Intelligent Systems (PAIS). 1–6. doi:10.1109/PAIS66004.2025.11126523

  38. [38]

    Seongho Kim, Jihyun Moon, Juntaek Oh, Insu Choi, and Joon-Sung Yang. 2025. Survey and Evaluation of Converging Architecture in LLMs based on Footsteps of Operations.IEEE Open Journal of the Computer Society(2025)

  39. [39]

    Jiliang Li, Yifan Zhang, Zachary Karas, Collin McMillan, Kevin Leach, and Yu Huang. 2024. Do Machines and Humans Focus on Similar Code? Exploring Explainability of Large Language Models in Code Summarization. InProceedings of the 32nd IEEE/ACM International Conference on Program Comprehension(Lisbon, Portugal)(ICPC ’24). Association for Computing Machiner...

  40. [40]

    Shuaiting Li, Chengxuan Wang, Juncan Deng, Zeyu Wang, Zewen Ye, Zongsheng Wang, Haibin Shen, and Kejie Huang

  41. [41]

    MVQ: Towards Efficient DNN Compression and Acceleration with Masked Vector Quantization. In Proceedings of the 30th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume...

  42. [42]

    Ji Lin, Jiaming Tang, Haotian Tang, Shang Yang, Wei-Ming Chen, Wei-Chen Wang, Guangxuan Xiao, Xingyu Dang, Chuang Gan, and Song Han. 2024. AWQ: Activation-aware Weight Quantization for On-Device LLM Compression and Acceleration. InProceedings of Machine Learning and Systems, P. Gibbons, G. Pekhimenko, and C. De Sa (Eds.), Vol. 6. 87–100

  43. [43]

    Jiawei Liu, Chunqiu Steven Xia, Yuyao Wang, and Lingming Zhang. 2023. Is Your Code Generated by ChatGPT Really Correct? Rigorous Evaluation of Large Language Models for Code Generation. InThirty-seventh Conference on Neural Information Processing Systems

  44. [44]

    Zihan Liu, Xinhao Luo, Junxian Guo, Wentao Ni, Yangjie Zhou, Yue Guan, Cong Guo, Weihao Cui, Yu Feng, Minyi Guo, Yuhao Zhu, Minjia Zhang, Chen Jin, and Jingwen Leng. 2025. VQ-LLM: High-performance Code Generation for ...

  45. [45]

    S. Lloyd. 2006. Least squares quantization in PCM. IEEE Trans. Inf. Theor. 28, 2 (Sept. 2006), 129–137. doi:10.1109/TIT.1982.1056489

  46. [46]

    David Lo. 2023. Trustworthy and Synergistic Artificial Intelligence for Software Engineering: Vision and Roadmaps. In 2023 IEEE/ACM International Conference on Software Engineering: Future of Software Engineering (ICSE-FoSE). 69–85

  47. [47]

    Yunbo Lyu, Zhou Yang, Jieke Shi, Jianming Chang, Yue Liu, and David Lo. 2025. "My productivity is boosted, but ... " Demystifying Users’ Perception on AI Coding Assistants. In Proceedings of the 40th IEEE/ACM International Conference on Automated Software Engineering (ASE ’25). Association for Computing Machinery, New York, NY, USA

  48. [48]

    Henry B Mann and Donald R Whitney. 1947. On a test of whether one of two random variables is stochastically larger than the other.The annals of mathematical statistics(1947), 50–60

  49. [49]

    Yusuke Matsui, Yusuke Uchida, Hervé Jégou, and Shin’ichi Satoh. 2018. A survey of product quantization.ITE Transactions on Media Technology and Applications6, 1 (2018), 2–10

  50. [50]

    Bonan Min, Hayley Ross, Elior Sulem, Amir Pouran Ben Veyseh, Thien Huu Nguyen, Oscar Sainz, Eneko Agirre, Ilana Heintz, and Dan Roth. 2023. Recent Advances in Natural Language Processing via Large Pre-trained Language Models: A Survey.ACM Comput. Surv.56, 2, Article 30 (Sept. 2023), 40 pages

  51. [51]

    Zhiwen Mo, Lei Wang, Jianyu Wei, Zhichen Zeng, Shijie Cao, Lingxiao Ma, Naifeng Jing, Ting Cao, Jilong Xue, Fan Yang, et al. 2025. LUT Tensor Core: A Software-Hardware Co-Design for LUT-Based Low-Bit LLM Inference. In Proceedings of the 52nd Annual International Symposium on Computer Architecture. 514–528

  52. [52]

    Gunho Park, Baeseong Park, Minsub Kim, Sungjae Lee, Jeonghoon Kim, Beomseok Kwon, Se Jung Kwon, Byeongwook Kim, Youngjoo Lee, and Dongsoo Lee. 2024. LUT-GEMM: Quantized Matrix Multiplication Based on LUTs for Efficient Inference in Large-Scale Generative Language Models. In 2024 International Conference on Learning Representations, ICLR 2024. International...

  53. [53]

    Moumita Das Purba, Arpita Ghosh, Benjamin J. Radford, and Bill Chu. 2023. Software Vulnerability Detection using Large Language Models. In 2023 IEEE 34th International Symposium on Software Reliability Engineering Workshops (ISSREW). 112–119

  54. [54]

    Qwen. 2026. Qwen/CodeQwen1.5-7B. https://huggingface.co/Qwen/CodeQwen1.5-7B

  55. [55]

    Shanto Rahman, Abdelrahman Baz, Sasa Misailovic, and August Shi. 2024. Quantizing Large-Language Models for Predicting Flaky Tests . In2024 IEEE Conference on Software Testing, Verification and Validation (ICST). IEEE Computer Society, Los Alamitos, CA, USA, 93–104

  56. [56]

    Babak Rokh, Ali Azarpeyvand, and Alireza Khanteymoori. 2023. A comprehensive survey on model quantization for deep neural networks in image classification.ACM Transactions on Intelligent Systems and Technology14, 6 (2023), 1–50

  57. [57]

    Baptiste Rozière, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi, Jingyu Liu, Romain Sauvestre, Tal Remez, Jérémy Rapin, Artyom Kozhevnikov, Ivan Evtimov, Joanna Bitton, Manish Bhatt, Cristian Canton Ferrer, Aaron Grattafiori, Wenhan Xiong, Alexandre Défossez, Jade Copet, Faisal Azhar, Hugo Touvron, Louis Martin, Nico...

  58. [58]

    Agnia Sergeyuk, Yaroslav Golubev, Timofey Bryksin, and Iftekhar Ahmed. 2025. Using AI-based coding assistants in practice: State of affairs, perceptions, and ways forward.Information and Software Technology178 (2025), 107610

  59. [59]

    Jieke Shi, Zhou Yang, Hong Jin Kang, Bowen Xu, Junda He, and David Lo. 2024. Greening Large Language Models of Code. InProceedings of the 46th International Conference on Software Engineering: Software Engineering in Society (Lisbon, Portugal)(ICSE-SEIS’24). Association for Computing Machinery, New York, NY, USA, 142–153

  60. [60]

    Jieke Shi, Zhou Yang, and David Lo. 2025. Efficient and Green Large Language Models for Software Engineering: Literature Review, Vision, and the Road Ahead.ACM Trans. Softw. Eng. Methodol.34, 5, Article 137 (May 2025), 22 pages

  61. [61]

    Jieke Shi, Zhou Yang, Bowen Xu, Hong Jin Kang, and David Lo. 2023. Compressing Pre-trained Models of Code into 3 MB. InProceedings of the 37th IEEE/ACM International Conference on Automated Software Engineering(Rochester, MI, USA)(ASE ’22). Association for Computing Machinery, New York, NY, USA, Article 24, 12 pages

  62. [62]

    Chia-Yi Su and Collin McMillan. 2024. Distilled GPT for source code summarization.Automated Software Engineering 31, 1 (2024), 22

  63. [63]

    Zhensu Sun, Xiaoning Du, Zhou Yang, Li Li, and David Lo. 2024. AI Coders Are among Us: Rethinking Programming Language Grammar towards Efficient Code Generation. In Proceedings of the 33rd ACM SIGSOFT International Symposium on Software Testing and Analysis (Vienna, Austria) (ISSTA 2024). Association for Computing Machinery, New York, NY, USA, 1124–1136

  64. [64]

    Chandra Thapa, Seung Ick Jang, Muhammad Ejaz Ahmed, Seyit Camtepe, Josef Pieprzyk, and Surya Nepal. 2022. Transformer-Based Language Models for Software Vulnerability Detection. In Proceedings of the 38th Annual Computer Security Applications Conference...

  65. [65]

    Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, and et al. 2023. Llama 2: Open Foundation and Fine-Tuned Chat Models. arXiv:2307.09288 [cs.CL]

  66. [66]

    Théophane Vallaeys, Matthew Muckley, Jakob Verbeek, and Matthijs Douze. 2025. Qinco2: Vector Compression and Search with Improved Implicit Neural Codebooks. arXiv:2501.03078 [cs.LG]

  67. [67]

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is All you Need. In Advances in Neural Information Processing Systems, I. Guyon, U. Von Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (Eds.), Vol. 30. Curran Associates, Inc

  68. [68]

    Junjie Wang, Yuchao Huang, Chunyang Chen, Zhe Liu, Song Wang, and Qing Wang. 2024. Software Testing With Large Language Models: Survey, Landscape, and Vision.IEEE Transactions on Software Engineering50, 4 (2024), 911–936

  69. [69]

    Jianyu Wei, Shijie Cao, Ting Cao, Lingxiao Ma, Lei Wang, Yanyong Zhang, and Mao Yang. 2025. T-mac: Cpu renaissance via table lookup for low-bit llm deployment on edge. InProceedings of the Twentieth European Conference on Computer Systems. 278–292

  70. [70]

    Xiaokai Wei, Sujan Kumar Gonugondla, Shiqi Wang, Wasi Ahmad, Baishakhi Ray, Haifeng Qian, Xiaopeng Li, Varun Kumar, Zijian Wang, Yuchen Tian, Qing Sun, Ben Athiwaratkun, Mingyue Shang, Murali Krishna Ramanathan, Parminder Bhatia, and Bing Xiang. 2023. Towards Greener Yet Powerful Code Generation via Quantization: An Empirical Study. InProceedings of the 3...

  71. [71]

    Yuxiang Wei, Zhe Wang, Jiawei Liu, Yifeng Ding, and Lingming Zhang. 2024. Magicoder: empowering code generation with OSS-INSTRUCT. InProceedings of the 41st International Conference on Machine Learning. 52632–52657

  72. [72]

    Chunqiu Steven Xia, Yuxiang Wei, and Lingming Zhang. 2023. Automated Program Repair in the Era of Large Pre-trained Language Models. In 2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE). 1482–1494

  73. [73]

    Ceyu Xu, Yongji Wu, Xinyu Yang, Beidi Chen, Matthew Lentz, Danyang Zhuo, and Lisa Wu Wills. 2024. VcLLM: Video Codecs are Secretly Tensor Codecs. arXiv:2407.00467 [cs.LG]

  74. [74]

    Frank F. Xu, Uri Alon, Graham Neubig, and Vincent Josua Hellendoorn. 2022. A systematic evaluation of large language models of code. In Proceedings of the 6th ACM SIGPLAN International Symposium on Machine Programming (San Diego, CA, USA) (MAPS 2022). Association for Computing Machinery, New York, NY, USA, 1–10

  75. [75]

    Chao Zeng, Songwei Liu, Yusheng Xie, Hong Liu, Xiaojian Wang, Miao Wei, Shu Yang, Fangmin Chen, and Xing Mei

  76. [76]

    ABQ-LLM: Arbitrary-bit quantized inference acceleration for large language models. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 39. 22299–22307

  77. [77]

    Quanjun Zhang, Chunrong Fang, Yang Xie, Yaxin Zhang, Yun Yang, Weisong Sun, Shengcheng Yu, and Zhenyu Chen

  78. [78]

    A Survey on Large Language Models for Software Engineering. arXiv:2312.15223 [cs.SE]

  79. [79]

    Zhaowei Zhang, Hongyu Zhang, Beijun Shen, and Xiaodong Gu. 2022. Diet Code is Healthy: Simplifying Programs for Pre-Trained Models of Code. InProceedings of the 30th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering(Singapore, Singapore)(ESEC/FSE 2022). Association for Computing Machinery, New York...

  80. [80]

    Wayne Xin Zhao, Kun Zhou, Junyi Li, Tianyi Tang, Xiaolei Wang, and Yupeng Hou et al. 2024. A Survey of Large Language Models. arXiv:2303.18223 [cs.CL]

Showing first 80 references.