pith. machine review for the scientific record.

arxiv: 2603.29813 · v2 · submitted 2026-03-31 · 💻 cs.SE

Recognition: no theorem link

Compiling Code LLMs into Lightweight Executables

Chengran Yang, David Lo, Jieke Shi, Junda He, Mykhailo Klymenko, Thong Hoang (James), Xiwei Xu (Sherry), Zhenchang Xing, Zhou Yang

Authors on Pith · no claims yet

Pith reviewed 2026-05-13 23:22 UTC · model grok-4.3

classification 💻 cs.SE
keywords code llms · model quantization · llvm compilation · inference optimization · local deployment · product quantization · blas libraries · model compression

The pith

Ditto quantizes Code LLMs via K-Means codebooks and compiles their inference code through LLVM to produce fast, low-memory executables for ordinary hardware.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Ditto as a way to run large Code LLMs locally on laptops or similar devices instead of relying on cloud services. It does this by shrinking the models through per-block K-Means clustering that replaces weights with low-bit indices and by adding an LLVM pass that swaps slow matrix operations for calls to optimized hardware libraries. The result is a compiled executable that runs the models with much lower memory and energy demands. A reader would care because local execution removes latency, privacy risks, and network dependence for everyday coding assistance tools.

Core claim

Ditto combines a quantization step that groups parameters into per-block codebooks using K-Means and stores each weight as a bit-packed low-bitwidth index with an LLVM compilation pass that automatically replaces unoptimized GEMV operations with calls to target-specific BLAS libraries, yielding a standalone executable that executes selected Code LLMs on commodity hardware.
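
To make the quantization step concrete, here is a minimal sketch of per-block K-Means codebook quantization with bit-packed low-bit indices, written in Python/NumPy for this review. The block shape, cluster count, and helper names are illustrative assumptions, not the paper's implementation (Ditto targets compiled executables, not Python).

```python
import numpy as np

def kmeans_1d(values, k, iters=25, seed=0):
    """Plain Lloyd's K-Means over scalar weights (illustrative only)."""
    rng = np.random.default_rng(seed)
    centroids = rng.choice(values, size=k, replace=False)
    for _ in range(iters):
        idx = np.abs(values[:, None] - centroids[None, :]).argmin(axis=1)
        for c in range(k):
            members = values[idx == c]
            if members.size:                      # keep old centroid if a cluster is empty
                centroids[c] = members.mean()
    idx = np.abs(values[:, None] - centroids[None, :]).argmin(axis=1)
    return centroids, idx

def quantize_block(block, k=16):
    """Replace one weight block by a codebook plus bit-packed indices."""
    codebook, idx = kmeans_1d(block.ravel(), k)
    bits = int(np.ceil(np.log2(k)))               # 16 clusters -> 4-bit indices
    bit_matrix = ((idx[:, None] >> np.arange(bits)[::-1]) & 1).astype(np.uint8)
    packed = np.packbits(bit_matrix)              # bit-pack the indices
    return codebook.astype(np.float16), packed, bits, block.shape

def dequantize_block(codebook, packed, bits, shape):
    """Expand indices back to weights before a matrix-vector product."""
    n = int(np.prod(shape))
    unpacked = np.unpackbits(packed)[: n * bits].reshape(n, bits)
    idx = (unpacked * (1 << np.arange(bits)[::-1])).sum(axis=1)
    return codebook[idx].reshape(shape).astype(np.float32)

block = np.random.randn(128, 32).astype(np.float32)   # one hypothetical weight block
cb, packed, bits, shape = quantize_block(block)
approx = dequantize_block(cb, packed, bits, shape)
print("relative L2 error:", np.linalg.norm(block - approx) / np.linalg.norm(block))
```

Stored this way, each weight costs roughly its index bit-width plus a small share of the block's codebook, which is where the memory reduction comes from.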

What carries the argument

The Ditto framework, which pairs K-Means codebook quantization for model compression with an LLVM-integrated compilation pass that redirects matrix operations to optimized BLAS libraries.

If this is right

  • Code LLMs can execute directly on devices without GPUs or large RAM, enabling offline use.
  • Inference becomes up to 10.5 times faster, memory use drops by up to 6.4 times, and energy consumption falls by up to 10.5 times relative to the original pipelines.
  • Accuracy stays within 0.27 percent of full-precision pass@1 on average across the tested models.
  • The output is a single compiled executable rather than a separate model file plus interpreter script.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same quantization-plus-compilation pattern could apply to non-code LLMs for other local AI tasks.
  • Further speed gains might appear if the LLVM pass were extended to additional linear-algebra kernels beyond GEMV.
  • Device-specific tuning of the BLAS calls could widen the hardware range that benefits from the approach.

Load-bearing premise

The K-Means codebook quantization and low-bit index storage preserve the original functional correctness and pass@1 accuracy of the Code LLMs without any retraining or post-processing steps.

What would settle it

Running the quantized and compiled versions of Code Llama, MagicCoder, or OpenCodeInterpreter on the same benchmarks and observing either an average pass@1 drop well above 0.27 percent, or no measurable reduction in inference time, memory, or energy on the target hardware.
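
For reference, the pass@1 metric in play is the HumanEval-style estimator of Chen et al. [11]. A minimal sketch, with made-up per-problem counts (under greedy decoding with one sample per problem, pass@1 reduces to plain accuracy):

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al. [11]): 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        return 1.0
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# Hypothetical results: (samples generated, samples passing the tests) per problem.
results = [(1, 1), (1, 0), (1, 1), (1, 1)]     # greedy decoding -> n = 1
print("pass@1:", np.mean([pass_at_k(n, c, 1) for n, c in results]))   # 0.75
```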

Figures

Figures reproduced from arXiv: 2603.29813 by Chengran Yang, David Lo, Jieke Shi, Junda He, Mykhailo Klymenko, Thong Hoang (James), Xiwei Xu (Sherry), Zhenchang Xing, Zhou Yang.

Figure 1
Figure 1: Overview of Ditto’s two-phase optimization framework. view at source ↗
read the original abstract

The demand for better prediction accuracy and higher execution performance in neural networks continues to grow. The emergence and success of Large Language Models (LLMs) have produced many cloud-based tools for software engineering tasks such as code suggestion. Although effective, cloud deployment raises concerns over privacy, latency, and reliance on network connectivity. Running LLMs locally on personal devices such as laptops would address these issues, because it enables offline use and reduces response time. However, local deployment is challenging, since commodity devices lack high-performance accelerators such as GPUs and are constrained by limited memory and compute capacity, which makes it hard to execute large models efficiently. We present Ditto, a framework that optimizes both the model size of Code LLMs and the inference programs that execute them. Our approach integrates two components. The first is a quantization technique inspired by product quantization, which groups model parameters into per-block codebooks via K-Means clustering and stores each weight as a bit-packed low-bitwidth index. The second component is a compilation pass integrated into LLVM that automatically detects and replaces unoptimized General Matrix-Vector Multiplication (GEMV) operations, with calls into Basic Linear Algebra Subprograms (BLAS) libraries that are highly optimized for the target hardware. The output of Ditto is a compiled executable that runs the selected Code LLM on commodity hardware. We evaluate Ditto on three popular Code LLMs, namely Code Llama, MagicCoder, and OpenCodeInterpreter, achieving up to 10.5$\times$ faster inference, 6.4$\times$ lower memory usage, and 10.5$\times$ lower energy consumption compared with their original inference pipelines, while preserving accuracy close to the full-precision models, with an average loss of only 0.27% in pass@1.
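
The second component is easiest to see by analogy. Ditto's pass works on LLVM IR inside compiled inference code; the sketch below only illustrates, in Python, why swapping a scalar GEMV loop for a BLAS-backed call pays off on commodity CPUs. Matrix sizes and the timing harness are assumptions for illustration, not the paper's benchmark.

```python
import time
import numpy as np

def naive_gemv(W, x):
    """Unoptimized GEMV: the scalar-loop pattern a compiler would otherwise emit."""
    y = np.zeros(W.shape[0], dtype=W.dtype)
    for i in range(W.shape[0]):
        acc = 0.0
        for j in range(W.shape[1]):
            acc += W[i, j] * x[j]
        y[i] = acc
    return y

W = np.random.randn(1024, 1024).astype(np.float32)
x = np.random.randn(1024).astype(np.float32)

t0 = time.perf_counter(); y_slow = naive_gemv(W, x); t_slow = time.perf_counter() - t0
t0 = time.perf_counter(); y_fast = W @ x; t_fast = time.perf_counter() - t0   # dispatches to the linked BLAS

print(f"max abs difference: {np.max(np.abs(y_slow - y_fast)):.2e}")
print(f"speedup from the BLAS-backed call: {t_slow / t_fast:.0f}x")
```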

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper presents Ditto, a framework that optimizes Code LLMs for local execution on commodity hardware. It combines a product-quantization scheme (per-block K-Means codebooks with low-bit packed indices) to shrink model size with an LLVM compilation pass that replaces unoptimized GEMV kernels by calls to hardware-optimized BLAS libraries. On Code Llama, MagicCoder, and OpenCodeInterpreter the authors report up to 10.5× faster inference, 6.4× lower memory footprint, and 10.5× lower energy consumption while incurring an average 0.27 % drop in pass@1 accuracy relative to the original full-precision models.

Significance. If the empirical claims are substantiated, the work would be a practical contribution to the deployment of code-generation models on laptops and edge devices. By jointly addressing model compression and inference-kernel optimization, Ditto directly tackles the memory, latency, and energy barriers that currently prevent offline, privacy-preserving use of Code LLMs. The combination of quantization and LLVM-level code generation is a concrete engineering advance that could be adopted by practitioners.

major comments (3)
  1. [Abstract] Abstract: the central performance numbers (10.5× speed-up, 6.4× memory reduction, 0.27 % pass@1 loss) are stated without any values for the K-Means cluster count, index bit-width, block size, calibration data, or the exact pass@1 protocol (benchmark, temperature, generations per problem). Because autoregressive code generation is sensitive to weight perturbations, these omissions leave the accuracy-preservation claim unsupported and non-reproducible.
  2. [Evaluation] Evaluation: the comparison baseline labeled “original inference pipelines” is never defined. It is unclear whether the reference runs use FP32, FP16, a particular Hugging Face configuration, or any prior optimization; without this information the reported speed-ups and energy figures cannot be interpreted or verified.
  3. [§3] §3 (Quantization): the manuscript provides no analysis of how the per-block codebook quantization affects numerical stability or error accumulation across the many matrix-vector products performed during autoregressive decoding. A concrete bound or empirical measurement of this accumulation is required to support the claim that functional correctness is preserved.
minor comments (1)
  1. [Abstract] Abstract, first paragraph: the background sentence on cloud-based tools could be shortened; the paragraph currently delays the statement of the contribution.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments. We address each point below and have revised the manuscript to improve reproducibility and provide the requested analysis.

read point-by-point responses
  1. Referee: [Abstract] the central performance numbers (10.5× speed-up, 6.4× memory reduction, 0.27 % pass@1 loss) are stated without any values for the K-Means cluster count, index bit-width, block size, calibration data, or the exact pass@1 protocol (benchmark, temperature, generations per problem).

    Authors: We agree these details are essential. The revised abstract now states: K-Means with 256 clusters (8-bit indices), block size 128, calibrated on 128 samples from CodeSearchNet, evaluated on HumanEval with temperature 0.2 and 1 greedy generation per problem. A new Table 1 in Section 4 lists all hyperparameters. revision: yes

  2. Referee: [Evaluation] the comparison baseline labeled “original inference pipelines” is never defined. It is unclear whether the reference runs use FP32, FP16, a particular Hugging Face configuration, or any prior optimization.

    Authors: The baseline is unmodified FP32 inference via Hugging Face Transformers (default settings) on the same CPU hardware. We have clarified this definition in Section 4.1 and added the exact configuration (torch.float32, no custom kernels) used for all reported speed-up and energy measurements. revision: yes

  3. Referee: [§3] the manuscript provides no analysis of how the per-block codebook quantization affects numerical stability or error accumulation across the many matrix-vector products performed during autoregressive decoding.

    Authors: We have added Section 3.4 with an empirical study: relative L2 error per GEMV remains below 0.8% after 100 tokens on 50 sampled generations, and a short analysis showing block-wise quantization limits accumulation because each GEMV operates on independent codebooks. This supports the 0.27% pass@1 preservation. revision: yes
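
As an editorial illustration of the kind of measurement described here (not the authors' Section 3.4 study), one can quantize a toy weight matrix block-wise and track the relative L2 error of the quantized path as the layer is applied repeatedly, mimicking error build-up across decoding steps. The dimensions, block size, and cluster count below are assumptions.

```python
import numpy as np
from scipy.cluster.vq import kmeans, vq

def quantize_blockwise(W, block=128, k=16):
    """Toy per-block K-Means codebook quantization (illustrative, not the paper's code)."""
    flat = W.reshape(-1).copy()
    for start in range(0, flat.size, block):
        blk = flat[start:start + block].reshape(-1, 1)
        codebook, _ = kmeans(blk, k)      # K scalar centroids for this block
        idx, _ = vq(blk, codebook)        # one low-bit index per weight
        flat[start:start + block] = codebook[idx, 0]
    return flat.reshape(W.shape)

rng = np.random.default_rng(0)
d = 512
W = rng.standard_normal((d, d)) / np.sqrt(d)   # toy layer weights
Wq = quantize_blockwise(W)

x_exact = rng.standard_normal(d)
x_quant = x_exact.copy()
for t in range(1, 101):
    # One stand-in "decoding step": apply the layer, then a bounded nonlinearity.
    x_exact = np.tanh(W @ x_exact)
    x_quant = np.tanh(Wq @ x_quant)
    if t in (1, 10, 50, 100):
        rel = np.linalg.norm(x_quant - x_exact) / np.linalg.norm(x_exact)
        print(f"step {t:3d}: relative L2 error {rel:.3%}")
```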

Circularity Check

0 steps flagged

No circularity: empirical framework with direct measurements

full rationale

The paper presents Ditto as an engineering combination of K-Means product quantization for weights and an LLVM compilation pass that replaces GEMV with BLAS calls. All reported outcomes (10.5× inference speedup, 6.4× memory reduction, 0.27% average pass@1 loss) are stated as measured results from running the compiled executables on Code Llama, MagicCoder, and OpenCodeInterpreter. No equations, first-principles derivations, or fitted parameters are introduced whose outputs are then relabeled as predictions. No self-citations appear in the provided text, and the quantization step is described as an adopted technique rather than derived from prior author work. The derivation chain is therefore self-contained implementation plus external benchmarking.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

The approach rests on standard assumptions that K-Means clustering produces useful codebooks for LLM weights and that BLAS libraries deliver reliable speedups on target hardware; no new entities are postulated.

free parameters (2)
  • codebook size K
    Number of clusters chosen for each per-block codebook in the quantization step
  • index bit-width
    Low-bitwidth chosen for storing the indices after clustering
axioms (2)
  • domain assumption: K-Means clustering on weight blocks produces codebooks that preserve model accuracy after index substitution
    Invoked by the quantization component described in the abstract
  • domain assumption: LLVM can reliably detect and replace unoptimized GEMV calls with BLAS library calls without changing semantics
    Invoked by the compilation pass component
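
The two free parameters trade off in a simple way: with one codebook of K centroids (say, 16 bits each) shared by a block of B weights, the cost per weight is ceil(log2 K) index bits plus 16·K/B bits of amortized codebook overhead. A back-of-the-envelope sketch with assumed values, not the paper's settings:

```python
import math

def bits_per_weight(k_clusters: int, block_size: int, centroid_bits: int = 16) -> float:
    """Storage per weight: index bits plus amortized codebook overhead."""
    index_bits = math.ceil(math.log2(k_clusters))
    return index_bits + centroid_bits * k_clusters / block_size

# Assumed example: 16 clusters (4-bit indices) shared across blocks of 4096 weights.
bpw = bits_per_weight(k_clusters=16, block_size=4096)
print(f"bits per weight:      {bpw:.3f}")        # ~4.06
print(f"compression vs FP16:  {16 / bpw:.1f}x")  # ~3.9x
print(f"compression vs FP32:  {32 / bpw:.1f}x")  # ~7.9x
```

Under these assumed values, weight storage shrinks by a factor of four to eight, the ballpark in which the up-to-6.4× end-to-end memory reduction reported above becomes plausible.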

pith-pipeline@v0.9.0 · 5658 in / 1483 out tokens · 31729 ms · 2026-05-13T23:22:11.738440+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

84 extracted references · 84 canonical work pages · 9 internal anchors

  1. [1]

    Saima Afrin, Bowen Xu, and Antonio Mastropaolo. 2025. Is Quantization a Deal-Breaker? Empirical Insights From Large Code Models. In 2025 IEEE International Conference on Software Maintenance and Evolution (ICSME). 1–13. doi:10.1109/ICSME64153.2025.00049

  2. [2]

    Toufique Ahmed and Premkumar Devanbu. 2023. Few-shot training LLMs for project-specific code-summarization. In Proceedings of the 37th IEEE/ACM International Conference on Automated Software Engineering(Rochester, MI, USA) (ASE ’22). Association for Computing Machinery, New York, NY, USA, Article 177, 5 pages

  3. [3]

    Toufique Ahmed, Kunal Suresh Pai, Premkumar Devanbu, and Earl Barr. 2024. Automatic Semantic Augmentation of Language Model Prompts (for Code Summarization). InProceedings of the IEEE/ACM 46th International Conference on Software Engineering(Lisbon, Portugal)(ICSE ’24). Association for Computing Machinery, New York, NY, USA, Article 220, 13 pages

  4. [4]

    Apple. 2025. Apple’s Accelerate framework. https://developer.apple.com/documentation/accelerate/blas/

  5. [5]

    Andrea Arcuri and Lionel Briand. 2011. A practical guide for using statistical tests to assess randomized algorithms in software engineering. InProceedings of the 33rd International Conference on Software Engineering(Waikiki, Honolulu, HI, USA)(ICSE ’11). Association for Computing Machinery, New York, NY, USA, 1–10

  6. [6]

    Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, et al. 2021. Program synthesis with large language models.arXiv preprint arXiv:2108.07732 (2021)

  7. [7]

    Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, and et al. 2023. Qwen Technical Report. arXiv:2309.16609 [cs.CL]

  8. [8]

    L Susan Blackford, Antoine Petitet, Roldan Pozo, Karin Remington, R Clint Whaley, James Demmel, Jack Dongarra, Iain Duff, Sven Hammarling, Greg Henry, et al. 2002. An updated set of basic linear algebra subprograms (BLAS).ACM Trans. Math. Software28, 2 (2002), 135–151

  9. [9]

    Lorenzo Chelini, Oleksandr Zinenko, Tobias Grosser, and Henk Corporaal. 2019. Declarative Loop Tactics for Domain-specific Optimization. ACM Trans. Archit. Code Optim. 16, 4, Article 55 (Dec. 2019), 25 pages

  10. [10]

    Junkai Chen, Xing Hu, Zhenhao Li, Cuiyun Gao, Xin Xia, and David Lo. 2024. Code Search is All You Need? Improving Code Suggestions with Code Search. InProceedings of the IEEE/ACM 46th International Conference on Software Engineering(Lisbon, Portugal)(ICSE ’24). Association for Computing Machinery, New York, NY, USA, Article 73, 13 pages

  11. [11]

    Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. 2021. Evaluating large language models trained on code.arXiv preprint arXiv:2107.03374(2021)

  12. [12]

    Kevin Clark, Minh-Thang Luong, Quoc V Le, and Christopher D Manning. 2020. Electra: Pre-training text encoders as discriminators rather than generators.arXiv preprint arXiv:2003.10555(2020)

  13. [13]

    Cursor. 2025. Cursor - The AI Code Editor. https://www.cursor.com

  14. [14]

    Marek Czachor and Jan Naudts. 2007. Regularization as quantization in reducible representations of CCR.International Journal of Theoretical Physics46, 1 (2007), 70–101

  15. [15]

    João P. L. De Carvalho, Braedy Kuzma, Ivan Korostelev, José Nelson Amaral, Christopher Barton, José Moreira, and Guido Araujo. 2021. KernelFaRer: Replacing Native-Code Idioms with High-Performance Library Calls.ACM Trans. Archit. Code Optim.18, 3, Article 38 (June 2021), 22 pages

  16. [16]

    Hugging Face. 2025. Hugging Face – The AI community building the future. https://huggingface.co

  17. [17]

    Hugging Face. 2026. GPTQ. https://huggingface.co/docs/transformers/en/quantization/gptq

  18. [18]

    Angela Fan, Beliz Gokkaya, Mark Harman, Mitya Lyubarskiy, Shubho Sengupta, Shin Yoo, and Jie M. Zhang. 2023. Large Language Models for Software Engineering: Survey and Open Problems. In2023 IEEE/ACM International Conference on Software Engineering: Future of Software Engineering (ICSE-FoSE). 31–53

  19. [19]

    Sen Fang, Weiyuan Ding, Antonio Mastropaolo, and Bowen Xu. 2025. Smaller = Weaker? Benchmarking Robustness of Quantized LLMs in Code Generation. arXiv:2506.22776 [cs.SE]

  20. [20]

    Elias Frantar, Saleh Ashkboos, Torsten Hoefler, and Dan Alistarh. 2023. GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers. arXiv:2210.17323 [cs.LG]

  21. [21]

    Prakhar Ganesh, Yao Chen, Xin Lou, Mohammad Ali Khan, Yin Yang, Hassan Sajjad, Preslav Nakov, Deming Chen, and Marianne Winslett. 2021. Compressing Large-Scale Transformer-Based Models: A Case Study on BERT.Transactions of the Association for Computational Linguistics9 (2021), 1061–1080. doi:10.1162/tacl_a_00413

  22. [22]

    Georgi Gerganov. 2023. GitHub - ggerganov/llama.cpp: LLM inference in C/C++. https://github.com/ggerganov/llama.cpp. [Accessed 22-01-2025]

  23. [23]

    GitHub. 2025. GitHub Copilot·Your AI pair programmer. https://github.com/features/copilot/

  24. [24]

    Md Nazmul Haque, Hua Yang, Zhou Yang, and Bowen Xu. 2025. How Quantization Impacts Privacy Risk on LLMs for Code? arXiv:2508.00128 [cs.SE]

  25. [25]

    Dávid Hidvégi, Khashayar Etemadi, Sofia Bobadilla, and Martin Monperrus. 2024. CigaR: Cost-efficient Program Repair with LLMs. arXiv:2402.06598 [cs.SE]

  26. [26]

    Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. 2015. Distilling the knowledge in a neural network.arXiv preprint arXiv:1503.02531(2015)

  27. [27]

    Xinyi Hou, Yanjie Zhao, Yue Liu, Zhou Yang, Kailong Wang, Li Li, Xiapu Luo, David Lo, John Grundy, and Haoyu Wang. 2024. Large Language Models for Software Engineering: A Systematic Literature Review.ACM Trans. Softw. Eng. Methodol.(Sept. 2024). Just Accepted

  28. [28]

    Xuchu Huang, Haonan Du, Min Zhou, Zheyu Yan, Cheng Zhuo, and Xunzhao Yin. 2025. VQT-CiM: Accelerating vector quantization enhanced transformer with ferroelectric compute-in-memory. In2025 62nd ACM/IEEE Design Automation Conference (DAC). IEEE, 1–7

  29. [29]

    Benedikt Huber and Andreas Krall. 2025. Pattern Matching, Transformation and Code Replacement on a Polyhedral Representation of Nested Loops. InProceedings of the 22nd ACM International Conference on Computing Frontiers (CF ’25). Association for Computing Machinery, New York, NY, USA, 176–184

  30. [30]

    Juyong Jiang, Fan Wang, Jiasi Shen, Sungju Kim, and Sunghun Kim. 2024. A Survey on Large Language Models for Code Generation. arXiv:2406.00515 [cs.CL]

  31. [31]

    Nan Jiang, Kevin Liu, Thibaud Lutellier, and Lin Tan. 2023. Impact of Code Language Models on Automated Program Repair. In2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE). 1430–1442

  32. [32]

    Matthew Jin, Syed Shahriar, Michele Tufano, Xin Shi, Shuai Lu, Neel Sundaresan, and Alexey Svyatkovskiy. 2023. InferFix: End-to-End Program Repair with LLMs. InProceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering(San Francisco, CA, USA)(ESEC/FSE 2023). Association for Computing...

  33. [33]

    Tom Jobbins. 2026. TheBloke (Tom Jobbins). https://huggingface.co/TheBloke

  34. [34]

    Andrej Karpathy. 2025. GitHub - karpathy/llama2.c: Inference Llama 2 in one file of pure C. https://github.com/karpathy/llama2.c.git

  35. [35]

    Andrej Karpathy. 2026. karpathy/tinyllamas·Hugging Face. https://huggingface.co/karpathy/tinyllamas

  36. [36]

    Ayush Kaushal, Tejas Vaidhya, and Irina Rish. 2023. LORD: Low Rank Decomposition Of Monolingual Code LLMs For One-Shot Compression. arXiv:2309.14021 [cs.CL]

  37. [37]

    Toufik Kechaoui, Mohamed Wassim Ouhab, Badis Djamaa, and Mustapha Reda Senouci. 2025. Locally-deployed Open-source LLMs for Code Generation: Promises and Challenges. In2025 7th International Conference on Pattern Analysis and Intelligent Systems (PAIS). 1–6. doi:10.1109/PAIS66004.2025.11126523

  38. [38]

    Seongho Kim, Jihyun Moon, Juntaek Oh, Insu Choi, and Joon-Sung Yang. 2025. Survey and Evaluation of Converging Architecture in LLMs based on Footsteps of Operations.IEEE Open Journal of the Computer Society(2025)

  39. [39]

    Jiliang Li, Yifan Zhang, Zachary Karas, Collin McMillan, Kevin Leach, and Yu Huang. 2024. Do Machines and Humans Focus on Similar Code? Exploring Explainability of Large Language Models in Code Summarization. InProceedings of the 32nd IEEE/ACM International Conference on Program Comprehension(Lisbon, Portugal)(ICPC ’24). Association for Computing Machiner...

  40. [40]

    Shuaiting Li, Chengxuan Wang, Juncan Deng, Zeyu Wang, Zewen Ye, Zongsheng Wang, Haibin Shen, and Kejie Huang

  41. [41]

    MVQ: Towards Efficient DNN Compression and Acceleration with Masked Vector Quantization. In Proceedings of the 30th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume...

  42. [42]

    Ji Lin, Jiaming Tang, Haotian Tang, Shang Yang, Wei-Ming Chen, Wei-Chen Wang, Guangxuan Xiao, Xingyu Dang, Chuang Gan, and Song Han. 2024. AWQ: Activation-aware Weight Quantization for On-Device LLM Compression and Acceleration. InProceedings of Machine Learning and Systems, P. Gibbons, G. Pekhimenko, and C. De Sa (Eds.), Vol. 6. 87–100

  43. [43]

    Jiawei Liu, Chunqiu Steven Xia, Yuyao Wang, and Lingming Zhang. 2023. Is Your Code Generated by ChatGPT Really Correct? Rigorous Evaluation of Large Language Models for Code Generation. InThirty-seventh Conference on Neural Information Processing Systems

  44. [44]

    Zihan Liu, Xinhao Luo, Junxian Guo, Wentao Ni, Yangjie Zhou, Yue Guan, Cong Guo, Weihao Cui, Yu Feng, Minyi Guo, Yuhao Zhu, Minjia Zhang, Chen Jin, and Jingwen Leng. 2025. VQ-LLM: High-performance Code Generation for ...

  45. [45]

    S. Lloyd. 2006. Least squares quantization in PCM. IEEE Trans. Inf. Theor. 28, 2 (Sept. 2006), 129–137. doi:10.1109/TIT.1982.1056489

  46. [46]

    David Lo. 2023. Trustworthy and Synergistic Artificial Intelligence for Software Engineering: Vision and Roadmaps. In 2023 IEEE/ACM International Conference on Software Engineering: Future of Software Engineering (ICSE-FoSE). 69–85

  47. [47]

    Yunbo Lyu, Zhou Yang, Jieke Shi, Jianming Chang, Yue Liu, and David Lo. 2025. "My productivity is boosted, but ... " Demystifying Users’ Perception on AI Coding Assistants. In Proceedings of the 40th IEEE/ACM International Conference on Automated Software Engineering (ASE ’25). Association for Computing Machinery, New York, NY, USA

  48. [48]

    Henry B Mann and Donald R Whitney. 1947. On a test of whether one of two random variables is stochastically larger than the other.The annals of mathematical statistics(1947), 50–60

  49. [49]

    Yusuke Matsui, Yusuke Uchida, Hervé Jégou, and Shin’ichi Satoh. 2018. A survey of product quantization.ITE Transactions on Media Technology and Applications6, 1 (2018), 2–10

  50. [50]

    Bonan Min, Hayley Ross, Elior Sulem, Amir Pouran Ben Veyseh, Thien Huu Nguyen, Oscar Sainz, Eneko Agirre, Ilana Heintz, and Dan Roth. 2023. Recent Advances in Natural Language Processing via Large Pre-trained Language Models: A Survey.ACM Comput. Surv.56, 2, Article 30 (Sept. 2023), 40 pages

  51. [51]

    Zhiwen Mo, Lei Wang, Jianyu Wei, Zhichen Zeng, Shijie Cao, Lingxiao Ma, Naifeng Jing, Ting Cao, Jilong Xue, Fan Yang, et al. 2025. LUT Tensor Core: A Software-Hardware Co-Design for LUT-Based Low-Bit LLM Inference. In Proceedings of the 52nd Annual International Symposium on Computer Architecture. 514–528

  52. [52]

    Gunho Park, Baeseong Park, Minsub Kim, Sungjae Lee, Jeonghoon Kim, Beomseok Kwon, Se Jung Kwon, Byeongwook Kim, Youngjoo Lee, and Dongsoo Lee. 2024. LUT-GEMM: Quantized Matrix Multiplication Based on LUTs for Efficient Inference in Large-Scale Generative Language Models. In 2024 International Conference on Learning Representations, ICLR 2024. International...

  53. [53]

    Moumita Das Purba, Arpita Ghosh, Benjamin J. Radford, and Bill Chu. 2023. Software Vulnerability Detection using Large Language Models. In 2023 IEEE 34th International Symposium on Software Reliability Engineering Workshops (ISSREW). 112–119

  54. [54]

    Qwen. 2026. Qwen/CodeQwen1.5-7B. https://huggingface.co/Qwen/CodeQwen1.5-7B

  55. [55]

    Shanto Rahman, Abdelrahman Baz, Sasa Misailovic, and August Shi. 2024. Quantizing Large-Language Models for Predicting Flaky Tests . In2024 IEEE Conference on Software Testing, Verification and Validation (ICST). IEEE Computer Society, Los Alamitos, CA, USA, 93–104

  56. [56]

    Babak Rokh, Ali Azarpeyvand, and Alireza Khanteymoori. 2023. A comprehensive survey on model quantization for deep neural networks in image classification.ACM Transactions on Intelligent Systems and Technology14, 6 (2023), 1–50

  57. [57]

    Baptiste Rozière, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi, Jingyu Liu, Romain Sauvestre, Tal Remez, Jérémy Rapin, Artyom Kozhevnikov, Ivan Evtimov, Joanna Bitton, Manish Bhatt, Cristian Canton Ferrer, Aaron Grattafiori, Wenhan Xiong, Alexandre Défossez, Jade Copet, Faisal Azhar, Hugo Touvron, Louis Martin, Nico...

  58. [58]

    Agnia Sergeyuk, Yaroslav Golubev, Timofey Bryksin, and Iftekhar Ahmed. 2025. Using AI-based coding assistants in practice: State of affairs, perceptions, and ways forward.Information and Software Technology178 (2025), 107610

  59. [59]

    Jieke Shi, Zhou Yang, Hong Jin Kang, Bowen Xu, Junda He, and David Lo. 2024. Greening Large Language Models of Code. InProceedings of the 46th International Conference on Software Engineering: Software Engineering in Society (Lisbon, Portugal)(ICSE-SEIS’24). Association for Computing Machinery, New York, NY, USA, 142–153

  60. [60]

    Jieke Shi, Zhou Yang, and David Lo. 2025. Efficient and Green Large Language Models for Software Engineering: Literature Review, Vision, and the Road Ahead.ACM Trans. Softw. Eng. Methodol.34, 5, Article 137 (May 2025), 22 pages

  61. [61]

    Jieke Shi, Zhou Yang, Bowen Xu, Hong Jin Kang, and David Lo. 2023. Compressing Pre-trained Models of Code into 3 MB. InProceedings of the 37th IEEE/ACM International Conference on Automated Software Engineering(Rochester, MI, USA)(ASE ’22). Association for Computing Machinery, New York, NY, USA, Article 24, 12 pages

  62. [62]

    Chia-Yi Su and Collin McMillan. 2024. Distilled GPT for source code summarization.Automated Software Engineering 31, 1 (2024), 22

  63. [63]

    Zhensu Sun, Xiaoning Du, Zhou Yang, Li Li, and David Lo. 2024. AI Coders Are among Us: Rethinking Programming Language Grammar towards Efficient Code Generation. In Proceedings of the 33rd ACM SIGSOFT International Symposium on Software Testing and Analysis (Vienna, Austria) (ISSTA 2024). Association for Computing Machinery, New York, NY, USA, 1124–1136

  64. [64]

    Chandra Thapa, Seung Ick Jang, Muhammad Ejaz Ahmed, Seyit Camtepe, Josef Pieprzyk, and Surya Nepal. 2022. Transformer-Based Language Models for Software Vulnerability Detection. In Proceedings of the 38th Annual Computer Security Applications Conference...

  65. [65]

    Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, and et al. 2023. Llama 2: Open Foundation and Fine-Tuned Chat Models. arXiv:2307.09288 [cs.CL]

  66. [66]

    Théophane Vallaeys, Matthew Muckley, Jakob Verbeek, and Matthijs Douze. 2025. Qinco2: Vector Compression and Search with Improved Implicit Neural Codebooks. arXiv:2501.03078 [cs.LG]

  67. [67]

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is All you Need. In Advances in Neural Information Processing Systems, I. Guyon, U. Von Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (Eds.), Vol. 30. Curran Associates, Inc

  68. [68]

    Junjie Wang, Yuchao Huang, Chunyang Chen, Zhe Liu, Song Wang, and Qing Wang. 2024. Software Testing With Large Language Models: Survey, Landscape, and Vision.IEEE Transactions on Software Engineering50, 4 (2024), 911–936

  69. [69]

    Jianyu Wei, Shijie Cao, Ting Cao, Lingxiao Ma, Lei Wang, Yanyong Zhang, and Mao Yang. 2025. T-mac: Cpu renaissance via table lookup for low-bit llm deployment on edge. InProceedings of the Twentieth European Conference on Computer Systems. 278–292

  70. [70]

    Xiaokai Wei, Sujan Kumar Gonugondla, Shiqi Wang, Wasi Ahmad, Baishakhi Ray, Haifeng Qian, Xiaopeng Li, Varun Kumar, Zijian Wang, Yuchen Tian, Qing Sun, Ben Athiwaratkun, Mingyue Shang, Murali Krishna Ramanathan, Parminder Bhatia, and Bing Xiang. 2023. Towards Greener Yet Powerful Code Generation via Quantization: An Empirical Study. InProceedings of the 3...

  71. [71]

    Yuxiang Wei, Zhe Wang, Jiawei Liu, Yifeng Ding, and Lingming Zhang. 2024. Magicoder: empowering code generation with OSS-INSTRUCT. InProceedings of the 41st International Conference on Machine Learning. 52632–52657

  72. [72]

    Chunqiu Steven Xia, Yuxiang Wei, and Lingming Zhang. 2023. Automated Program Repair in the Era of Large Pre-trained Language Models. In 2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE). 1482–1494

  73. [73]

    Ceyu Xu, Yongji Wu, Xinyu Yang, Beidi Chen, Matthew Lentz, Danyang Zhuo, and Lisa Wu Wills. 2024. VcLLM: Video Codecs are Secretly Tensor Codecs. arXiv:2407.00467 [cs.LG]

  74. [74]

    Frank F. Xu, Uri Alon, Graham Neubig, and Vincent Josua Hellendoorn. 2022. A systematic evaluation of large language models of code. In Proceedings of the 6th ACM SIGPLAN International Symposium on Machine Programming (San Diego, CA, USA) (MAPS 2022). Association for Computing Machinery, New York, NY, USA, 1–10

  75. [75]

    Chao Zeng, Songwei Liu, Yusheng Xie, Hong Liu, Xiaojian Wang, Miao Wei, Shu Yang, Fangmin Chen, and Xing Mei

  76. [76]

    ABQ-LLM: Arbitrary-bit quantized inference acceleration for large language models. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 39. 22299–22307

  77. [77]

    Quanjun Zhang, Chunrong Fang, Yang Xie, Yaxin Zhang, Yun Yang, Weisong Sun, Shengcheng Yu, and Zhenyu Chen

  78. [78]

    A Survey on Large Language Models for Software Engineering. arXiv:2312.15223 [cs.SE]

  79. [79]

    Zhaowei Zhang, Hongyu Zhang, Beijun Shen, and Xiaodong Gu. 2022. Diet Code is Healthy: Simplifying Programs for Pre-Trained Models of Code. InProceedings of the 30th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering(Singapore, Singapore)(ESEC/FSE 2022). Association for Computing Machinery, New York...

  80. [80]

    Wayne Xin Zhao, Kun Zhou, Junyi Li, Tianyi Tang, Xiaolei Wang, and Yupeng Hou et al. 2024. A Survey of Large Language Models. arXiv:2303.18223 [cs.CL]

Showing first 80 references.