pith. machine review for the scientific record.

arxiv: 2604.09952 · v1 · submitted 2026-04-10 · 💻 cs.LG

Recognition: unknown

SLM Finetuning for Natural Language to Domain Specific Code Generation in Production

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 16:33 UTC · model grok-4.3

classification 💻 cs.LG
keywords: small language models · fine-tuning · natural language to code · domain specific language · production deployment · latency optimization · model customization

The pith

Fine-tuning small language models on natural language to domain-specific code pairs improves performance and latency over larger models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper evaluates fine-tuning small language models for turning natural language into domain-specific code. It finds that these tuned models deliver better accuracy and faster responses than larger language models on held-out test data. The same models can receive extra fine-tuning for particular customer needs while retaining their broad capabilities. Load testing and live production deployment confirm the gains hold under real conditions. This matters for systems that must generate code quickly without the resource demands of very large models.

Core claim

Fine-tuning variants of Mistral and other small language models on a dataset of natural language to domain-specific code pairs produces models that achieve improved performance and lower latency on test datasets compared to larger models. These fine-tuned models can be further tuned for customer-specific scenarios without degrading general performance, and load testing followed by production deployment verified optimal latency and quality.

What carries the argument

Fine-tuning small language models on pairs of natural language queries and matching domain-specific code outputs to embed task knowledge directly into the model weights.
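
To make the mechanism concrete, the sketch below shows the kind of parameter-efficient supervised fine-tuning the paper describes, assuming a JSONL file of {"query", "code"} pairs, a Mistral-7B base, and LoRA adapters. The file path, prompt template, and hyperparameters are illustrative assumptions, not the authors' actual configuration.

```python
# Minimal LoRA fine-tuning sketch for natural language -> DSL pairs.
# File path, prompt template, and hyperparameters are illustrative, not the paper's exact setup.
import json
import torch
from torch.utils.data import DataLoader
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

BASE = "mistralai/Mistral-7B-v0.1"      # assumed base model
PAIRS = "nl_dsl_pairs.jsonl"            # hypothetical file: one {"query": ..., "code": ...} per line

tok = AutoTokenizer.from_pretrained(BASE)
tok.pad_token = tok.eos_token
model = AutoModelForCausalLM.from_pretrained(BASE, torch_dtype=torch.bfloat16, device_map="auto")

# Attach low-rank adapters so only a small fraction of the weights is trained.
model = get_peft_model(model, LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM"))
device = next(model.parameters()).device

def encode(example):
    # Single-turn template; labels cover prompt + code for simplicity (a real setup may mask the prompt).
    text = f"### Query:\n{example['query']}\n### Code:\n{example['code']}{tok.eos_token}"
    return tok(text, truncation=True, max_length=1024)

examples = [encode(json.loads(line)) for line in open(PAIRS)]

def collate(batch):
    padded = tok.pad(batch, return_tensors="pt")
    padded["labels"] = padded["input_ids"].clone()
    return padded

loader = DataLoader(examples, batch_size=4, shuffle=True, collate_fn=collate)
opt = torch.optim.AdamW([p for p in model.parameters() if p.requires_grad], lr=2e-4)

model.train()
for epoch in range(3):
    for batch in loader:
        batch = {k: v.to(device) for k, v in batch.items()}
        loss = model(**batch).loss          # causal LM loss over the NL query and its DSL output
        loss.backward()
        opt.step()
        opt.zero_grad()

model.save_pretrained("slm-dsl-lora")       # writes adapter weights only
```

The adapter setup matters for the production story: only a few million weights are trained and stored per variant, which is what makes further customer-specific tuning and cheap serving plausible.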

If this is right

  • Fine-tuned small models achieve improved performance and lower latency on test datasets compared to larger models (a measurement sketch follows this list).
  • The trained model can be further fine-tuned for customer-specific scenarios without degrading general performance.
  • Load testing and production deployment confirm optimal performance in terms of latency and quality.
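
A measurement harness for the first bullet could look like the sketch below: both models are assumed to sit behind OpenAI-compatible completion endpoints, and quality is scored by exact match against a held-out JSONL test set. The endpoint URLs, response schema, and exact-match criterion are assumptions; the paper does not specify its metrics.

```python
# Sketch: compare latency and exact-match quality of the fine-tuned SLM vs a larger baseline
# on a held-out test set. Endpoints, model hosts, and file paths are illustrative assumptions.
import json
import time
import requests

ENDPOINTS = {
    "slm-finetuned": "http://slm-host:8000/v1/completions",
    "llm-baseline": "http://llm-host:8000/v1/completions",
}
TEST_SET = [json.loads(line) for line in open("test_pairs.jsonl")]  # {"query": ..., "code": ...}

def generate(url, prompt):
    start = time.perf_counter()
    resp = requests.post(url, json={"prompt": prompt, "max_tokens": 256, "temperature": 0}, timeout=60)
    resp.raise_for_status()
    latency = time.perf_counter() - start
    return resp.json()["choices"][0]["text"], latency   # assumes an OpenAI-style completions schema

for name, url in ENDPOINTS.items():
    latencies, hits = [], 0
    for ex in TEST_SET:
        out, dt = generate(url, f"### Query:\n{ex['query']}\n### Code:\n")
        latencies.append(dt)
        hits += int(out.strip() == ex["code"].strip())   # exact match; a DSL-aware check may be fairer
    latencies.sort()
    p50 = latencies[len(latencies) // 2]
    p95 = latencies[int(0.95 * (len(latencies) - 1))]
    print(f"{name}: exact-match={hits / len(TEST_SET):.3f}  p50={p50:.2f}s  p95={p95:.2f}s")
```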

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Task-specific fine-tuning may allow production systems to drop complex retrieval pipelines that were previously needed to supply domain context at runtime.
  • The same tuning process could transfer to other latency-sensitive generation tasks that currently rely on large models.
  • Operational costs could drop because smaller models require less compute per inference while matching or exceeding larger-model quality on the target domain.

Load-bearing premise

The dataset of natural language to domain-specific code pairs used for fine-tuning is representative of real production queries.
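
The paper does not report such a check, but one way to probe this premise is to embed the fine-tuning queries and a sample of production queries and ask how often a production query has a close neighbour in the training set. The embedding model, file names, and the 0.7 similarity threshold below are assumptions.

```python
# Sketch: check how well the fine-tuning queries cover a sample of real production queries.
# Embedding model, file names, and the 0.7 threshold are illustrative assumptions.
import json
import numpy as np
from sentence_transformers import SentenceTransformer

train_queries = [json.loads(line)["query"] for line in open("nl_dsl_pairs.jsonl")]
prod_queries = [json.loads(line)["query"] for line in open("prod_query_sample.jsonl")]

encoder = SentenceTransformer("all-MiniLM-L6-v2")
train_emb = encoder.encode(train_queries, normalize_embeddings=True)
prod_emb = encoder.encode(prod_queries, normalize_embeddings=True)

# For each production query, cosine similarity to its nearest fine-tuning example.
nearest = (prod_emb @ train_emb.T).max(axis=1)
print(f"median nearest-neighbour similarity: {np.median(nearest):.3f}")
print(f"queries with a close training example (sim >= 0.7): {(nearest >= 0.7).mean():.1%}")
```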

What would settle it

Deploying the fine-tuned small model on live production traffic and checking whether its error rates, hallucination frequency, or latency exceed those of the larger baseline model under comparable load.
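
On the load side, Locust (which the paper cites for load testing) is a natural way to drive the deployed endpoint with concurrent synthetic traffic and record latency and failure rates. The host, route, payload shape, and sample queries below are assumptions about the serving API, not the authors' setup.

```python
# locustfile.py: minimal load-test sketch against the deployed code-generation endpoint.
# Host, route, payload shape, and sample queries are assumptions about the serving API.
# Run with e.g.: locust -f locustfile.py --host http://slm-host:8000
import random
from locust import HttpUser, task, between

SAMPLE_QUERIES = [
    "sum revenue by region for the last quarter",
    "filter devices with firmware older than 2.3",
    "alert when the error rate exceeds 5 percent",
]

class CodeGenUser(HttpUser):
    wait_time = between(0.5, 2.0)   # simulated think time between requests per user

    @task
    def generate_dsl(self):
        query = random.choice(SAMPLE_QUERIES)
        # Locust records latency and failure counts per request name automatically.
        self.client.post(
            "/v1/completions",
            json={"prompt": f"### Query:\n{query}\n### Code:\n", "max_tokens": 256},
            name="generate_dsl",
        )
```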

Figures

Figures reproduced from arXiv: 2604.09952 by Chhaya Methani (Microsoft), Damian K. Kowalczyk (Microsoft), Marco Gaudesi (Microsoft), Renjini R. Nair (Microsoft).

Figure 1: Overview of the experimental design ahead of production. Prior work has explored fine-tuning language models for structured code and DSL generation (e.g., text-to-SQL and program synthesis), as well as parameter-efficient adaptation techniques such as LoRA. Our work does not introduce a new fine-tuning method; rather, it provides a production case study comparing fine-tuned Small Language Models with LLM-b…
read the original abstract

Many applications today use large language models for code generation; however, production systems have strict latency requirements that can be difficult to meet with large models. Small language models with a few billion parameters are resource efficient but may suffer from limited reasoning, hallucinations, or poor retention of longer context. Fine tuning improves task specific accuracy by embedding domain knowledge directly into model weights, reducing reliance on runtime context. We previously implemented a baseline natural language to code generation approach using a retrieval augmented generation pipeline that dynamically selected few shot examples to embed domain specific language context for a large language model. In this study, we evaluate small language models for generating domain specific language from natural language by fine tuning variants of Mistral and other models on a dataset of natural language code pairs. Our results show that the fine-tuned models achieve improved performance and latency on test datasets compared to larger models. We also demonstrate that the trained model can be further fine-tuned for customer specific scenarios without degrading general performance, helping resolve production issues. Load testing followed by production deployment confirmed optimal performance in terms of latency and quality. These findings demonstrate that task specific fine tuning with small language models provides an efficient, faster, and cost-effective alternative to large language models for domain specific language generation.
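
The baseline described in the abstract selects few-shot examples dynamically at inference time to supply DSL context to a large model. Below is a minimal sketch of that style of retrieval-augmented prompt construction, assuming a pool of NL-to-DSL exemplars and a sentence-embedding retriever; the retriever, exemplar file, and prompt format are assumptions, not the authors' pipeline.

```python
# Sketch of a dynamic few-shot baseline: retrieve the k most similar NL -> DSL exemplars
# and pack them into a prompt for a large model. Exemplar file, retriever, and prompt
# format are illustrative assumptions.
import json
import numpy as np
from sentence_transformers import SentenceTransformer

exemplars = [json.loads(line) for line in open("nl_dsl_pairs.jsonl")]   # {"query": ..., "code": ...}
encoder = SentenceTransformer("all-MiniLM-L6-v2")
exemplar_emb = encoder.encode([e["query"] for e in exemplars], normalize_embeddings=True)

def build_prompt(user_query: str, k: int = 4) -> str:
    q_emb = encoder.encode([user_query], normalize_embeddings=True)[0]
    top = np.argsort(exemplar_emb @ q_emb)[-k:][::-1]                   # indices of the k nearest exemplars
    shots = "\n\n".join(
        f"### Query:\n{exemplars[i]['query']}\n### Code:\n{exemplars[i]['code']}" for i in top
    )
    return f"{shots}\n\n### Query:\n{user_query}\n### Code:\n"

print(build_prompt("show average latency per customer over the past week"))
```

The contrast with the fine-tuned SLM is that this pipeline pays a retrieval step and a much longer prompt at every request, which is exactly the latency and cost the paper argues fine-tuning removes.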

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it: the pith above is the substance; this is the friction.

Referee Report

2 major / 0 minor

Summary. The manuscript claims that fine-tuning small language models (variants of Mistral and similar) on natural language to domain-specific code pairs yields improved task performance and lower latency than larger models on test datasets. It further asserts that the resulting models can undergo additional customer-specific fine-tuning without degrading general performance, with load testing and production deployment confirming suitability for real-world use as an efficient alternative to RAG-based LLM pipelines.

Significance. If the empirical results were rigorously quantified with proper baselines, metrics, and generalization checks, the work would demonstrate a practical, deployable approach for latency-sensitive domain-specific code generation using SLMs, potentially reducing costs and inference times in production systems while preserving adaptability.

major comments (2)
  1. [Abstract and Results] Abstract and results presentation: the central claims of 'improved performance and latency' relative to larger models, plus 'without degrading general performance' after customer fine-tuning, are stated without any quantitative metrics, baseline comparisons, statistical tests, data-split details, or evaluation protocols. This directly undermines verification of the production-deployment conclusion. (A paired-bootstrap sketch of one such statistical test follows this list.)
  2. [Evaluation and Deployment] Evaluation and deployment sections: no out-of-distribution tests, edge-case analysis, or non-domain task checks are reported to support the assumption that test-set gains will hold under live production load without hidden degradation. This is load-bearing for the claim that load testing confirmed optimal performance.
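
On the statistical-testing point in the first comment, a standard choice when two models are scored on the same test items is a paired bootstrap over per-example correctness. The score arrays below are placeholders for real per-example results; only the resampling procedure is the point.

```python
# Sketch: paired bootstrap for the accuracy difference between two models scored on the same
# test items. The score arrays are placeholders; only the resampling procedure is the point.
import numpy as np

rng = np.random.default_rng(0)
slm_correct = np.array([1, 1, 0, 1, 1, 0, 1, 1, 1, 0] * 50)   # placeholder per-example exact-match flags
llm_correct = np.array([1, 0, 0, 1, 1, 0, 1, 0, 1, 0] * 50)

observed = slm_correct.mean() - llm_correct.mean()
n, boots = len(slm_correct), 10_000
diffs = np.empty(boots)
for b in range(boots):
    idx = rng.integers(0, n, size=n)            # resample test items with replacement, keeping pairs
    diffs[b] = slm_correct[idx].mean() - llm_correct[idx].mean()

lo, hi = np.percentile(diffs, [2.5, 97.5])
p_approx = 2 * min((diffs <= 0).mean(), (diffs >= 0).mean())
print(f"accuracy delta {observed:+.3f}, 95% CI [{lo:+.3f}, {hi:+.3f}], bootstrap p ~ {p_approx:.4f}")
```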

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on strengthening the empirical presentation of our work. We address each major comment below and have revised the manuscript to provide the requested quantitative details, baselines, and additional evaluations.

read point-by-point responses
  1. Referee: [Abstract and Results] Abstract and results presentation: the central claims of 'improved performance and latency' relative to larger models, plus 'without degrading general performance' after customer fine-tuning, are stated without any quantitative metrics, baseline comparisons, statistical tests, data-split details, or evaluation protocols. This directly undermines verification of the production-deployment conclusion.

    Authors: We agree that the original abstract and results sections presented the claims at a high level without the supporting quantitative details, baselines, statistical tests, data splits, or protocol descriptions needed for full verification. In the revised manuscript we have expanded both the abstract and results section to include the specific performance and latency metrics from our experiments, direct comparisons against larger models and the prior RAG baseline, statistical significance testing, explicit train/test split ratios, and a complete description of the evaluation protocol. These additions directly substantiate the production-deployment conclusions. revision: yes

  2. Referee: [Evaluation and Deployment] Evaluation and deployment sections: no out-of-distribution tests, edge-case analysis, or non-domain task checks are reported to support the assumption that test-set gains will hold under live production load without hidden degradation. This is load-bearing for the claim that load testing confirmed optimal performance.

    Authors: We acknowledge that the original manuscript did not report explicit out-of-distribution tests, edge-case analysis, or non-domain task checks. In the revised version we have added a dedicated subsection on generalization and robustness that includes OOD evaluation on unseen domain queries, analysis of edge cases (ambiguous inputs, longer contexts), and verification that general capabilities are preserved on non-domain tasks. The load-testing section has also been expanded with detailed metrics under production-like loads to confirm that test-set gains translate without hidden degradation. revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical claims are self-contained

full rationale

The paper reports empirical results from fine-tuning small language models on natural language to domain-specific code pairs, with direct comparisons of task performance, latency, and further customer-specific fine-tuning against larger models and prior RAG baselines. No mathematical derivations, equations, fitted parameters renamed as predictions, or self-referential definitions exist that would reduce any claimed outcome to its own inputs by construction. Self-references to prior work are limited to contextual setup and do not bear the load of the reported improvements, which rest on independent test-set evaluations and production load testing.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an applied empirical study with no explicit axioms, free parameters, or invented entities described in the abstract. The central claim depends on the assumption that fine-tuning embeds domain knowledge effectively and that test performance predicts production behavior.

pith-pipeline@v0.9.0 · 5538 in / 1104 out tokens · 51707 ms · 2026-05-10T16:33:14.781145+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

34 extracted references · 24 canonical work pages · 12 internal anchors

  1. [1]

    Program Synthesis with Large Language Models

    Austin, J., Odena, A., Nye, M., et al. Program synthesis with large language models. arXiv preprint arXiv:2108.07732, 2021

  2. [2]

    Case Study: Fine-Tuning Small Language Models for Accurate and Private CWE Detection in Python Code

    Bappy, M. A. H., Mustafa, H. A., Saha, P., and Salehat, R. Case study: Fine-tuning small language models for accurate and private CWE detection in Python code. arXiv preprint arXiv:2504.16584, 2025. URL https://arxiv.org/abs/2504.16584

  3. [3]

    A Comparative Study of DSL Code Generation: Fine-Tuning vs. Optimized Retrieval Augmentation

    Bassamzadeh, N. and Methani, C. A comparative study of dsl code generation: Fine-tuning vs. optimized retrieval augmentation, 2024. URL https://arxiv.org/abs/2407.02742

  4. [4]

    Enhancing the reasoning capabilities of small language models via solution guidance fine-tuning

    Bi, J., Wu, Y., Xing, W., and Wei, Z. Enhancing the reasoning capabilities of small language models via solution guidance fine-tuning. arXiv preprint arXiv:2412.09906, 2024. URL https://arxiv.org/abs/2412.09906

  5. [5]

    Language models are few-shot learners

    Brown, T., Mann, B., Ryder, N., et al. Language models are few-shot learners. Advances in neural information processing systems, 33, 2020

  6. [6]

    A Survey on Privacy Risks and Protection in Large Language Models

    Chen, K., Zhou, X., Lin, Y., Feng, S., Shen, L., and Wu, P. A survey on privacy risks and protection in large language models, 2025. URL https://arxiv.org/abs/2505.01976

  7. [7]

    Punica: Multi-Tenant LoRA Serving

    Chen, L., Ye, Z., Wu, Y., Zhuo, D., Ceze, L., and Krishnamurthy, A. Punica: Multi-tenant lora serving, 2023. URL https://arxiv.org/abs/2310.18547

  8. [8]

    Evaluating Large Language Models Trained on Code

    Chen, M., Tworek, J., Jun, H., et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021

  9. [9]

    Reinforcement learning for text-to-sql generation with a relevance-based reward

    Chen, Y., Jiang, Z., Chen, W., Liu, X., and Gao, J. Reinforcement learning for text-to-sql generation with a relevance-based reward. In ACL, 2020

  10. [10]

    PaLM: Scaling Language Modeling with Pathways

    Chowdhery, A., Narang, S., Devlin, J., et al. Palm: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311, 2022

  11. [11]

    QLoRA: Efficient Finetuning of Quantized LLMs

    Dettmers, T., Pagnoni, A., Holtzman, A., et al. Qlora: Efficient finetuning of quantized llms. arXiv preprint arXiv:2305.14314, 2023

  12. [12]

    LoRA: Low-Rank Adaptation of Large Language Models

    Hu, E. J., Shen, Y., Wallis, P., et al. Lora: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685, 2021

  13. [13]

    Mistral 7B

    Jiang, A. Q., Sablayrolles, A., Mensch, A., Bamford, C., Chaplot, D. S., de las Casas, D., Bressand, F., Lengyel, G., Lample, G., Saulnier, L., Lavaud, L. R., Lachaux, M.-A., Stock, P., Le Scao, T., Lavril, T., Wang, T., Lacroix, T., and El Sayed, W. Mistral 7b. arXiv preprint arXiv:2310.06825, 2023. URL https://arxiv.org/abs/2310.06825

  14. [14]

    Efficient Memory Management for Large Language Model Serving with PagedAttention

    Kwon, W., Li, Z., Zhuang, S., Sheng, Y., Zheng, L., Yu, C. H., Gonzalez, J. E., Zhang, H., and Stoica, I. Efficient memory management for large language model serving with pagedattention, 2023. URL https://arxiv.org/abs/2309.06180

  15. [15]

    Retrieval-augmented generation for knowledge-intensive nlp tasks

    Lewis, P., Perez, E., Piktus, A., et al. Retrieval-augmented generation for knowledge-intensive nlp tasks. In Advances in Neural Information Processing Systems, volume 33, 2020

  16. [17]

    Dynasp: Dynamic schema prompting for table-based text-to-sql generation

    Li, Z., Zhang, Y., Guo, Y., and Liu, J. Dynasp: Dynamic schema prompting for table-based text-to-sql generation. In ACL, 2023b

  17. [18]

    Prompt engineering techniques for nlp tasks

    Liu, P., Yuan, W., Fu, J., et al. Prompt engineering techniques for nlp tasks. arXiv preprint arXiv:2302.00363, 2023

  18. [19]

    Locust: A modern load testing framework

    Locust Developers. Locust: A modern load testing framework. https://locust.io, 2025. Accessed: 2025-05-14

  19. [20]

    Rethinking the Role of Demonstrations: What Makes In-Context Learning Work?

    Min, S., Lewis, M., Hajishirzi, H., and Zettlemoyer, L. Rethinking the role of demonstrations: What makes in-context learning work? In Conference on Empirical Methods in Natural Language Processing (EMNLP), 2022. URL https://arxiv.org/abs/2202.12837

  20. [21]

    Calibrated language models must hallucinate

    Min, S., Holtzman, A., and Hajishirzi, H. Calibrated language models must hallucinate. arXiv preprint arXiv:2311.14648, 2023. URL https://arxiv.org/abs/2311.14648

  21. [22]

    Introducing mistral nemo

    Mistral AI and NVIDIA. Introducing Mistral NeMo. https://mistral.ai/news/mistral-nemo, 2024. Accessed: 2025-05-14

  22. [23]

    OpenAI Codex

    OpenAI. OpenAI Codex. https://platform.openai.com/docs/models/codex, 2021. Accessed: 2025-05-14

  23. [24]

    GPT-4o System Card

    OpenAI. Gpt-4o system card. arXiv preprint arXiv:2410.21276, 2024. URL https://arxiv.org/abs/2410.21276

  24. [25]

    Comprehensive review of load testing tools

    Patel, N., Patel, R., and Patel, D. Comprehensive review of load testing tools. International Research Journal of Engineering and Technology (IRJET), 7(5): 651–655, 2020. URL https://www.irjet.net/archives/V7/i5/IRJET-V7I5651.pdf

  25. [26]

    Deepspeed: System optimizations enable training deep learning models with over 100 billion parameters

    Rasley, J., Rajbhandari, S., Ruwase, O., and He, Y. Deepspeed: System optimizations enable training deep learning models with over 100 billion parameters. arXiv preprint arXiv:2007.01868, 2020

  26. [27]

    Phi-2: The surprising power of small language models

    Microsoft Research. Phi-2: The surprising power of small language models. https://www.microsoft.com/en-us/research/blog/phi-2, 2023. Accessed: 2025-05-14

  27. [28]

    Beyond Memorization: Violating Privacy via Inference with Large Language Models

    Staab, R., Vero, M., Balunović, M., and Vechev, M. Beyond memorization: Violating privacy via inference with large language models, 2024. URL https://arxiv.org/abs/2310.07298

  28. [29]

    Small language models (slms) can still pack a punch: A survey, 2025

    Subramanian, S., Elango, V., and Gungor, M. Small language models (slms) can still pack a punch: A survey, 2025. URL https://arxiv.org/abs/2501.05465

  29. [30]

    Exploring the Knowledge Mismatch Hypothesis: Hallucination Propensity in Small Models Fine-Tuned on Data from Larger Models

    Wee, P. and Baghdadi, R. Exploring the knowledge mismatch hypothesis: Hallucination propensity in small models fine-tuned on data from larger models. arXiv preprint arXiv:2411.00878, 2024. URL https://arxiv.org/abs/2411.00878

  30. [31]

    Textbooks Are All You Need II: phi-1.5 technical report

    Xu, C., Wu, S., Wang, Z., et al. Small language models are also few-shot learners. arXiv preprint arXiv:2309.05463, 2023

  31. [32]

    TRANX: A Transition-Based Neural Abstract Syntax Parser for Semantic Parsing and Code Generation

    Yin, P. and Neubig, G. Tranx: A transition-based neural abstract syntax parser for semantic parsing and code generation. In Proceedings of EMNLP, 2018

  32. [33]

    Differentially Private Fine-Tuning of Language Models

    Yu, D., Naik, S., Backurs, A., Gopi, S., Inan, H. A., Kamath, G., Kulkarni, J., Lee, Y. T., Manoel, A., Wutschitz, L., Yekhanin, S., and Zhang, H. Differentially private fine-tuning of language models. In International Conference on Learning Representations (ICLR), 2022. URL https://arxiv.org/abs/2110.06500

  33. [34]

    Halo: Estimation and Reduction of Hallucinations in Open-Source Weak Large Language Models

    Yuan, Z., Diao, Q., Shen, Y., et al. Halo: Estimation and reduction of hallucinations in open-source weak large language models. arXiv preprint arXiv:2308.11764, 2023. URL https://arxiv.org/abs/2308.11764

  34. [35]

    OPT: Open Pre-trained Transformer Language Models

    Zhang, S., Roller, S., Goyal, N., et al. Opt: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068, 2023