Empirical Study for Structured Output Control in LLMs for Software Engineering

Jacques Klein; Prateek Rajput; Saad Ezzini; Tegawendé F. Bissyandé; Tiezhu Sun; Yewei Song

arxiv: 2606.09395 · v1 · pith:BKLLJ5IDnew · submitted 2026-06-08 · 💻 cs.SE

Pith reviewed 2026-06-27 15:40 UTC · model grok-4.3

classification 💻 cs.SE

keywords: LLM, structured output, software engineering, syntax errors, grammar-constrained decoding, semantic errors, output control, template matching

The pith

Template-driven control in LLMs nearly eliminates syntax errors on software engineering tasks but leaves structural and semantic errors largely intact.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether techniques that force LLMs to follow output formats can make their results usable in real software engineering pipelines that demand exact structures. It runs four representative tasks through three mitigation methods, one of which is a strict template-driven approach called Template Token Match Generation (TTMG). The data show that syntax errors drop sharply with TTMG, yet structural mismatches and semantic mistakes stay common. This distinction matters because a correctly intended answer that violates the expected format is rejected by downstream tools just like a wrong answer. The work therefore argues that format enforcement alone cannot solve the reliability problem in LLM-driven SE workflows.

Core claim

On four software engineering tasks, grammar-constrained decoding, regex validation, and TTMG were compared. TTMG nearly eliminates syntax errors, yet substantial structural and semantic errors persist, demonstrating that the core bottleneck lies beyond syntax formatting. A case study shows how the remaining errors propagate into downstream failures. The findings indicate that current structure-enforcing tools are necessary but insufficient.

What carries the argument

Template Token Match Generation (TTMG), a strict template-driven decoding method that forces token-by-token adherence to a predefined output skeleton.
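The general idea of forcing a fixed skeleton and letting the model fill only the gaps can be sketched as follows. This is a toy illustration, not the paper's TTMG implementation: the `Slot` class, the `template_decode` function, the scripted ranker that stands in for an LLM, and the function-call template are all assumptions made for the example.

```python
import json
from dataclasses import dataclass
from typing import Callable, List, Union


@dataclass
class Slot:
    """A free region of the template that the model is allowed to fill."""
    name: str
    stop: str                       # token that ends the slot (not appended)
    allowed: Callable[[str], bool]  # per-token filter for this slot
    max_tokens: int = 16


Template = List[Union[str, Slot]]
Ranker = Callable[[str, str], List[str]]  # (prefix, slot name) -> ranked tokens


def template_decode(template: Template, rank_tokens: Ranker) -> str:
    out = ""
    for part in template:
        if isinstance(part, str):
            out += part  # literal skeleton text is forced, never sampled
            continue
        for _ in range(part.max_tokens):
            candidates = rank_tokens(out, part.name)
            # Take the best-ranked candidate that is the stop token or passes
            # the slot filter; fall back to the stop token if none qualifies.
            tok = next(
                (t for t in candidates if t == part.stop or part.allowed(t)),
                part.stop,
            )
            if tok == part.stop:
                break  # the next literal supplies the closing delimiter
            out += tok
    return out


def make_toy_ranker(script):
    """Hypothetical stand-in for an LLM: replays ranked candidates per slot."""
    steps = {name: iter(ranked) for name, ranked in script.items()}
    return lambda prefix, slot_name: next(steps[slot_name], [])


template = [
    '{"function": "',
    Slot("name", stop='"', allowed=lambda t: t.replace("_", "").isalnum()),
    '", "args": {"id": ',
    Slot("id", stop="}", allowed=str.isdigit),
    "}}",
]
script = {
    "name": [["get", "fetch"], ["_user"], ['"']],
    "id": [["forty", "4"], ["2"], ["}"]],
}
result = template_decode(template, make_toy_ranker(script))
parsed = json.loads(result)  # parses because the skeleton was forced
```

In this sketch the output always parses, because the braces, quotes, and keys never pass through the model. Nothing in the decoder checks that `get_user` is the function the task actually called for, which is the kind of residual error the paper reports.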

If this is right

Residual structural and semantic errors cascade into failures when LLM outputs are fed into toolchains and APIs.
Structure-enforcing methods must be paired with mechanisms that also verify semantic correctness.
Autoregressive generation's local focus creates fragility whenever target formats differ from common training data.
Deploying LLMs in practice requires outputs that satisfy both format contracts and intended meaning.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same pattern may appear in other domains that impose rigid output schemas, such as database queries or API calls.
Future work could test whether fine-tuning on structure-aware data reduces the semantic gap that decoding controls leave behind.
If semantic errors prove harder to fix than structural ones, model training rather than inference-time constraints may become the primary lever.

Load-bearing premise

The four chosen SE tasks and three mitigation techniques are representative enough to conclude that structure-enforcing tools are necessary but insufficient in general.

What would settle it

An experiment on the same four tasks in which any single structure-enforcing method also reduces structural and semantic error rates by a large margin would falsify the claim.

Original abstract

LLM-generated outputs in software engineering rarely exist in isolation. They must plug into toolchains, APIs, and data pipelines that impose strict, often organization-specific structural contracts. A semantically correct output that violates the expected format is, from the consuming system's perspective, indistinguishable from a wrong answer, making structural fidelity an operational prerequisite for deploying LLMs in practice. Yet current models routinely produce syntactically invalid or structurally non-compliant outputs. Unlike encoders, autoregressive decoders generate text token-by-token with a local rather than global focus, amplifying structural fragility whenever the target format deviates from familiar training distributions. We present a systematic evaluation of structural reliability across four representative SE tasks, categorizing failures into syntax, structural, and semantic errors. We benchmark ways of mitigation targeting the decoder: grammar-constrained decoding, regex-based validation, and a strict template-driven control (Template Token Match Generation, TTMG) to isolate the sources of these failures. TTMG nearly eliminates syntax errors, yet substantial structural and semantic errors persist, demonstrating that the core bottleneck lies beyond syntax formatting. A detailed case study further illustrates how residual errors cascade in downstream workflows. Our findings show that current structure-enforcing tools are necessary but insufficient, and highlight the need for approaches that jointly ensure structural fidelity and semantic correctness in LLM-driven workflows.
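As a rough illustration of regex-based validation, one of the three mitigations named in the abstract, the sketch below checks a model output against a format contract after generation. The pipe-delimited log-line contract, the `LOG_LINE` pattern, and the `validate` helper are hypothetical and are not taken from the paper.

```python
import re
from typing import Dict, Optional

# Hypothetical contract: exactly one line of the form LEVEL|component|message
LOG_LINE = re.compile(
    r"(?P<level>INFO|WARN|ERROR)\|(?P<component>[a-z_]+)\|(?P<message>[^|\n]+)"
)


def validate(output: str) -> Optional[Dict[str, str]]:
    """Return the parsed fields if the whole output matches the contract."""
    match = LOG_LINE.fullmatch(output.strip())
    return match.groupdict() if match else None


# Well-formed output: accepted.
ok = validate("ERROR|auth_service|token expired")

# Chatty preamble breaks the contract: rejected.
rejected = validate("Sure! Here is the line: ERROR|auth_service|token expired")

# Matches the pattern, but the level may still be wrong for the incident.
# The regex cannot tell; that is a semantic question.
plausible = validate("INFO|auth_service|token expired")
```

The third call shows the limit the abstract points at: a validator of this kind can reject malformed output, but it accepts any output that has the right shape, whether or not the content is correct.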

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

TTMG cuts syntax errors in LLM SE outputs but leaves structural and semantic ones, with the error split needing explicit rules to back the claim.

The main point here is that TTMG nearly wipes out syntax errors across the four SE tasks but structural and semantic errors stay common, so format control alone does not solve the integration problem in real pipelines.

The work introduces TTMG as a strict template method and applies a three-way error split (syntax, structural, semantic) plus a cascading case study. That gives a direct empirical comparison against grammar-constrained decoding and regex validation, and the finding that syntax fixes do not touch the deeper issues is new relative to prior constrained-decoding papers.

It does a solid job framing the practical barrier for toolchains and showing that the bottleneck moves beyond token-level syntax once you enforce templates. The case study adds a useful concrete illustration of downstream effects.

The soft spot is the taxonomy itself. The abstract and description do not give a clear, reproducible decision procedure for labeling an error as structural versus semantic after TTMG has run, so it is possible some residual errors are still template violations that the method did not fully constrain. The four tasks are reasonable but narrow; without more variety or statistical detail on metrics and tests, the generalization that structure tools are necessary but insufficient rests on limited evidence.

This is for people building LLM-based SE tools who need to know where format enforcement stops working. It has enough empirical grounding and a clear practical question to deserve peer review, though the classification criteria will need tightening.

Referee Report

1 major / 0 minor

Summary. The paper claims that LLM outputs for software engineering tasks frequently violate structural contracts required by downstream toolchains, and that while mitigation techniques like Template Token Match Generation (TTMG) nearly eliminate syntax errors, substantial structural and semantic errors persist. Through evaluation on four representative SE tasks using grammar-constrained decoding, regex validation, and TTMG, plus a case study of error cascading, the work concludes that current structure-enforcing tools are necessary but insufficient, highlighting the need for approaches addressing both structural fidelity and semantic correctness.

Significance. If the empirical distinction between eliminated syntax errors and persistent deeper errors holds, the study supplies concrete evidence that decoder-level formatting constraints alone cannot guarantee usable outputs in SE pipelines. This strengthens the case for research on joint structural-semantic controls and provides benchmarks plus a workflow case study that could inform tool design for LLM integration in software engineering.

major comments (1)

[Error taxonomy / methods section] The central claim—that TTMG removes syntax errors while leaving genuine structural and semantic errors—depends on a reproducible separation of error types after mitigation. The manuscript's description of the error taxonomy (abstract and methods) does not supply an explicit decision procedure or classification rubric independent of the mitigation technique itself; without it, residual errors could still reflect incomplete template constraints rather than deeper failures, weakening the inference that the bottleneck lies beyond syntax formatting.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive comment regarding the error taxonomy. The concern about ensuring an explicit, reproducible classification procedure independent of the mitigation techniques is valid and will improve the clarity of our claims. We address the point below and commit to a revision that adds the requested detail.

Referee: The central claim—that TTMG removes syntax errors while leaving genuine structural and semantic errors—depends on a reproducible separation of error types after mitigation. The manuscript's description of the error taxonomy (abstract and methods) does not supply an explicit decision procedure or classification rubric independent of the mitigation technique itself; without it, residual errors could still reflect incomplete template constraints rather than deeper failures, weakening the inference that the bottleneck lies beyond syntax formatting.

Authors: We agree that an explicit decision procedure strengthens the paper. The taxonomy is defined independently of any mitigation: (1) syntax errors fail basic parsing according to the language grammar (e.g., invalid JSON, unbalanced brackets); (2) structural errors parse successfully but violate the task-specific schema or template (e.g., missing required keys, incorrect nesting or cardinality); (3) semantic errors match both syntax and structure but contain incorrect content relative to the task specification (e.g., wrong values or logic). Classification is performed post-generation by automated validators plus manual review against the full task requirements, not against the mitigation template alone. TTMG templates are derived directly from the structural contracts of each SE task; any remaining structural violations after TTMG therefore indicate failures beyond the enforced template (such as token-level mismatches or unmodeled constraints). Nevertheless, we acknowledge the methods section would benefit from a dedicated subsection with a decision tree, per-task examples, and inter-annotator agreement statistics. We will add this in the revision.

Revision: yes
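The three-way decision procedure in the simulated response above can be sketched as a small classifier. This is a minimal illustration under stated assumptions, not the authors' validator: it assumes JSON output, and the `classify` function, the `required` schema, and the `expected` reference values are hypothetical.

```python
import json
from typing import Any, Dict


def classify(output: str, required: Dict[str, type], expected: Dict[str, Any]) -> str:
    """Label an output as 'syntax', 'structural', 'semantic', or 'ok'."""
    # (1) Syntax: does the output parse at all?
    try:
        parsed = json.loads(output)
    except json.JSONDecodeError:
        return "syntax"

    # (2) Structure: does it satisfy the (hypothetical) task schema?
    if not isinstance(parsed, dict):
        return "structural"
    for key, expected_type in required.items():
        if key not in parsed or not isinstance(parsed[key], expected_type):
            return "structural"

    # (3) Semantics: is the content what the task asked for?
    if any(parsed.get(key) != value for key, value in expected.items()):
        return "semantic"
    return "ok"


required = {"function": str, "args": dict}
expected = {"function": "get_user", "args": {"id": 42}}

labels = [
    classify('{"function": "get_user", "args": {"id": 42}', required, expected),
    classify('{"function": "get_user"}', required, expected),
    classify('{"function": "delete_user", "args": {"id": 42}}', required, expected),
    classify('{"function": "get_user", "args": {"id": 42}}', required, expected),
]
```

The checks run in a fixed order, so each output receives exactly one label, and the label does not depend on which mitigation produced the output. A real rubric would also need the manual review step the response mentions, since exact comparison against a single reference is too strict for many tasks.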

Circularity Check

0 steps flagged

No circularity: purely empirical reporting of observations

The paper is an empirical study that reports direct experimental results on four SE tasks, error categorizations, and mitigation techniques without any equations, derivations, fitted parameters, predictions, or mathematical claims. Claims rest on observed outputs rather than any self-referential construction or self-citation chain. The central finding that TTMG reduces syntax errors but not all structural/semantic errors is a direct observation from the experiments and does not reduce to its inputs by definition or prior self-work. This is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Empirical evaluation study with no mathematical derivations, fitted parameters, or new theoretical entities; relies on experimental observations and standard SE task definitions.

