pith. sign in

arxiv: 2606.22578 · v1 · pith:G4AXSY3Nnew · submitted 2026-06-21 · 💻 cs.CL · cs.AI

Context-Aware Distillation and Ablation for Text2DSL

Pith reviewed 2026-06-26 10:19 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords context-aware distillationtext-to-dslpolkit rulesablation studystructured contextdsl generationsyntax validationllm distillation
0
0 comments X

The pith

Structured context makes text-to-DSL generation robust where prompt-only baselines fail on harder data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that context-aware distillation, supplying a teacher model with explicit BNF grammar, API specification, and identifier vocabulary, produces a scaled corpus of 10,073 verified natural-language-to-Polkit-rule pairs at near-perfect validity. Ablation across eight context combinations on a harder test corpus demonstrates that baseline performance collapses while context-enhanced performance holds steady, establishing structured context as load-bearing. Vocabulary contributes the largest gain to semantic quality while API and BNF drive structural validity. The work therefore treats context not as an add-on but as the mechanism that prevents degradation under increased difficulty.

Core claim

Context-aware distillation scales the PolkitBench corpus from 4,204 to 10,073 pairs at 100.0% AST validity and 99.7% runtime pass rate. Factorial ablation on the harder corpus shows baseline syntax validity falling from 97.6% to 58.5% while context mode falls only from 98.6% to 97.4%, with vocabulary giving the largest combined-score lift (+0.198) and API (+24.7 pp) plus BNF (+22.3 pp) giving the largest structural-validity lifts; the full context condition C7 is strongest on every metric.

What carries the argument

Structured context of BNF grammar, API specification, and closed identifier vocabulary supplied to the teacher LLM during distillation, followed by two-tier AST and runtime verification.

If this is right

  • The full context condition C7 outperforms every partial context on all metrics.
  • Vocabulary supplies the largest semantic-quality improvement (+0.198 combined score).
  • API specification supplies the largest structural-validity gain (+24.7 percentage points).
  • BNF grammar supplies the next largest structural-validity gain (+22.3 percentage points).
  • Context-enhanced generation maintains high performance when test-corpus difficulty increases.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same distillation pattern could generate verified rules for other policy or security DSLs that have formal grammars.
  • Constraining the teacher with vocabulary and grammar may reduce hallucinated identifiers across broader LLM code-generation settings.
  • Repeating the ablation on different model families would test whether vocabulary remains the dominant semantic contributor.

Load-bearing premise

The two-tier verification pipeline using esprima for AST and polkitd/pkcheck for runtime correctly identifies all valid and secure rules without systematic misses or false positives.

What would settle it

Finding a generated rule that passes both AST validation and runtime acceptance yet violates the intended security policy or introduces an undetected vulnerability when executed by the production daemon.

Figures

Figures reproduced from arXiv: 2606.22578 by Alexander M. Nazimov, Alexander V. Kozachok, Shamil G. Magomedov.

Figure 1
Figure 1. Figure 1: GigaChat-10B on PolkitBench-v1 vs. PolkitBench-v2 in the baseline and con￾text modes. The harder PolkitBench-v2 collapses the baseline mode on every metric while the context mode degrades only marginally. 0.829 → 0.788 (−0.041), Jaccard 0.736 → 0.660 (−0.076), Combined 0.801 → 0.750 (−0.051). The asymmetry of the two degradation profiles — catastrophic for baseline, marginal for context — is the main quant… view at source ↗
Figure 2
Figure 2. Figure 2: Context-induced gain on PolkitBench-v1 vs. PolkitBench-v2 (GigaChat-10B). The gain is larger on the harder corpus on every metric. To pinpoint which of the four binary tests drives the v1 regression we decom￾pose the 11.7 pp drop. The first three tests — code extraction, syntactic validity and structural validity — equal or exceed the baseline in the context mode: Syntax rises from 97.57% to 98.60% and Str… view at source ↗
Figure 3
Figure 3. Figure 3: Per-component ablation of structured context on GigaChat-10B and PolkitBench-v2. The full context C7 is uniformly best; the strongest partial condi￾tions C5 and C6 both contain the vocabulary [PITH_FULL_IMAGE:figures/full_fig_p014_3.png] view at source ↗
read the original abstract

We extend our prior work on Text2DSL automatic generation of domain-specific language (DSL) code from natural language descriptions along two complementary axes. First, we replace prompt-only synthetic generation with context-aware distillation, in which a teacher large language model (DeepSeek-V4-Flash) operates under an explicitly defined structured context comprising a BNF grammar, an API specification, and a closed identifier vocabulary; the resulting corpus is verified by a two-tier pipeline combining AST validation through esprima and runtime acceptance through the production polkitd daemon and the pkcheck client. This scales the verified PolkitBench corpus from 4,204 to 10,073 natural-language-to-Polkit-rule pairs at 100.0% AST validity and 99.7% runtime pass rate. Second, we conduct the per-component factorial ablation of structured context that was identified as future work in the precursor study: eight conditions C0-C7 are evaluated on GigaChat-10B-A1.8B with the new corpus. Three findings emerge. (i) The new harder corpus collapses the baseline mode (Syntax Valid 97.6% -> 58.5%, Combined Score 0.482 -> 0.252), whereas the context-enhanced mode degrades only marginally (Syntax 98.6% -> 97.4%, Combined 0.801 -> 0.750), confirming that structured context is not a cosmetic improvement but a load-bearing mechanism. (ii) The best absolute condition is the full context C7 across all metrics, while the strongest partial conditions (C5 = BNF + Vocabulary, C6 = API + Vocabulary) both contain the vocabulary. (iii) A Shapley-style decomposition assigns the largest semantic-quality effect to the vocabulary (Combined +0.198), the largest structural-validity effects to API (+24.7 pp) and BNF (+22.3 pp).

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper extends prior Text2DSL work by introducing context-aware distillation, where a teacher LLM (DeepSeek-V4-Flash) generates Polkit rules under explicit structured context (BNF grammar, API specification, closed identifier vocabulary). This produces an expanded PolkitBench corpus of 10,073 NL-to-rule pairs at 100% AST validity (esprima) and 99.7% runtime pass rate (polkitd/pkcheck). A factorial ablation (C0-C7) on GigaChat-10B-A1.8B with the new corpus shows the baseline mode collapsing on the harder corpus (Syntax Valid 97.6%→58.5%, Combined 0.482→0.252) while context-enhanced modes degrade only marginally, with full context (C7) best and Shapley decomposition attributing largest effects to vocabulary (+0.198 Combined), API (+24.7 pp structural), and BNF (+22.3 pp structural).

Significance. If the two-tier verification pipeline is reliable, the work provides concrete evidence that structured context is load-bearing for DSL generation robustness rather than cosmetic, with quantitative per-component attribution via ablation and Shapley-style analysis on a scaled, verified corpus. The explicit scaling from 4,204 to 10,073 pairs and the differential degradation on the harder corpus are notable empirical contributions.

major comments (2)
  1. [Abstract] Abstract and methods: The central claim—that the new corpus exposes context as load-bearing because baseline syntax-valid rate collapses from 97.6% to 58.5% while context-enhanced drops only from 98.6% to 97.4%—depends entirely on the two-tier pipeline (esprima AST + polkitd/pkcheck runtime) correctly labeling semantic validity and security. No independent validation (human review of a sample, error analysis of false positives, or comparison against manual Polkit rules) is reported to confirm that the 99.7% runtime pass rate corresponds to human-judged correctness rather than undetected Polkit-specific semantic errors.
  2. [Ablation section] Ablation results: The Shapley-style decomposition reports specific attributions (vocabulary +0.198 Combined, API +24.7 pp, BNF +22.3 pp), but the exact computation method, base values, and whether it was performed on the new 10,073-pair corpus versus the prior one are not specified, making it impossible to verify if the attributions are robust to the verification pipeline.
minor comments (2)
  1. [Ablation section] The eight conditions C0-C7 are referenced but their exact definitions (which components are included in each) are not listed in the abstract; a table would improve clarity.
  2. [Results] No statistical significance tests or confidence intervals are mentioned for the reported deltas (e.g., 58.5% vs 97.4%).

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and for recognizing the empirical contributions of the scaled corpus and ablation study. We address each major comment below and indicate planned revisions to improve clarity and rigor.

read point-by-point responses
  1. Referee: [Abstract] Abstract and methods: The central claim—that the new corpus exposes context as load-bearing because baseline syntax-valid rate collapses from 97.6% to 58.5% while context-enhanced drops only from 98.6% to 97.4%—depends entirely on the two-tier pipeline (esprima AST + polkitd/pkcheck runtime) correctly labeling semantic validity and security. No independent validation (human review of a sample, error analysis of false positives, or comparison against manual Polkit rules) is reported to confirm that the 99.7% runtime pass rate corresponds to human-judged correctness rather than undetected Polkit-specific semantic errors.

    Authors: We acknowledge that the manuscript does not report independent human validation of the two-tier pipeline beyond the automated checks. While runtime execution via the production polkitd daemon offers direct evidence of functional behavior, we agree that this leaves open the possibility of undetected semantic or security issues. In the revision we will add a human evaluation section: a random sample of 150 rules drawn from the 10,073-pair corpus will be reviewed by two domain experts (with inter-annotator agreement reported via Cohen’s kappa) against the original natural-language intent and Polkit security guidelines. This will be presented alongside the existing pipeline metrics. revision: yes

  2. Referee: [Ablation section] Ablation results: The Shapley-style decomposition reports specific attributions (vocabulary +0.198 Combined, API +24.7 pp, BNF +22.3 pp), but the exact computation method, base values, and whether it was performed on the new 10,073-pair corpus versus the prior one are not specified, making it impossible to verify if the attributions are robust to the verification pipeline.

    Authors: We agree the description of the Shapley-style decomposition is underspecified. The values were obtained on the new 10,073-pair corpus by treating the eight conditions (C0–C7) as a factorial design and applying the standard Shapley value formula for cooperative games, with C0 (no context) as the base value. We will insert a new subsection (and appendix with the full attribution matrix) that states the exact formula, confirms the corpus used, lists the base values for each metric, and notes that all attributions are computed after the two-tier verification filter. revision: yes

Circularity Check

0 steps flagged

No significant circularity in the derivation chain

full rationale

The paper reports empirical ablation results (Syntax Valid rates, Combined Scores) measured on a new corpus of 10,073 pairs constructed via teacher-model distillation and verified by an external two-tier pipeline (esprima AST validation plus polkitd/pkcheck runtime acceptance at 99.7% pass rate). The sole reference to prior work notes only that the factorial ablation design was flagged as future work in the precursor study; this citation is not load-bearing for any metric or claim. No equations, self-definitions, fitted parameters renamed as predictions, or imported uniqueness results appear. All reported differentials are direct experimental outcomes on the verified corpus and GigaChat-10B model under the eight context conditions.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

Central claim depends on the teacher model producing useful examples under the supplied context and on the verification pipeline being a faithful oracle for correctness; no free parameters or invented entities are introduced in the abstract.

axioms (2)
  • domain assumption The two-tier verification (esprima AST + polkitd/pkcheck runtime) accurately labels generated rules as correct or incorrect.
    Used to filter the distilled corpus and to compute all reported pass rates.
  • domain assumption DeepSeek-V4-Flash can generate higher-quality DSL examples when supplied with explicit BNF, API, and vocabulary context than without.
    Foundation of the context-aware distillation step that produces the 10,073-pair corpus.

pith-pipeline@v0.9.1-grok · 5891 in / 1415 out tokens · 39814 ms · 2026-06-26T10:19:17.398081+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

32 extracted references · 12 canonical work pages

  1. [1]

    In: Proc

    Kozachok, A.V., Nazimov, A.A., Magomedov, Sh.G.: Text2DSL: LLM-based code generation for domain-specific languages. In: Proc. 30th Int. Conf. Knowledge- Based and Intelligent Information & Engineering Systems (KES 2026). Procedia Computer Science (2026)

  2. [2]

    https://www.freedesktop.org/software/ polkit/docs/latest/ (accessed 2026-05-12)

    freedesktop.org: Polkit reference manual. https://www.freedesktop.org/software/ polkit/docs/latest/ (accessed 2026-05-12)

  3. [3]

    NAI Labs Report 01-043, NAI Labs, Network Associates Inc

    Smalley, S., Vance, C., Salamon, W.: Implementing SELinux as a Linux security module. NAI Labs Report 01-043, NAI Labs, Network Associates Inc. (2001)

  4. [4]

    In: Proc

    Loscocco, P., Smalley, S.: Integrating flexible support for security policies into the Linux operating system. In: Proc. FREENIX Track, USENIX Annual Technical Conf., pp. 29–42 (2001)

  5. [5]

    https://www

    The Open Policy Agent Authors: OPA documentation. https://www. openpolicyagent.org/docs/latest/ (accessed 2026-05-12)

  6. [6]

    https://developer.hashicorp.com/ terraform/language (accessed 2026-05-12)

    HashiCorp: Terraform language documentation. https://developer.hashicorp.com/ terraform/language (accessed 2026-05-12)

  7. [7]

    In: 2019 24th International Conference on Engineering of Complex Computer Systems (ICECCS), pp

    Rahman, A., Parnin, C., Williams, L.: The seven sins: security smells in infrastruc- ture as code scripts. In: Proc. 41st Int. Conf. on Software Engineering (ICSE), pp. 164–175 (2019). doi:10.1109/ICSE.2019.00033 20 Kozachok, Nazimov, Magomedov

  8. [8]

    In: Proc

    Saavedra, N., Ferreira, J.F.: GLITCH: Automated polyglot security smell detection in infrastructure as code. In: Proc. 37th IEEE/ACM Int. Conf. on Automated Software Engineering (ASE), Art. No. 19, pp. 1–12 (2022). doi:10.1145/3551349. 3556945

  9. [9]

    ACM Computing Surveys 37(4), 316–344 (2005)

    Mernik, M., Heering, J., Sloane, A.M.: When and how to develop domain-specific languages. ACM Computing Surveys 37(4), 316–344 (2005). doi:10.1145/1118890. 1118892

  10. [10]

    Spider: A large-scale human-labeled dataset for complex and cross-domain semantic parsing and text-to-sql task

    Yu, T., Zhang, R., Yang, K., Yasunaga, M., Wang, D., Li, Z., Ma, J., Li, I., Yao, Q., Roman, S., Zhang, Z., Radev, D.: Spider: a large-scale human-labeled dataset for complex and cross-domain semantic parsing and Text-to-SQL task. In: Proc. 2018 Conf. on Empirical Methods in Natural Language Processing (EMNLP), pp. 3911–3921 (2018). doi:10.18653/v1/D18-1425

  11. [11]

    In: Advances in Neural Information Processing Systems 36 (NeurIPS 2023), pp

    Pourreza, M., Rafiei, D.: DIN-SQL: decomposed in-context learning of Text-to- SQL with self-correction. In: Advances in Neural Information Processing Systems 36 (NeurIPS 2023), pp. 36339–36348 (2023)

  12. [12]

    In: Proc

    Li,H.,Zhang,J.,Li,C.,Chen,H.:RESDSQL:decouplingschemalinkingandskele- ton parsing for Text-to-SQL. In: Proc. 37th AAAI Conf. on Artificial Intelligence, pp. 13067–13075 (2023). doi:10.1609/aaai.v37i11.26535

  13. [13]

    2021 , address =

    Scholak, T., Schucher, N., Bahdanau, D.: PICARD: parsing incrementally for con- strained auto-regressive decoding from language models. In: Proc. 2021 Conf. on Empirical Methods in Natural Language Processing (EMNLP), pp. 9895–9901 (2021). doi:10.18653/v1/2021.emnlp-main.779

  14. [14]

    In: Proc

    Poesia, G., Polozov, O., Le, V., Tiwari, A., Soares, G., Meek, C., Gulwani, S.: Synchromesh: reliable code generation from pre-trained language models. In: Proc. 10th Int. Conf. on Learning Representations (ICLR) (2022)

  15. [15]

    In: Findings of the Association for Computational Linguistics: EMNLP 2023, pp

    Geng, S., Josifoski, M., Peyrard, M., West, R.: Grammar-constrained decoding for structured NLP tasks without finetuning. In: Findings of the Association for Computational Linguistics: EMNLP 2023, pp. 10932–10952 (2023). doi:10.18653/ v1/2023.findings-emnlp.733

  16. [16]

    In: Findings of the Association for Computational Linguistics: EMNLP 2020, pp

    Feng, Z., Guo, D., Tang, D., Duan, N., Feng, X., Gong, M., Shou, L., Qin, B., Liu, T., Jiang, D., Zhou, M.: CodeBERT: a pre-trained model for programming and natural languages. In: Findings of the Association for Computational Linguistics: EMNLP 2020, pp. 1536–1547 (2020). doi:10.18653/v1/2020.findings-emnlp.139

  17. [17]

    In: Proc

    Guo, D., Ren, S., Lu, S., Feng, Z., Tang, D., Liu, S., Zhou, L., Duan, N., Svy- atkovskiy, A., Fu, S., Tufano, M., Deng, S.K., Clement, C., Drain, D., Sundaresan, N., Yin, J., Jiang, D., Zhou, M.: GraphCodeBERT: pre-training code representa- tions with data flow. In: Proc. 9th Int. Conf. on Learning Representations (ICLR) (2021)

  18. [18]

    Joty, and Steven C

    Wang, Y., Wang, W., Joty, S., Hoi, S.C.H.: CodeT5: identifier-aware unified pre- trained encoder-decoder models for code understanding and generation. In: Proc. 2021 Conf. on Empirical Methods in Natural Language Processing (EMNLP), pp. 8696–8708 (2021). doi:10.18653/v1/2021.emnlp-main.685

  19. [19]

    Li, R., Allal, L.B., Zi, Y., Muennighoff, N., Kocetkov, D., Mou, C., Marone, M., et al.: StarCoder: may the source be with you! Transactions on Machine Learning Research (TMLR), 2023 (2023)

  20. [20]

    In: Proc

    Nijkamp,E.,Pang,B.,Hayashi,H.,Tu,L.,Wang,H.,Zhou,Y.,Savarese,S.,Xiong, C.: CodeGen: an open large language model for code with multi-turn program synthesis. In: Proc. 11th Int. Conf. on Learning Representations (ICLR) (2023)

  21. [21]

    In: Proc

    Jain, N., Vaidyanath, S., Iyer, S., Natarajan, N., Parthasarathy, S., Rajamani, S., Sharma, R.: Jigsaw: large language models meet program synthesis. In: Proc. 44th Context-Aware Distillation and Ablation for Text2DSL 21 Int. Conf. on Software Engineering (ICSE), pp. 1219–1231 (2022). doi:10.1145/ 3510003.3510203

  22. [22]

    In: NeurIPS Deep Learning and Representation Learning Workshop (2014)

    Hinton, G., Vinyals, O., Dean, J.: Distilling the knowledge in a neural network. In: NeurIPS Deep Learning and Representation Learning Workshop (2014)

  23. [23]

    In- ternational Journal of Computer Vision 129(6), 1789–1819 (2021)

    Gou, J., Yu, B., Maybank, S.J., Tao, D.: Knowledge distillation: a survey. In- ternational Journal of Computer Vision 129(6), 1789–1819 (2021). doi:10.1007/ s11263-021-01453-z

  24. [24]

    In: Findings of the Association for Computational Linguistics: EMNLP 2020, pp

    Jiao, X., Yin, Y., Shang, L., Jiang, X., Chen, X., Li, L., Wang, F., Liu, Q.: Tiny- BERT: distilling BERT for natural language understanding. In: Findings of the Association for Computational Linguistics: EMNLP 2020, pp. 4163–4174 (2020). doi:10.18653/v1/2020.findings-emnlp.372

  25. [25]

    Smith, Daniel Khashabi, and Hannaneh Hajishirzi

    Wang, Y., Kordi, Y., Mishra, S., Liu, A., Smith, N.A., Khashabi, D., Hajishirzi, H.: Self-Instruct: aligning language models with self-generated instructions. In: Proc. 61st Annual Meeting of the Association for Computational Linguistics (ACL), Vol. 1: Long Papers, pp. 13484–13508 (2023). doi:10.18653/v1/2023.acl-long.754

  26. [26]

    In: Proc

    Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Lin, Q., Jiang, D.: WizardLM: empowering large pre-trained language models to follow complex instructions. In: Proc. 12th Int. Conf. on Learning Representations (ICLR) (2024)

  27. [27]

    In: Proc

    Luo,Z.,Xu,C.,Zhao,P.,Sun,Q.,Geng,X.,Hu,W.,Tao,C.,Ma,J.,Lin,Q.,Jiang, D.: WizardCoder: empowering code large language models with Evol-Instruct. In: Proc. 12th Int. Conf. on Learning Representations (ICLR) (2024)

  28. [28]

    In: Advances in Neural Information Processing Systems 33 (NeurIPS 2020), pp

    Lewis, P., Perez, E., Piktus, A., Petroni, F., Karpukhin, V., Goyal, N., Küttler, H., Lewis, M., Yih, W.-T., Rocktäschel, T., Riedel, S., Kiela, D.: Retrieval-augmented generation for knowledge-intensive NLP tasks. In: Advances in Neural Information Processing Systems 33 (NeurIPS 2020), pp. 9459–9474 (2020)

  29. [29]

    24824–24837 (2022)

    Wei, J., Wang, X., Schuurmans, D., Bosma, M., Ichter, B., Xia, F., Chi, E.H., Le, Q.V., Zhou, D.: Chain-of-thought prompting elicits reasoning in large language models.In:AdvancesinNeuralInformationProcessingSystems35(NeurIPS2022), pp. 24824–24837 (2022)

  30. [30]

    Bleu: a method for automatic evaluation of machine translation

    Papineni, K., Roukos, S., Ward, T., Zhu, W.-J.: BLEU: a method for automatic evaluationofmachinetranslation.In:Proc.40thAnnualMeetingoftheAssociation for Computational Linguistics (ACL), pp. 311–318 (2002). doi:10.3115/1073083. 1073135

  31. [31]

    In: Proc

    Eghbali, A., Pradel, M.: CrystalBLEU: precisely and efficiently measuring the sim- ilarity of code. In: Proc. 37th IEEE/ACM Int. Conf. on Automated Software En- gineering (ASE), Art. No. 28, pp. 1–12 (2022). doi:10.1145/3551349.3556903

  32. [32]

    In: Advances in Neural Information Processing Systems 33 (NeurIPS 2020), pp

    Brown, T.B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Nee- lakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few- shot learners. In: Advances in Neural Information Processing Systems 33 (NeurIPS 2020), pp. 1877–1901 (2020)