pith. the verified trust layer for science.

arxiv: 2507.11687 · v4 · submitted 2025-07-15 · 💻 cs.SE · cs.CL · cs.LG

MetaLint: Easy-to-Hard Generalization for Code Linting

Pith reviewed 2026-05-19 04:02 UTC · model grok-4.3

classification 💻 cs.SE · cs.CL · cs.LG
keywords code linting · meta-learning · generalization · best practices · synthetic data · instruction following · Python Enhancement Proposals · F-score

The pith

Models trained on synthetic data from simple linters generalize to detect harder, context-dependent code best practices.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents MetaLint, which trains language models for code linting by framing the task as instruction following: the model is given a best practice described in natural language and checks whether code adheres to it. Models learn from easy examples generated by automatic linters and then apply that skill to harder rules for which no automatic linter exists. On a new benchmark of hard practices drawn from Python Enhancement Proposals (PEPs), this yields large gains: Qwen3-4B improves its detection F-score 2.7-fold (25.9% to 70.4%) and matches larger models such as o3-mini on localization.

Core claim

MetaLint formulates code linting as an instruction-following task in which the model checks whether code follows a provided natural language specification of best practices. Trained only on synthetic data from automatic linters, it generalizes to a human-curated set of harder best practices inspired by PEPs. For Qwen3-4B, the detection F-score rises 2.7x, from 25.9% to 70.4%, alongside a 26.7% localization F-score.
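The headline ratio can be checked directly from the two reported F-scores. The snippet below is only an arithmetic sanity check; the variable names are illustrative and do not come from the paper.

```python
# Reported detection F-scores for Qwen3-4B, in percent (from the core claim).
baseline_f = 25.9
metalint_f = 70.4

# Multiplicative gain and absolute gain in percentage points.
ratio = metalint_f / baseline_f
absolute_gain = metalint_f - baseline_f

print(f"relative gain: {ratio:.2f}x")
print(f"absolute gain: {absolute_gain:.1f} points")
```

The ratio rounds to 2.7, consistent with the reported figure, and corresponds to a gain of 44.5 percentage points.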

What carries the argument

The instruction-following formulation of code linting against a variable natural language specification of best practices, which allows test-time control and generalization without retraining on specific rules.
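A minimal sketch of what such a formulation could look like in practice, assuming a simple prompt layout. The `LintSpec` schema, the prompt wording, and the two example rules are hypothetical and are not taken from the paper's prompt templates. The point is only that the rule is an input, so swapping it at test time requires no retraining.

```python
from dataclasses import dataclass


@dataclass
class LintSpec:
    """A best practice described in natural language (hypothetical schema)."""
    name: str
    description: str


def number_lines(code: str) -> str:
    """Prefix each source line with its 1-based line number."""
    return "\n".join(
        f"{i:>3} {line}" for i, line in enumerate(code.splitlines(), start=1)
    )


def build_prompt(spec: LintSpec, code: str) -> str:
    """Assemble a prompt: the specification first, then the numbered code."""
    return (
        "Check whether the code below violates the following best practice.\n"
        f"Best practice: {spec.name}\n"
        f"Description: {spec.description}\n"
        "Report the violating line numbers, or NO VIOLATIONS FOUND.\n\n"
        f"Code:\n{number_lines(code)}\n"
    )


# Two illustrative rules; the same code is checked against either one.
bare_except = LintSpec("bare-except", "Avoid bare except clauses.")
use_secrets = LintSpec(
    "secrets-for-tokens",
    "Use the secrets module rather than random for security-sensitive tokens.",
)
snippet = "import random\ntoken = random.random()\n"

print(build_prompt(bare_except, snippet))
print(build_prompt(use_secrets, snippet))
```

Only the specification block differs between the two prompts; the code section is identical, which is what makes the rule controllable at test time.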

If this is right

  • Performance gains hold across different programming languages, model families, and scales.
  • Models can enforce new or evolving best practices by simply changing the natural language description at test time.
  • Smaller models can match larger ones like o3-mini on localization tasks for these practices.
  • Releasing the code and benchmark enables further research on easy-to-hard generalization in code tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar meta-learning approaches could help models handle other evolving software engineering tasks beyond linting.
  • Reducing reliance on fixed linter sets might improve adaptability in AI coding assistants.
  • Testing on benchmarks with minimal overlap to training data would strengthen claims of true generalization.

Load-bearing premise

The gains on the human-curated benchmark come from learning generalizable patterns rather than from accidental overlap with the synthetic training data or from the specific prompts and metrics used.

What would settle it

Demonstrating that the hard best practices in the benchmark share substantial rules or examples with the synthetic linter data, or that changing the evaluation prompts eliminates the performance improvement.
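One rough way to probe for such overlap is to compare the natural-language rule descriptions on each side. The sketch below uses bag-of-words cosine similarity as a lightweight stand-in; it is an assumption for illustration, not the paper's method, and a learned embedding model would likely be a better choice. All rule texts are hypothetical.

```python
import math
import re
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Lowercased token counts; a crude proxy for a sentence embedding."""
    return Counter(re.findall(r"[a-z0-9_]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def nearest_training_rule(test_spec: str, train_rules: list[str]) -> tuple[str, float]:
    """Return the training rule most similar to a test specification."""
    test_vec = bag_of_words(test_spec)
    scored = [(rule, cosine(test_vec, bag_of_words(rule))) for rule in train_rules]
    return max(scored, key=lambda pair: pair[1])


# Hypothetical rule descriptions, for illustration only.
train_rules = [
    "Do not use a bare except clause.",
    "Remove unused imports from the module.",
]
test_spec = "Use the secrets module for security-sensitive tokens."

rule, score = nearest_training_rule(test_spec, train_rules)
print(f"closest training rule: {rule!r} (similarity {score:.2f})")
```

A high maximum similarity for many benchmark rules would suggest overlap worth inspecting by hand; a low one does not by itself prove the rules are unrelated, since lexical similarity misses paraphrases.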

Figures

Figures reproduced from arXiv: 2507.11687 by Atharva Naik, Carolyn Rose, Daniel Fried, Darsh Agrawal, Dhakshin Govindarajan, Lawanya Baghel, Yiqing Xie.

Figure 1: METALINT: (1) Synthetic data generation with linters/tools, (2) Supervised Instruction Fine-Tuning (SFT) on this data, and (3) Verifiable Reward Model derived from the linter.
Excerpt (Section 4.1, Evaluation Metrics): We evaluate the LLM's ability to detect idiom violations through two tasks: detection, which assesses whether a given idiom is violated in a code file, and localization, which evaluates whether t…
Figure 2: METALINT: Preference Optimization using reward model: (4) Rejection Sampling Direct Preference Optimization (RS-DPO), and (5) Rejection Sampling Supervised Fine-Tuning (RS-SFT).
Excerpt (continued): violating line numbers. To handle potential class imbalance, we use macro-averaging across idioms and exclude NO VIOLATION as a class to penalize models that only predict NO VIOLATIONS FOUND (such models will score zero on all dete…
Figure 3: ID: In-Domain, NeT: Near Transfer, FaT: Far Transfer.
Figure 4: Distribution of comparative failures of the CoT M
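The metric text that accompanies the Figure 2 caption above describes macro-averaging across idioms while excluding NO VIOLATION as a class. The sketch below is one plausible reading of that description, assuming a single label per example; the function name and the data layout are assumptions, and the paper's exact definition may differ.

```python
from collections import defaultdict

NO_VIOLATION = "NO VIOLATION"


def macro_detection_f1(gold: list[str], pred: list[str]) -> float:
    """Macro-averaged F1 over idiom classes, skipping the NO VIOLATION class."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    tp, fp, fn = defaultdict(int), defaultdict(int), defaultdict(int)
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted class p wrongly
            fn[g] += 1  # missed gold class g
    # Classes are the idioms seen in the gold labels, minus NO VIOLATION.
    idioms = sorted({label for label in gold if label != NO_VIOLATION})
    if not idioms:
        return 0.0
    scores = []
    for idiom in idioms:
        denom = 2 * tp[idiom] + fp[idiom] + fn[idiom]
        scores.append(2 * tp[idiom] / denom if denom else 0.0)
    return sum(scores) / len(scores)


gold = ["A", "A", "B", NO_VIOLATION]
always_clean = [NO_VIOLATION] * len(gold)
print(macro_detection_f1(gold, always_clean))  # a model that never flags anything
print(macro_detection_f1(gold, gold))          # a perfect model
```

Because NO VIOLATION is not averaged in, a model that always answers NO VIOLATION gets no credit for the clean files it labels correctly and scores zero on every idiom class.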
Original abstract

Large language models excel at code generation but struggle with code linting, particularly in generalizing to unseen or evolving best practices beyond those observed during training. We introduce MetaLint, a meta-learning framework that formulates code linting as an instruction-following task, where a model evaluates whether code adheres to a natural language specification of best practices. In contrast to prior work that trains models to detect violations from a fixed set of best practices, MetaLint evaluates code against a provided natural language specification, enabling test-time control over which practices to enforce and generalization to unseen or evolving rules without retraining. We demonstrate that models trained solely on synthetic data generated from automatic linters still generalize to harder, context-dependent best practices for which such linters are not available. To evaluate generalization beyond such easy signals, we introduce a human-curated benchmark of hard best practices inspired by Python Enhancement Proposals (PEPs). On this benchmark, MetaLint substantially improves performance without explicit fine-tuning on target best practices and exhibits strong, easy-to-hard generalization. Qwen3-4B achieves a 2.7x detection F-score gain (25.9% -> 70.4%), the highest recall, and a 26.7% localization F-score, matching larger models such as o3-mini. These gains generalize across programming languages, model families, scales, reasoning settings, and linter sources. We release the code and benchmark to support reproducibility and future work.
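To make the idea of synthetic data generated from automatic linters concrete, here is a toy sketch. It stands in for a real linter with a single hand-written check built on Python's standard `ast` module; the rule text, the function names, and the example schema are all assumptions for illustration, not the paper's pipeline.

```python
import ast

# Hypothetical rule description; real linter rule texts may differ.
BARE_EXCEPT_SPEC = "Do not use a bare 'except:' clause; name the exception type."


def find_bare_excepts(code: str) -> list[int]:
    """Return 1-based line numbers of bare except clauses."""
    tree = ast.parse(code)
    return sorted(
        node.lineno
        for node in ast.walk(tree)
        if isinstance(node, ast.ExceptHandler) and node.type is None
    )


def make_example(code: str) -> dict:
    """Pair a code file with the rule text and the checker's verdict."""
    return {
        "spec": BARE_EXCEPT_SPEC,
        "code": code,
        "violating_lines": find_bare_excepts(code),  # empty list: no violation
    }


violating = "try:\n    risky()\nexcept:\n    pass\n"
clean = "try:\n    risky()\nexcept ValueError:\n    pass\n"

print(make_example(violating)["violating_lines"])
print(make_example(clean)["violating_lines"])
```

Because the label comes from a deterministic check, the same function can also score a model's predicted line numbers, which is the sense in which a linter-derived signal is verifiable.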

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces MetaLint, a meta-learning framework that formulates code linting as an instruction-following task where models evaluate code against natural language best-practice specifications. Models are trained exclusively on synthetic data generated from automatic linters and evaluated for generalization on a new human-curated benchmark of harder, context-dependent practices inspired by Python Enhancement Proposals (PEPs). The authors report a 2.7x detection F-score gain for Qwen3-4B (25.9% to 70.4%), highest recall, 26.7% localization F-score matching larger models, and generalization across languages, model families, scales, and linter sources, with code and benchmark released.

Significance. If the reported easy-to-hard generalization holds after addressing potential distributional overlap, the work would be significant for enabling flexible, test-time controllable code linting without retraining on evolving rules. The open release of the benchmark and implementation is a clear strength supporting reproducibility.

major comments (2)
  1. [Abstract] Abstract: The central claim of generalization to 'harder, context-dependent best practices for which such linters are not available' is load-bearing for the meta-instruction-following interpretation. Without an explicit overlap audit, rule-by-rule mapping, or similarity analysis between the automatic-linter rules used for synthetic training data and the human-curated PEP rules in the test benchmark, it remains possible that the 2.7x F-score gain reflects learning broader linter-adjacent patterns rather than true out-of-distribution meta-generalization.
  2. [Evaluation] Evaluation / Experiments: The abstract reports concrete F-score numbers and cross-language generalization, but the full experimental protocol (prompt templates, exact localization metric definition, and statistical tests for robustness) is not detailed. This makes it difficult to confirm that gains are stable to reasonable variations in evaluation setup.
minor comments (2)
  1. [Abstract] The abstract could more precisely define what constitutes 'easy' versus 'hard' signals in the benchmarks to clarify the easy-to-hard generalization narrative.
  2. Consider adding a limitations section discussing potential failure modes, such as sensitivity to prompt phrasing or performance on very long code contexts.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback and the opportunity to clarify our work. We address each major comment below and will make targeted revisions to improve the manuscript.

Point-by-point responses
  1. Referee: [Abstract] Abstract: The central claim of generalization to 'harder, context-dependent best practices for which such linters are not available' is load-bearing for the meta-instruction-following interpretation. Without an explicit overlap audit, rule-by-rule mapping, or similarity analysis between the automatic-linter rules used for synthetic training data and the human-curated PEP rules in the test benchmark, it remains possible that the 2.7x F-score gain reflects learning broader linter-adjacent patterns rather than true out-of-distribution meta-generalization.

    Authors: We appreciate the referee's emphasis on rigorously establishing the out-of-distribution nature of the generalization. The synthetic training data is generated exclusively from automatic linters that implement straightforward, checkable rules, whereas the human-curated benchmark consists of nuanced, context-dependent practices drawn from PEPs that lack direct automated implementations. To strengthen this distinction, we will add an explicit overlap analysis section that includes (1) a rule-by-rule mapping between training linter rules and test specifications where feasible and (2) semantic similarity metrics (e.g., embedding-based cosine similarity) between the natural-language rule descriptions. This addition will quantify any potential distributional overlap and further support the meta-generalization claim. revision: yes

  2. Referee: [Evaluation] Evaluation / Experiments: The abstract reports concrete F-score numbers and cross-language generalization, but the full experimental protocol (prompt templates, exact localization metric definition, and statistical tests for robustness) is not detailed. This makes it difficult to confirm that gains are stable to reasonable variations in evaluation setup.

    Authors: We agree that greater detail on the evaluation protocol will enhance reproducibility and allow readers to assess robustness. Although the current manuscript outlines the high-level setup, we will expand the Experiments section to include the complete prompt templates, a precise mathematical definition of the localization F-score (including how spans are matched and scored), and results from statistical significance tests (e.g., paired t-tests or bootstrap confidence intervals) across multiple runs. These additions will confirm that the reported gains remain stable under reasonable variations in prompting and evaluation choices. revision: yes
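The promised additions could be sketched as follows. This assumes localization is scored by exact match on sets of violating line numbers and that robustness is assessed with a paired percentile bootstrap; both are assumptions for illustration, since the manuscript's precise definitions are not given here. The function names and the per-example scores are hypothetical.

```python
import random


def line_f1(gold: set[int], pred: set[int]) -> float:
    """F1 between gold and predicted violating line numbers (exact match)."""
    if not gold and not pred:
        return 1.0  # assumed convention: nothing to find, nothing reported
    overlap = len(gold & pred)
    return 2 * overlap / (len(gold) + len(pred))


def bootstrap_ci(scores_a, scores_b, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for mean(scores_b) - mean(scores_a), paired."""
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("need two non-empty score lists of equal length")
    rng = random.Random(seed)
    diffs = [b - a for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    means = []
    for _ in range(n_resamples):
        # Resample examples with replacement, keeping each pair together.
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Hypothetical per-example localization scores for two systems.
baseline = [line_f1({3, 7}, {3}), line_f1({5}, set()), line_f1({2}, {2})]
improved = [line_f1({3, 7}, {3, 7}), line_f1({5}, {5}), line_f1({2}, {2})]
print(bootstrap_ci(baseline, improved))
```

An interval that excludes zero would support the claim that the gain is not an artifact of which examples happened to be sampled. Three examples are far too few for a meaningful interval; the call is shown only to illustrate the interface.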

Circularity Check

0 steps flagged

No circularity: empirical generalization on held-out human-curated benchmark

full rationale

The paper's central claim is an empirical performance delta (2.7x F-score gain on PEP-inspired hard practices) obtained by training solely on synthetic data from automatic linters and testing on a separately human-curated benchmark. No equations, fitted parameters, or self-citation chains are invoked to derive the generalization result; the methodology is a standard train-on-synthetic / evaluate-on-held-out setup whose outcome is not forced by construction from the training distribution. The derivation chain is therefore self-contained and externally falsifiable via the released benchmark.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are stated beyond standard supervised fine-tuning assumptions.

pith-pipeline@v0.9.0 · 5816 in / 1097 out tokens · 54775 ms · 2026-05-19T04:02:49.355939+00:00 · methodology


