arxiv: 2507.11687 · v4 · submitted 2025-07-15 · 💻 cs.SE · cs.CL · cs.LG

MetaLint: Easy-to-Hard Generalization for Code Linting

Atharva Naik, Lawanya Baghel, Dhakshin Govindarajan, Darsh Agrawal, Yiqing Xie, Daniel Fried, Carolyn Rose

Pith reviewed 2026-05-19 04:02 UTC · model grok-4.3

keywords: code linting, meta-learning, generalization, best practices, synthetic data, instruction following, Python Enhancement Proposals, F-score

The pith

Models trained on synthetic data from simple linters generalize to detect harder, context-dependent code best practices.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents MetaLint, which trains language models for code linting by framing the task as following natural-language instructions about best practices. Models learn from easy examples generated by automatic linters and then transfer to harder rules for which no automatic tool exists. On a new benchmark of hard practices drawn from Python Enhancement Proposals, this yields large gains: Qwen3-4B raises its detection F-score 2.7-fold (25.9% to 70.4%) and matches larger models such as o3-mini on localization.

Core claim

MetaLint formulates code linting as an instruction-following task where the model checks if code follows a provided natural language specification of best practices. Trained only on synthetic data from automatic linters, it generalizes to a human-curated set of harder best practices inspired by PEPs, achieving a 2.7x detection F-score gain from 25.9% to 70.4% for Qwen3-4B, along with 26.7% localization F-score.
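
The headline multiplier follows directly from the two detection F-scores quoted above. A quick arithmetic check (variable names are illustrative, not taken from the paper):

```python
# Detection F-scores for Qwen3-4B as quoted in the core claim (in percent).
baseline_f1 = 25.9
metalint_f1 = 70.4

relative_gain = metalint_f1 / baseline_f1   # multiplicative improvement
absolute_gain = metalint_f1 - baseline_f1   # improvement in percentage points

print(f"relative gain: {relative_gain:.2f}x")         # relative gain: 2.72x
print(f"absolute gain: {absolute_gain:.1f} points")   # absolute gain: 44.5 points
```

The ratio rounds to the reported 2.7x; the absolute gain is 44.5 percentage points.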

What carries the argument

The instruction-following formulation of code linting against a variable natural language specification of best practices, which allows test-time control and generalization without retraining on specific rules.
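
A minimal sketch of what "linting against a variable natural-language specification" could look like at the prompt level. The prompt wording, the function names `number_lines` and `build_lint_prompt`, and the example rules are all assumptions made for illustration; the paper's actual templates are not shown on this page.

```python
def number_lines(code: str) -> str:
    """Prefix each source line with a 1-based line number."""
    lines = code.splitlines()
    width = len(str(len(lines)))
    return "\n".join(
        f"{i:>{width}} | {line}" for i, line in enumerate(lines, start=1)
    )


def build_lint_prompt(spec: str, code: str) -> str:
    """Combine a best-practice description with the code to be checked.

    Hypothetical template: only `spec` changes when a new rule is enforced.
    """
    return (
        "Check the code below against the following best practice.\n\n"
        f"Best practice:\n{spec}\n\n"
        f"Code:\n{number_lines(code)}\n\n"
        "List the violating line numbers, or answer NO VIOLATIONS FOUND."
    )


snippet = "import random\ntoken = random.random()\n"

# Two illustrative rules; swapping them needs no retraining, only a new prompt.
spec_a = "Use the secrets module, not random, for security-sensitive values."
spec_b = "Module-level constants should be written in UPPER_CASE."

prompt_a = build_lint_prompt(spec_a, snippet)
prompt_b = build_lint_prompt(spec_b, snippet)
print(prompt_a)
```

The point of the design is that the rule lives in the input rather than in the model weights, which is what makes test-time control possible.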

If this is right

Performance gains hold across different programming languages, model families, and scales.
Models can enforce new or evolving best practices by simply changing the natural language description at test time.
Smaller models can match larger ones like o3-mini on localization tasks for these practices.
Releasing the code and benchmark enables further research on easy-to-hard generalization in code tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Similar meta-learning approaches could help models handle other evolving software engineering tasks beyond linting.
Reducing reliance on fixed linter sets might improve adaptability in AI coding assistants.
Testing on benchmarks with minimal overlap to training data would strengthen claims of true generalization.

Load-bearing premise

The gains on the human-curated benchmark come from learning generalizable patterns rather than from accidental overlap with the synthetic training data or from the specific prompts and metrics used.

What would settle it

Demonstrating that the hard best practices in the benchmark share substantial rules or examples with the synthetic linter data, or that changing the evaluation prompts eliminates the performance improvement.
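
A first pass at the overlap check described above could be a crude lexical comparison between rule descriptions. The sketch below uses token-level Jaccard similarity as a stand-in for a proper semantic measure; the rule texts are hypothetical and the helper names are not from the paper.

```python
import re


def tokens(text: str) -> set[str]:
    """Lowercased word tokens of a rule description."""
    return set(re.findall(r"[a-z0-9_]+", text.lower()))


def jaccard(a: str, b: str) -> float:
    """Token-set Jaccard similarity in [0, 1]."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def nearest_training_rule(test_rule: str, training_rules: list[str]) -> tuple[float, str]:
    """Return (similarity, rule) for the most similar training rule."""
    return max(
        ((jaccard(test_rule, rule), rule) for rule in training_rules),
        key=lambda pair: pair[0],
    )


# Hypothetical rule descriptions, for illustration only.
training_rules = [
    "Do not use a bare except clause.",
    "Imports should be at the top of the file.",
]
test_rule = "Use the secrets module instead of random for tokens."

score, match = nearest_training_rule(test_rule, training_rules)
print(f"closest training rule: {match!r} (Jaccard {score:.2f})")
```

High nearest-neighbour scores across the hard benchmark would suggest linter-adjacent overlap; low scores would not prove independence, since lexical overlap misses paraphrases.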

Figures

Figures reproduced from arXiv: 2507.11687 by Atharva Naik, Carolyn Rose, Daniel Fried, Darsh Agrawal, Dhakshin Govindarajan, Lawanya Baghel, Yiqing Xie.

**Figure 1.** METALINT: (1) Synthetic data generation with linters/tools, (2) Supervised Instruction Fine-Tuning (SFT) on this data, and (3) Verifiable Reward Model derived from the linter.

Adjacent paper text (Section 4.1, Evaluation Metrics): We evaluate the LLM’s ability to detect idiom violations through two tasks: detection, which assesses whether a given idiom is violated in a code file, and localization, which evaluates whether t…

**Figure 2.** METALINT: Preference Optimization using reward model: (4) Rejection Sampling Direct Preference Optimization (RS-DPO), and (5) Rejection Sampling Supervised Fine-Tuning (RS-SFT).

Adjacent paper text: … violating line numbers. To handle potential class imbalance, we use macro-averaging across idioms and exclude NO VIOLATION as a class to penalize models that only predict NO VIOLATIONS FOUND (such models will score zero on all dete…

**Figure 3.** ID: In-Domain, NeT: Near Transfer, FaT: Far Transfer.

**Figure 4.** Distribution of comparative failures of the CoT M
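
The text excerpt attached to Figure 2 mentions macro-averaging across idioms with NO VIOLATION excluded as a class. The sketch below shows one way such a detection score could be computed. The record layout and function names are assumptions, and the paper's exact definition may differ.

```python
from collections import defaultdict


def f1(tp: int, fp: int, fn: int) -> float:
    """F1 for the 'violation' class; zero when nothing was correctly flagged."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_detection_f1(records) -> float:
    """records: iterable of (idiom, gold_violated, predicted_violated).

    Only the 'violation' class is scored, so a model that always answers
    NO VIOLATIONS FOUND gets zero for every idiom. Assumption: an idiom
    that appears only with true negatives also contributes an F1 of zero.
    """
    counts = defaultdict(lambda: [0, 0, 0])  # idiom -> [tp, fp, fn]
    for idiom, gold, pred in records:
        c = counts[idiom]
        if gold and pred:
            c[0] += 1
        elif pred and not gold:
            c[1] += 1
        elif gold and not pred:
            c[2] += 1
    if not counts:
        return 0.0
    return sum(f1(*c) for c in counts.values()) / len(counts)


# Hypothetical predictions over two idioms "A" and "B".
records = [
    ("A", True, True), ("A", True, False), ("A", False, False),
    ("B", True, True), ("B", False, True),
]
always_negative = [(idiom, gold, False) for idiom, gold, _ in records]

print(macro_detection_f1(records))          # both idioms score 2/3
print(macro_detection_f1(always_negative))  # 0.0
```

Macro-averaging gives each idiom equal weight regardless of how many files exercise it, which is why it is a common choice under class imbalance.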

Original abstract

Large language models excel at code generation but struggle with code linting, particularly in generalizing to unseen or evolving best practices beyond those observed during training. We introduce MetaLint, a meta-learning framework that formulates code linting as an instruction-following task, where a model evaluates whether code adheres to a natural language specification of best practices. In contrast to prior work that trains models to detect violations from a fixed set of best practices, MetaLint evaluates code against a provided natural language specification, enabling test-time control over which practices to enforce and generalization to unseen or evolving rules without retraining. We demonstrate that models trained solely on synthetic data generated from automatic linters still generalize to harder, context-dependent best practices for which such linters are not available. To evaluate generalization beyond such easy signals, we introduce a human-curated benchmark of hard best practices inspired by Python Enhancement Proposals (PEPs). On this benchmark, MetaLint substantially improves performance without explicit fine-tuning on target best practices and exhibits strong, easy-to-hard generalization. Qwen3-4B achieves a 2.7x detection F-score gain (25.9% -> 70.4%), the highest recall, and a 26.7% localization F-score, matching larger models such as o3-mini. These gains generalize across programming languages, model families, scales, reasoning settings, and linter sources. We release the code and benchmark to support reproducibility and future work.
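
To make "synthetic data generated from automatic linters" concrete, here is a toy illustration. It uses Python's standard `ast` module as a stand-in linter for one easy rule (bare `except:` clauses) and packages the result as a training record. The rule choice, the record fields, and the function names are assumptions for illustration, not the paper's pipeline.

```python
import ast
import json

# Natural-language description paired with the toy check below (illustrative).
SPEC = "Do not use a bare `except:` clause; name the exception types you handle."


def bare_except_lines(code: str) -> list[int]:
    """Line numbers of `except:` handlers that name no exception type."""
    tree = ast.parse(code)
    return sorted(
        node.lineno
        for node in ast.walk(tree)
        if isinstance(node, ast.ExceptHandler) and node.type is None
    )


def make_example(code: str) -> dict:
    """Build one (spec, code, label) record from the toy linter's output."""
    lines = bare_except_lines(code)
    return {
        "spec": SPEC,
        "code": code,
        "violating_lines": lines,
        "label": "VIOLATION" if lines else "NO VIOLATIONS FOUND",
    }


violating = "try:\n    risky()\nexcept:\n    pass\n"
clean = "try:\n    risky()\nexcept ValueError:\n    pass\n"

print(json.dumps(make_example(violating), indent=2))
print(make_example(clean)["label"])
```

Because the label comes from a deterministic check, records like this can be produced at scale without human annotation, which is what makes the "easy" side of easy-to-hard training cheap.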

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

Private letter to a colleague

MetaLint gets models trained on synthetic linter data to handle harder PEP-style rules with reported big F-score gains, but the generalization story needs a clean check that the test cases aren't just broader versions of the training patterns.

The main point is that they train on violations pulled from automatic linters, then test on a human-curated set of harder, context-heavy practices drawn from Python PEPs, and report a 2.7x F-score lift for a 4B model without any fine-tuning on the target rules. The setup treats linting as controllable instruction following rather than fixed-set detection, which lets the model take a new natural-language rule at test time. They also show the gains carry over to other languages and model families, and they release the benchmark and code, which is useful for follow-up work. That combination of synthetic training plus a held-out hard-practice benchmark is the concrete new piece relative to earlier fixed-rule linting papers.

The numbers are presented clearly enough in the abstract to suggest the effect is real under their protocol, and the cross-language and cross-model checks add some weight. Releasing artifacts makes the result easier to inspect and build on.

The soft spot is the distribution shift itself. The claim of easy-to-hard generalization requires that the human-curated rules sit outside the patterns already learnable from the automatic linter data. If naming conventions, import rules, or simple control-flow checks in the hard set overlap with what the synthetic examples already cover, the improvement could come from learning a wider but still linter-adjacent detector rather than true meta-instruction following. The abstract does not describe an explicit rule-by-rule overlap audit, so that detail will matter in the full methods.

This work is aimed at people who build or evaluate adaptable code-quality tools and at researchers studying instruction following for software tasks. A reader who wants a practical benchmark and a clear easy-to-hard experiment will find it worth their time. It is coherent enough on its own terms to deserve a serious referee rather than a desk reject, even if the overlap question needs tightening in revision.

Referee Report

2 major / 2 minor

Summary. The paper introduces MetaLint, a meta-learning framework that formulates code linting as an instruction-following task where models evaluate code against natural language best-practice specifications. Models are trained exclusively on synthetic data generated from automatic linters and evaluated for generalization on a new human-curated benchmark of harder, context-dependent practices inspired by Python Enhancement Proposals (PEPs). The authors report a 2.7x detection F-score gain for Qwen3-4B (25.9% to 70.4%), highest recall, 26.7% localization F-score matching larger models, and generalization across languages, model families, scales, and linter sources, with code and benchmark released.

Significance. If the reported easy-to-hard generalization holds after addressing potential distributional overlap, the work would be significant for enabling flexible, test-time controllable code linting without retraining on evolving rules. The open release of the benchmark and implementation is a clear strength supporting reproducibility.

major comments (2)

[Abstract] Abstract: The central claim of generalization to 'harder, context-dependent best practices for which such linters are not available' is load-bearing for the meta-instruction-following interpretation. Without an explicit overlap audit, rule-by-rule mapping, or similarity analysis between the automatic-linter rules used for synthetic training data and the human-curated PEP rules in the test benchmark, it remains possible that the 2.7x F-score gain reflects learning broader linter-adjacent patterns rather than true out-of-distribution meta-generalization.
[Evaluation] Evaluation / Experiments: The abstract reports concrete F-score numbers and cross-language generalization, but the full experimental protocol (prompt templates, exact localization metric definition, and statistical tests for robustness) is not detailed. This makes it difficult to confirm that gains are stable to reasonable variations in evaluation setup.
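
To make the second comment concrete, here is one plausible line-level definition of a localization score: precision, recall, and F1 over the sets of gold and predicted violating line numbers. This is an assumption offered for discussion; the manuscript's exact matching rule is precisely what the comment asks the authors to state.

```python
def localization_prf(gold_lines, predicted_lines) -> tuple[float, float, float]:
    """Precision, recall and F1 over sets of violating line numbers.

    Assumed exact-match scoring: a predicted line counts only if the same
    line number appears in the gold annotation.
    """
    gold, pred = set(gold_lines), set(predicted_lines)
    tp = len(gold & pred)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Hypothetical file: violations on lines 3 and 7, model flags 3, 8 and 9.
p, r, f = localization_prf([3, 7], [3, 8, 9])
print(f"precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```

Alternatives such as tolerance windows around each line or span-overlap matching would give different numbers for the same predictions, which is why the definition needs to be explicit.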

minor comments (2)

[Abstract] The abstract could more precisely define what constitutes 'easy' versus 'hard' signals in the benchmarks to clarify the easy-to-hard generalization narrative.
Consider adding a limitations section discussing potential failure modes, such as sensitivity to prompt phrasing or performance on very long code contexts.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback and the opportunity to clarify our work. We address each major comment below and will make targeted revisions to improve the manuscript.

Referee: [Abstract] Abstract: The central claim of generalization to 'harder, context-dependent best practices for which such linters are not available' is load-bearing for the meta-instruction-following interpretation. Without an explicit overlap audit, rule-by-rule mapping, or similarity analysis between the automatic-linter rules used for synthetic training data and the human-curated PEP rules in the test benchmark, it remains possible that the 2.7x F-score gain reflects learning broader linter-adjacent patterns rather than true out-of-distribution meta-generalization.

Authors: We appreciate the referee's emphasis on rigorously establishing the out-of-distribution nature of the generalization. The synthetic training data is generated exclusively from automatic linters that implement straightforward, checkable rules, whereas the human-curated benchmark consists of nuanced, context-dependent practices drawn from PEPs that lack direct automated implementations. To strengthen this distinction, we will add an explicit overlap analysis section that includes (1) a rule-by-rule mapping between training linter rules and test specifications where feasible and (2) semantic similarity metrics (e.g., embedding-based cosine similarity) between the natural-language rule descriptions. This addition will quantify any potential distributional overlap and further support the meta-generalization claim.

Revision: yes

Referee: [Evaluation] Evaluation / Experiments: The abstract reports concrete F-score numbers and cross-language generalization, but the full experimental protocol (prompt templates, exact localization metric definition, and statistical tests for robustness) is not detailed. This makes it difficult to confirm that gains are stable to reasonable variations in evaluation setup.

Authors: We agree that greater detail on the evaluation protocol will enhance reproducibility and allow readers to assess robustness. Although the current manuscript outlines the high-level setup, we will expand the Experiments section to include the complete prompt templates, a precise mathematical definition of the localization F-score (including how spans are matched and scored), and results from statistical significance tests (e.g., paired t-tests or bootstrap confidence intervals) across multiple runs. These additions will confirm that the reported gains remain stable under reasonable variations in prompting and evaluation choices.

Revision: yes
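
The bootstrap confidence intervals mentioned in the second response could be computed along the following lines. This is a generic paired bootstrap over evaluation examples using only the standard library; the function names, the binary detection framing, and the toy data are assumptions, not the authors' procedure.

```python
import random


def f1_from_pairs(pairs) -> float:
    """Binary F1 from (gold, predicted) boolean pairs."""
    tp = sum(1 for g, p in pairs if g and p)
    fp = sum(1 for g, p in pairs if p and not g)
    fn = sum(1 for g, p in pairs if g and not p)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def paired_bootstrap_ci(gold, pred_a, pred_b, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile CI for F1(model B) - F1(model A), resampling examples jointly."""
    rng = random.Random(seed)
    n = len(gold)
    diffs = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        f_a = f1_from_pairs([(gold[i], pred_a[i]) for i in idx])
        f_b = f1_from_pairs([(gold[i], pred_b[i]) for i in idx])
        diffs.append(f_b - f_a)
    diffs.sort()
    lower = diffs[int((alpha / 2) * n_resamples)]
    upper = diffs[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lower, upper


# Toy data: model B is perfect, model A never flags a violation.
gold = [True, False] * 20
pred_a = [False] * 40
pred_b = list(gold)

low, high = paired_bootstrap_ci(gold, pred_a, pred_b)
print(f"95% CI for F1 difference: [{low:.3f}, {high:.3f}]")
```

An interval that excludes zero would support the claim that the gain is not an artifact of which examples happened to be in the benchmark; it says nothing about sensitivity to prompt wording, which needs separate runs.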

Circularity Check

0 steps flagged

No circularity: empirical generalization on held-out human-curated benchmark

The paper's central claim is an empirical performance delta (2.7x F-score gain on PEP-inspired hard practices) obtained by training solely on synthetic data from automatic linters and testing on a separately human-curated benchmark. No equations, fitted parameters, or self-citation chains are invoked to derive the generalization result; the methodology is a standard train-on-synthetic / evaluate-on-held-out setup whose outcome is not forced by construction from the training distribution. The derivation chain is therefore self-contained and externally falsifiable via the released benchmark.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are stated beyond standard supervised fine-tuning assumptions.

pith-pipeline@v0.9.0 · 5816 in / 1097 out tokens · 54775 ms · 2026-05-19T04:02:49.355939+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: We introduce METALINT, a meta-learning framework that formulates code linting as an instruction-following task... models trained solely on synthetic data generated from automatic linters still generalize to harder, context-dependent best practices

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: training uses only synthetic data from automatic linters and testing uses human-curated hard practices

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
