MetaLint: Easy-to-Hard Generalization for Code Linting
Pith reviewed 2026-05-19 04:02 UTC · model grok-4.3
The pith
Models trained on synthetic data from simple linters generalize to detecting violations of harder, context-dependent coding best practices.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
MetaLint formulates code linting as an instruction-following task: the model checks whether code follows a natural-language specification of best practices that is supplied at inference time. Trained only on synthetic data from automatic linters, it generalizes to a human-curated set of harder best practices inspired by PEPs. For Qwen3-4B, the detection F-score rises 2.7x, from 25.9% to 70.4%, and the localization F-score reaches 26.7%.
What carries the argument
The instruction-following formulation of code linting against a variable natural language specification of best practices, which allows test-time control and generalization without retraining on specific rules.
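A minimal sketch of what such a formulation could look like in practice, assuming a simple text prompt. The class `LintRequest`, the function `build_prompt`, and the prompt wording are hypothetical illustrations, not the paper's actual template:

```python
from dataclasses import dataclass


@dataclass
class LintRequest:
    """One linting query: a natural-language rule plus the code to check."""
    spec: str  # natural-language description of the best practice
    code: str  # source code to evaluate


def build_prompt(req: LintRequest) -> str:
    """Render a lint request as one instruction-following prompt.

    The wording and layout are assumptions made for illustration.
    """
    numbered = "\n".join(
        f"{i}: {line}" for i, line in enumerate(req.code.splitlines(), start=1)
    )
    return (
        "Check whether the code follows the best practice below.\n"
        f"Best practice: {req.spec}\n"
        "Code (with line numbers):\n"
        f"{numbered}\n"
        "List the line numbers that violate the practice, or answer NO VIOLATION."
    )


code = "import random\ntoken = random.random()"
# Same code, two different rules: only the specification changes at test time.
prompt_a = build_prompt(LintRequest("Use the secrets module for security tokens.", code))
prompt_b = build_prompt(LintRequest("Avoid wildcard imports.", code))
```

Because the rule is part of the input rather than baked into the model's output space, enforcing a different practice only requires a different `spec` string.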
If this is right
- Performance gains hold across different programming languages, model families, and scales.
- Models can enforce new or evolving best practices by simply changing the natural language description at test time.
- Smaller models can match larger ones like o3-mini on localization tasks for these practices.
- Releasing the code and benchmark enables further research on easy-to-hard generalization in code tasks.
Where Pith is reading between the lines
- Similar meta-learning approaches could help models handle other evolving software engineering tasks beyond linting.
- Reducing reliance on fixed linter sets might improve adaptability in AI coding assistants.
- Testing on benchmarks with minimal overlap to training data would strengthen claims of true generalization.
Load-bearing premise
The gains on the human-curated benchmark come from learning generalizable patterns rather than from accidental overlap with the synthetic training data or from the specific prompts and metrics used.
What would settle it
Demonstrating that the hard best practices in the benchmark share substantial rules or examples with the synthetic linter data, or that changing the evaluation prompts eliminates the performance improvement.
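One way such an overlap check could be sketched is to compare each benchmark rule description against every training rule description and report the closest match. The sketch below uses bag-of-words cosine similarity as a stand-in for a learned embedding model, and the rule strings are invented for illustration; neither comes from the paper:

```python
import math
import re
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Lowercased token counts; a crude substitute for a sentence embedding."""
    return Counter(re.findall(r"[a-z0-9_]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def max_overlap(train_rules: list[str], test_rules: list[str]) -> dict[str, float]:
    """For each test rule, the highest similarity to any training rule."""
    train_vecs = [bag_of_words(r) for r in train_rules]
    return {
        rule: max((cosine(bag_of_words(rule), tv) for tv in train_vecs), default=0.0)
        for rule in test_rules
    }


# Hypothetical rule descriptions, for illustration only.
train = ["unused import detected", "line too long"]
test = ["unused import detected", "use the secrets module for tokens"]
scores = max_overlap(train, test)
```

Test rules whose best score is close to 1 would be candidates for leakage from the training distribution; lexical similarity will miss paraphrases, so a real audit would likely also need a semantic measure and manual review.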
Original abstract
Large language models excel at code generation but struggle with code linting, particularly in generalizing to unseen or evolving best practices beyond those observed during training. We introduce MetaLint, a meta-learning framework that formulates code linting as an instruction-following task, where a model evaluates whether code adheres to a natural language specification of best practices. In contrast to prior work that trains models to detect violations from a fixed set of best practices, MetaLint evaluates code against a provided natural language specification, enabling test-time control over which practices to enforce and generalization to unseen or evolving rules without retraining. We demonstrate that models trained solely on synthetic data generated from automatic linters still generalize to harder, context-dependent best practices for which such linters are not available. To evaluate generalization beyond such easy signals, we introduce a human-curated benchmark of hard best practices inspired by Python Enhancement Proposals (PEPs). On this benchmark, MetaLint substantially improves performance without explicit fine-tuning on target best practices and exhibits strong, easy-to-hard generalization. Qwen3-4B achieves a 2.7x detection F-score gain (25.9% -> 70.4%), the highest recall, and a 26.7% localization F-score, matching larger models such as o3-mini. These gains generalize across programming languages, model families, scales, reasoning settings, and linter sources. We release the code and benchmark to support reproducibility and future work.
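To make the idea of synthetic training data from automatic linters concrete, here is a toy sketch. It assumes a single hand-written check (bare `except:` clauses, found with Python's `ast` module) in place of a real linter, and the rule text `SPEC` and the record layout are hypothetical rather than the paper's data format:

```python
import ast

# Hypothetical natural-language description of the rule the toy linter enforces.
SPEC = "Do not use a bare 'except:' clause; name the exception type."


def bare_except_lines(source: str) -> list[int]:
    """Toy linter: return the line numbers of bare `except:` handlers."""
    tree = ast.parse(source)
    return sorted(
        node.lineno
        for node in ast.walk(tree)
        if isinstance(node, ast.ExceptHandler) and node.type is None
    )


def make_example(source: str) -> dict:
    """Pair the rule description with the code and the linter-derived labels."""
    lines = bare_except_lines(source)
    return {
        "spec": SPEC,
        "code": source,
        "violation_lines": lines,  # localization target
        "label": bool(lines),      # detection target
    }


snippet = "try:\n    risky()\nexcept:\n    pass\n"
example = make_example(snippet)
```

Each record couples a rule description with code and automatically derived labels, so no human annotation is needed for the easy rules; the hard, context-dependent practices in the benchmark are precisely those for which no such checker exists.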
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces MetaLint, a meta-learning framework that formulates code linting as an instruction-following task where models evaluate code against natural language best-practice specifications. Models are trained exclusively on synthetic data generated from automatic linters and evaluated for generalization on a new human-curated benchmark of harder, context-dependent practices inspired by Python Enhancement Proposals (PEPs). The authors report a 2.7x detection F-score gain for Qwen3-4B (25.9% to 70.4%), highest recall, 26.7% localization F-score matching larger models, and generalization across languages, model families, scales, and linter sources, with code and benchmark released.
Significance. If the reported easy-to-hard generalization holds after addressing potential distributional overlap, the work would be significant for enabling flexible, test-time controllable code linting without retraining on evolving rules. The open release of the benchmark and implementation is a clear strength supporting reproducibility.
Major comments (2)
- [Abstract] The central claim of generalization to 'harder, context-dependent best practices for which such linters are not available' is load-bearing for the meta-instruction-following interpretation. Without an explicit overlap audit, rule-by-rule mapping, or similarity analysis between the automatic-linter rules used for synthetic training data and the human-curated PEP rules in the test benchmark, it remains possible that the 2.7x F-score gain reflects learning broader linter-adjacent patterns rather than true out-of-distribution meta-generalization.
- [Evaluation / Experiments] The abstract reports concrete F-score numbers and cross-language generalization, but the full experimental protocol (prompt templates, exact localization metric definition, and statistical tests for robustness) is not detailed. This makes it difficult to confirm that gains are stable to reasonable variations in evaluation setup.
Minor comments (2)
- [Abstract] The abstract could more precisely define what constitutes 'easy' versus 'hard' signals in the benchmarks to clarify the easy-to-hard generalization narrative.
- Consider adding a limitations section discussing potential failure modes, such as sensitivity to prompt phrasing or performance on very long code contexts.
Simulated Author's Rebuttal
We thank the referee for their constructive feedback and the opportunity to clarify our work. We address each major comment below and will make targeted revisions to improve the manuscript.
- Referee: [Abstract] The central claim of generalization to 'harder, context-dependent best practices for which such linters are not available' is load-bearing for the meta-instruction-following interpretation. Without an explicit overlap audit, rule-by-rule mapping, or similarity analysis between the automatic-linter rules used for synthetic training data and the human-curated PEP rules in the test benchmark, it remains possible that the 2.7x F-score gain reflects learning broader linter-adjacent patterns rather than true out-of-distribution meta-generalization.
  Authors: We appreciate the referee's emphasis on rigorously establishing the out-of-distribution nature of the generalization. The synthetic training data is generated exclusively from automatic linters that implement straightforward, checkable rules, whereas the human-curated benchmark consists of nuanced, context-dependent practices drawn from PEPs that lack direct automated implementations. To strengthen this distinction, we will add an explicit overlap analysis section that includes (1) a rule-by-rule mapping between training linter rules and test specifications where feasible and (2) semantic similarity metrics (e.g., embedding-based cosine similarity) between the natural-language rule descriptions. This addition will quantify any potential distributional overlap and further support the meta-generalization claim.
  Revision: yes
- Referee: [Evaluation / Experiments] The abstract reports concrete F-score numbers and cross-language generalization, but the full experimental protocol (prompt templates, exact localization metric definition, and statistical tests for robustness) is not detailed. This makes it difficult to confirm that gains are stable to reasonable variations in evaluation setup.
  Authors: We agree that greater detail on the evaluation protocol will enhance reproducibility and allow readers to assess robustness. Although the current manuscript outlines the high-level setup, we will expand the Experiments section to include the complete prompt templates, a precise mathematical definition of the localization F-score (including how spans are matched and scored), and results from statistical significance tests (e.g., paired t-tests or bootstrap confidence intervals) across multiple runs. These additions will confirm that the reported gains remain stable under reasonable variations in prompting and evaluation choices.
  Revision: yes
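The two quantities promised in this response, a localization F-score and a bootstrap confidence interval, could be sketched as follows. The matching rule (exact line-number agreement), the per-example averaging, and the example predictions are all assumptions made for illustration; the paper's actual metric definition is not given in the text above:

```python
import random


def line_f1(predicted: set[int], gold: set[int]) -> float:
    """F1 over flagged line numbers for one example (assumed: exact line match)."""
    if not predicted and not gold:
        return 1.0  # nothing to find and nothing flagged
    tp = len(predicted & gold)
    if tp == 0:
        return 0.0
    precision = tp / len(predicted)
    recall = tp / len(gold)
    return 2 * precision * recall / (precision + recall)


def bootstrap_ci(
    scores: list[float], n_resamples: int = 1000, alpha: float = 0.05, seed: int = 0
) -> tuple[float, float]:
    """Percentile bootstrap confidence interval for the mean score."""
    rng = random.Random(seed)
    n = len(scores)
    # Resample examples with replacement and record the mean of each resample.
    means = sorted(sum(rng.choices(scores, k=n)) / n for _ in range(n_resamples))
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Hypothetical predictions versus gold violation lines for three examples.
per_example = [line_f1({3, 4}, {3}), line_f1({7}, {7}), line_f1(set(), {2})]
low, high = bootstrap_ci(per_example)
```

Reporting the interval alongside the point estimate would show whether a gap between two models exceeds resampling noise; a span-based metric with partial-overlap credit would need a different matching function.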
Circularity Check
No circularity: empirical generalization on held-out human-curated benchmark
The paper's central claim is an empirical performance delta (2.7x F-score gain on PEP-inspired hard practices) obtained by training solely on synthetic data from automatic linters and testing on a separately human-curated benchmark. No equations, fitted parameters, or self-citation chains are invoked to derive the generalization result; the methodology is a standard train-on-synthetic / evaluate-on-held-out setup whose outcome is not forced by construction from the training distribution. The derivation chain is therefore self-contained and externally falsifiable via the released benchmark.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean: reality_from_one_distinction (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: We introduce METALINT, a meta-learning framework that formulates code linting as an instruction-following task... models trained solely on synthetic data generated from automatic linters still generalize to harder, context-dependent best practices
- IndisputableMonolith/Cost/FunctionalEquation.lean: washburn_uniqueness_aczel (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: training uses only synthetic data from automatic linters and testing uses human-curated hard practices
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.