Recognition: 2 theorem links · Lean Theorem
A Survey on Large Language Models for Code Generation
Pith reviewed 2026-05-13 20:13 UTC · model grok-4.3
The pith
A survey organizes recent work on large language models that generate code from natural language into a taxonomy with benchmark comparisons.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We introduce a taxonomy to categorize and discuss the recent developments in LLMs for code generation, covering aspects such as data curation, latest advances, performance evaluation, ethical implications, environmental impact, and real-world applications. In addition, we present a historical overview of the evolution of LLMs for code generation and offer an empirical comparison using the HumanEval, MBPP, and BigCodeBench benchmarks across various levels of difficulty and types of programming tasks to highlight the progressive enhancements in LLM capabilities for code generation. We identify critical challenges and promising opportunities regarding the gap between academia and practical development.
What carries the argument
The taxonomy that groups LLM code-generation work by data curation, model advances, performance evaluation, ethics, environmental impact, and applications.
If this is right
- Newer models demonstrate clear gains on benchmarks of increasing difficulty.
- A persistent gap exists between academic results and requirements for practical software development.
- Ethical and environmental factors must be weighed together with accuracy when deploying these models.
- Shared resources allow the community to update the overview as new models appear.
Where Pith is reading between the lines
- The taxonomy could be reused as a template for surveys on related tasks such as code repair or explanation.
- Emphasis on environmental impact may encourage development of smaller or more efficient training methods.
- Benchmark trends point toward the need for harder, more realistic programming challenges to continue progress.
Load-bearing premise
The papers chosen and the taxonomy used to group them capture the full state of the field without major selection bias.
What would settle it
A set of recent publications on large language models for code generation that cannot be placed in any category of the taxonomy or that show benchmark trends opposite to the reported improvements.
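A mechanical version of that settling test: tag each recent paper with taxonomy categories and flag any that match none. In this sketch the category names are the ones stated in the survey's core claim, while the papers and their tags are hypothetical placeholders, not real annotations:

```python
# Taxonomy categories as stated in the survey's core claim.
TAXONOMY = {"data curation", "latest advances", "performance evaluation",
            "ethical implications", "environmental impact", "real-world applications"}

# Hypothetical tagged papers; real tags would come from manual annotation.
papers = {
    "paper A": {"data curation", "performance evaluation"},
    "paper B": {"latest advances"},
    "paper C": {"quantum debugging"},  # tag outside the taxonomy
}

# Any paper whose tags share nothing with the taxonomy is evidence
# against the taxonomy's completeness.
uncategorized = {name for name, tags in papers.items() if not tags & TAXONOMY}
print(sorted(uncategorized))  # → ['paper C']
```

A non-empty result would be exactly the kind of counterexample described above.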
Original abstract
Large Language Models (LLMs) have garnered remarkable advancements across diverse code-related tasks, known as Code LLMs, particularly in code generation that generates source code with LLM from natural language descriptions. This burgeoning field has captured significant interest from both academic researchers and industry professionals due to its practical significance in software development, e.g., GitHub Copilot. Despite the active exploration of LLMs for a variety of code tasks, either from the perspective of natural language processing (NLP) or software engineering (SE) or both, there is a noticeable absence of a comprehensive and up-to-date literature review dedicated to LLM for code generation. In this survey, we aim to bridge this gap by providing a systematic literature review that serves as a valuable reference for researchers investigating the cutting-edge progress in LLMs for code generation. We introduce a taxonomy to categorize and discuss the recent developments in LLMs for code generation, covering aspects such as data curation, latest advances, performance evaluation, ethical implications, environmental impact, and real-world applications. In addition, we present a historical overview of the evolution of LLMs for code generation and offer an empirical comparison using the HumanEval, MBPP, and BigCodeBench benchmarks across various levels of difficulty and types of programming tasks to highlight the progressive enhancements in LLM capabilities for code generation. We identify critical challenges and promising opportunities regarding the gap between academia and practical development. Furthermore, we have established a dedicated resource GitHub page (https://github.com/juyongjiang/CodeLLMSurvey) to continuously document and disseminate the most recent advances in the field.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper is a survey on Large Language Models for code generation (Code LLMs). It introduces a taxonomy covering data curation, latest model advances, performance evaluation, ethical implications, environmental impact, and real-world applications; provides a historical overview of the field's evolution; presents empirical comparisons on HumanEval, MBPP, and BigCodeBench across difficulty levels and task types; identifies challenges and opportunities between academia and practice; and maintains a GitHub repository for ongoing updates.
Significance. If the literature coverage is comprehensive and unbiased, the survey would provide a valuable organizing reference for researchers at the NLP-SE intersection, with its multi-benchmark empirical section and explicit treatment of ethics/environmental factors offering practical utility beyond typical reviews. The living GitHub resource is a positive contribution for field currency.
major comments (1)
- [Introduction] Introduction and any dedicated methodology section: the manuscript positions itself as a 'systematic literature review' yet provides no reproducible protocol details (databases queried, exact search strings, date cutoffs, inclusion/exclusion criteria, or counts of papers screened versus included). This directly affects the defensibility of the taxonomy's completeness and the claim that it represents 'cutting-edge progress' without selection bias.
minor comments (2)
- [Taxonomy] Taxonomy section: ensure each top-level category (e.g., data curation) is accompanied by explicit criteria or examples of how papers were assigned, to improve reproducibility of the categorization.
- [Performance Evaluation] Benchmark comparison section: clarify whether the reported results use the exact same prompting setup and decoding parameters across models, or note any variations that could affect cross-model comparisons.
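The cross-model comparability concern can be made concrete: HumanEval-style benchmarks typically report pass@k, whose unbiased estimator (introduced alongside HumanEval by Chen et al., reference [48]) depends on the number of samples n drawn per problem and on the sampling setup, so any variation in decoding parameters across models shifts the estimate. A minimal sketch:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator from Chen et al. (2021):
    n samples drawn per problem, c of which pass the tests."""
    if n - c < k:
        return 1.0  # fewer than k failures: every size-k draw contains a pass
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 200 samples per problem, 30 correct, estimate pass@10.
print(f"pass@10 ≈ {pass_at_k(200, 30, 10):.3f}")
```

Because the estimate is a function of n and of the decoding temperature that produced the c correct samples, reports should state both whenever models are compared.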
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our survey. We address the major comment below and will incorporate the suggested changes in the revised manuscript.
Point-by-point responses
-
Referee: [Introduction] Introduction and any dedicated methodology section: the manuscript positions itself as a 'systematic literature review' yet provides no reproducible protocol details (databases queried, exact search strings, date cutoffs, inclusion/exclusion criteria, or counts of papers screened versus included). This directly affects the defensibility of the taxonomy's completeness and the claim that it represents 'cutting-edge progress' without selection bias.
Authors: We acknowledge this is a valid observation. Although the survey was conducted systematically, the original manuscript omitted an explicit methodology section. In the revision, we will add a dedicated Section 2 (Methodology) that details: (1) databases queried (arXiv, Google Scholar, ACL Anthology, IEEE Xplore, and DBLP); (2) exact search strings (e.g., ("large language model" OR "LLM") AND ("code generation" OR "code synthesis" OR "program synthesis")); (3) date cutoff (papers published or posted up to May 2024); (4) inclusion criteria (peer-reviewed or preprint works primarily addressing LLMs for code generation tasks, with empirical results or novel methods); (5) exclusion criteria (non-English papers, duplicates, surveys without new analysis, or works focused solely on non-generation code tasks); and (6) screening statistics (initial retrieval count, papers screened after title/abstract review, and final included papers). This addition will strengthen reproducibility and mitigate concerns about selection bias while preserving the taxonomy and empirical comparisons.
Revision: yes
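The screening protocol promised in this response can be sketched as a simple filter over candidate records; the record fields, field names, and example papers below are illustrative assumptions, not the authors' actual pipeline:

```python
from dataclasses import dataclass

# Hypothetical record shape for a screened candidate paper.
@dataclass
class Record:
    title: str
    language: str               # publication language, e.g. "en"
    is_duplicate: bool
    is_survey_without_analysis: bool
    addresses_code_generation: bool
    year: int

def passes_screening(r: Record, cutoff_year: int = 2024) -> bool:
    """Apply the inclusion/exclusion rules listed in the rebuttal."""
    if r.language != "en" or r.is_duplicate or r.is_survey_without_analysis:
        return False
    if not r.addresses_code_generation:
        return False
    return r.year <= cutoff_year

records = [
    Record("Codex eval", "en", False, False, True, 2021),
    Record("Duplicate entry", "en", True, False, True, 2023),
    Record("Survey without new analysis", "en", False, True, True, 2022),
]
kept = [r.title for r in records if passes_screening(r)]
print(kept)  # → ['Codex eval']
```

Publishing the filter alongside the screening counts would make the promised statistics directly reproducible.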
Circularity Check
No circularity in survey structure or claims
Full rationale
This paper is a systematic literature review that introduces a taxonomy for organizing external research on LLMs for code generation and summarizes cited works across data curation, advances, evaluation, ethics, impact, and applications. It contains no original derivations, equations, predictions, fitted parameters, or self-referential claims that could reduce to the paper's own inputs by construction. The central assertions rest on the selection and synthesis of prior literature rather than any internal mathematical or predictive chain, rendering circularity analysis inapplicable.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
-
IndisputableMonolith.Foundation.DAlembert.Inevitabilitybilinear_family_forced · unclear · "In addition, we present a historical overview of the evolution of LLMs for code generation and offer an empirical comparison using the HumanEval, MBPP, and BigCodeBench benchmarks"
Forward citations
Cited by 26 Pith papers
-
Evaluating Non-English Developer Support in Machine Learning for Software Engineering
Code LLMs generate substantially worse comments outside English, and no tested automatic metric or LLM judge reliably matches human assessment of those outputs.
-
BEAM: Bi-level Memory-adaptive Algorithmic Evolution for LLM-Powered Heuristic Design
BEAM reformulates LLM-based heuristic design as bi-level optimization using GA for structures, MCTS for placeholders, and adaptive memory to outperform prior single-layer methods on CVRP and MIS tasks.
-
Evaluating LLMs Code Reasoning Under Real-World Context
R2Eval is a new benchmark with 135 real-world code reasoning problems from Python projects that preserves complex data structures for more realistic LLM evaluation.
-
AdverMCTS: Combating Pseudo-Correctness in Code Generation via Adversarial Monte Carlo Tree Search
AdverMCTS frames code generation as a minimax game where an attacker evolves tests to expose flaws in solver-generated code, yielding more robust outputs than static-test baselines.
-
IoT-Brain: Grounding LLMs for Semantic-Spatial Sensor Scheduling
IoT-Brain uses a neuro-symbolic Spatial Trajectory Graph to ground LLMs for verifiable semantic-spatial sensor scheduling, achieving 37.6% higher task success with lower resource use on a campus-scale benchmark.
-
Evaluating LLM-Based 0-to-1 Software Generation in End-to-End CLI Tool Scenarios
A new benchmark for 0-to-1 CLI tool generation shows state-of-the-art LLMs achieve under 43% success rate with black-box equivalence testing against real oracles.
-
Evaluating the Formal Reasoning Capabilities of Large Language Models through Chomsky Hierarchy
LLMs display clear performance stratification on formal language tasks aligned with Chomsky hierarchy complexity levels, limited by severe efficiency barriers rather than absolute capability.
-
Beyond Resolution Rates: Behavioral Drivers of Coding Agent Success and Failure
Large-scale trajectory analysis of 19 coding agents on 500 tasks finds that LLM choice drives outcomes more than framework design and that context-gathering plus validation behaviors improve success beyond task diffic...
-
Automating Database-Native Function Code Synthesis with LLMs
DBCooker automates synthesis of database native functions via LLM-guided characterization, coding plans, hybrid filling, and progressive validation, delivering 34.55% higher accuracy than baselines on SQLite, PostgreS...
-
Tailored Prompts, Targeted Protection: Vulnerability-Specific LLM Analysis for Smart Contracts
An LLM framework with tailored prompts and a new dataset of 31,165 annotated instances achieves 0.92 positive recall and 0.85 negative recall for detecting 13 smart contract vulnerability categories.
-
DSIPA: Detecting LLM-Generated Texts via Sentiment-Invariant Patterns Divergence Analysis
DSIPA is a zero-shot black-box detector that uses sentiment distribution consistency and preservation metrics to identify LLM text, reporting up to 49.89% F1 gains over baselines across domains and models.
-
Meta-Aligner: Bidirectional Preference-Policy Optimization for Multi-Objective LLMs Alignment
Meta-Aligner introduces a meta-learner network that produces dynamic preference weights to enable bidirectional optimization between preferences and LLM policy responses for multi-objective alignment.
-
RealBench: A Repo-Level Code Generation Benchmark Aligned with Real-World Software Development Practices
RealBench is a new repo-level code generation benchmark that adds UML diagrams to natural language specs, showing LLMs struggle more at full repositories, create modules with errors, and perform best with whole-repo g...
-
A Metamorphic Testing Approach to Diagnosing Memorization in LLM-Based Program Repair
Metamorphic testing on Defects4J and GitBug-Java reveals substantial performance drops in seven LLMs that correlate with NLL, indicating data leakage in LLM-based program repair.
-
Understanding the Mechanism of Altruism in Large Language Models
A small set of sparse autoencoder features in LLMs drives shifts between generous and selfish allocations in dictator games, with causal patching and steering confirming their role and generalization to other social games.
-
Automating Structural Analysis Across Multiple Software Platforms Using Large Language Models
A two-stage multi-agent LLM converts structural inputs to JSON then platform-specific scripts for ETABS, SAP2000, and OpenSees, achieving over 90% accuracy on 20 frame problems across ten trials.
-
Does Pass Rate Tell the Whole Story? Evaluating Design Constraint Compliance in LLM-based Issue Resolution
LLM agents resolve fewer than half of issues while satisfying design constraints despite passing tests, as shown by a benchmark of 495 issues and 1787 constraints from six repositories.
-
Compiling Code LLMs into Lightweight Executables
Ditto quantizes Code LLMs with K-Means codebooks and compiles inference via LLVM-BLAS replacement to deliver up to 10.5x faster, 6.4x smaller, and 10.5x lower-energy execution on commodity hardware while losing only 0...
-
Video models are zero-shot learners and reasoners
Generative video models exhibit emergent zero-shot capabilities across perception, manipulation, and basic reasoning tasks.
-
Mono2Sls: Automated Monolith-to-Serverless Migration via Multi-Stage Pipeline with Static Analysis
Mono2Sls automates monolith-to-serverless migration with static analysis and multi-stage LLM agents, achieving 100% deployment success and 66.1% end-to-end correctness on six benchmarks.
-
HieraSparse: Hierarchical Semi-Structured Sparse KV Attention
HieraSparse delivers a hierarchical semi-structured sparse KV attention system that achieves 1.2x KV compression and 4.57x decode attention speedup versus prior unstructured sparsity methods at equivalent sparsity, pl...
-
Adapt to Thrive! Adaptive Power-Mean Policy Optimization for Improved LLM Reasoning
APMPO boosts average Pass@1 scores on math reasoning benchmarks by 3 points over GRPO by using an adaptive power-mean policy objective and feedback-driven clipping bounds in RLVR training.
-
Free Energy-Driven Reinforcement Learning with Adaptive Advantage Shaping for Unsupervised Reasoning in LLMs
FREIA applies free energy principles and adaptive advantage shaping to unsupervised RL, outperforming baselines by 0.5-3.5 Pass@1 points on math reasoning with a 1.5B model.
-
Like a Hammer, It Can Build, It Can Break: Large Language Model Uses, Perceptions, and Adoption in Cybersecurity Operations on Reddit
Security practitioners use LLMs independently for low-risk productivity tasks while showing interest in enterprise platforms, but reliability, verification needs, and security risks limit broader autonomy.
-
Combining Static Code Analysis and Large Language Models Improves Correctness and Performance of Algorithm Recognition
Hybrid LLM plus static analysis for algorithm recognition in code cuts required model calls by 72-97% and lifts F1-scores by as much as 12 points.
-
OpenClassGen: A Large-Scale Corpus of Real-World Python Classes for LLM Research
Reference graph
Works this paper leans on
-
[1]
AgentGPT: Assemble, configure, and deploy autonomous AI Agents in your browser
2023. AgentGPT: Assemble, configure, and deploy autonomous AI Agents in your browser. https://github.com/reworkd/AgentGPT
work page 2023
-
[2]
AutoGPT is the vision of accessible AI for everyone, to use and to build on
2023. AutoGPT is the vision of accessible AI for everyone, to use and to build on. https://github.com/Significant-Gravitas/AutoGPT
work page 2023
-
[3]
2023. BabyAGI. https://github.com/yoheinakajima/babyagi
work page 2023
-
[4]
Marah Abdin, Sam Ade Jacobs, Ammar Ahmad Awan, Jyoti Aneja, Ahmed Awadallah, Hany Awadalla, Nguyen Bach, Amit Bahree, Arash Bakhtiari, Harkirat Behl, et al. 2024. Phi-3 technical report: A highly capable language model locally on your phone. arXiv preprint arXiv:2404.14219 (2024)
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[5]
Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. 2023. Gpt-4 technical report. arXiv preprint arXiv:2303.08774 (2023)
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[6]
Wasi Ahmad, Saikat Chakraborty, Baishakhi Ray, and Kai-Wei Chang. 2020. A Transformer-based Approach for Source Code Summarization. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics . 4998–5007
work page 2020
- [7]
-
[8]
Ali Al-Kaswan, Maliheh Izadi, and Arie Van Deursen. 2024. Traces of memorisation in large language models for code. In Proceedings of the IEEE/ACM 46th International Conference on Software Engineering . 1–12
work page 2024
- [9]
-
[10]
Miltiadis Allamanis and Charles Sutton. 2014. Mining idioms from source code. In Proceedings of the 22nd acm sigsoft international symposium on foundations of software engineering . 472–483
work page 2014
-
[11]
Google DeepMind AlphaCode Team. 2023. AlphaCode 2 Technical Report. https://storage.googleapis.com/deepmind- media/AlphaCode2/AlphaCode2_Tech_Report.pdf
work page 2023
-
[12]
Amazon. 2022. What is CodeWhisperer? https://docs.aws.amazon.com/codewhisperer/latest/userguide/what-is-cwspr.html
work page 2022
-
[13]
Aida Amini, Saadia Gabriel, Shanchuan Lin, Rik Koncel-Kedziorski, Yejin Choi, and Hannaneh Hajishirzi. 2019. MathQA: Towards Interpretable Math Word Problem Solving with Operation-Based Formalisms. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Lo...
work page 2019
-
[14]
Anthropic. 2024. The Claude 3 Model Family: Opus, Sonnet, Haiku. https://www-cdn.anthropic.com/de8ba9b01c9ab7cbabf5c33b80b7bbc618857627/Model_Card_Claude_3.pdf
work page 2024
-
[15]
Luca Ardito, Riccardo Coppola, Luca Barbato, and Diego Verga. 2020. A tool-based perspective on software code maintainability metrics: a systematic literature review. Scientific Programming 2020 (2020), 1–26
work page 2020
- [16]
-
[17]
Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, et al. 2021. Program synthesis with large language models.arXiv preprint arXiv:2108.07732 (2021)
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[18]
Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. 2016. Layer normalization. arXiv preprint arXiv:1607.06450 (2016)
work page internal anchor Pith review Pith/arXiv arXiv 2016
- [19]
-
[20]
Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, et al. 2023. Qwen technical report. arXiv preprint arXiv:2309.16609 (2023)
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[21]
Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, et al. 2022. Constitutional ai: Harmlessness from ai feedback. arXiv preprint arXiv:2212.08073 (2022)
work page internal anchor Pith review Pith/arXiv arXiv 2022
- [22]
-
[23]
Satanjeev Banerjee and Alon Lavie. 2005. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the acl workshop on intrinsic and extrinsic evaluation measures for machine translation and/or summarization. 65–72
work page 2005
-
[24]
Enrico Barbierato, Marco L Della Vedova, Daniele Tessera, Daniele Toti, and Nicola Vanoli. 2022. A methodology for controlling bias and fairness in synthetic data generation. Applied Sciences 12, 9 (2022), 4619
work page 2022
-
[25]
Shraddha Barke, Michael B James, and Nadia Polikarpova. 2023. Grounded copilot: How programmers interact with code-generating models. Proceedings of the ACM on Programming Languages 7, OOPSLA1 (2023), 85–111
work page 2023
-
[26]
Xiao Bi, Deli Chen, Guanting Chen, Shanhuang Chen, Damai Dai, Chengqi Deng, Honghui Ding, Kai Dong, Qiushi Du, Zhe Fu, et al . 2024. Deepseek llm: Scaling open-source language models with longtermism. arXiv preprint arXiv:2401.02954 (2024)
work page internal anchor Pith review Pith/arXiv arXiv 2024
- [27]
-
[28]
Christian Bird, Denae Ford, Thomas Zimmermann, Nicole Forsgren, Eirini Kalliamvakou, Travis Lowdermilk, and Idan Gazit. 2022. Taking Flight with Copilot: Early insights and opportunities of AI-powered pair-programming tools. Queue 20, 6 (2022), 35–57
work page 2022
- [29]
-
[30]
Sid Black, Leo Gao, Phil Wang, Connor Leahy, and Stella Biderman. 2021. GPT-Neo: Large Scale Autoregressive Language Modeling with Mesh-Tensorflow. https://doi.org/10.5281/zenodo.5297715
-
[31]
Veronika Bogina, Alan Hartman, Tsvi Kuflik, and Avital Shulner-Tal. 2022. Educating software and AI stakeholders about algorithmic fairness, accountability, transparency and ethics. International Journal of Artificial Intelligence in Education (2022), 1–26
work page 2022
-
[32]
Rishi Bommasani, Drew A Hudson, Ehsan Adeli, Russ Altman, Simran Arora, Sydney von Arx, Michael S Bernstein, Jeannette Bohg, Antoine Bosselut, Emma Brunskill, et al. 2021. On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258 (2021)
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[33]
Tom B Brown. 2020. Language models are few-shot learners. arXiv preprint arXiv:2005.14165 (2020)
work page internal anchor Pith review Pith/arXiv arXiv 2020
-
[34]
Raymond PL Buse and Westley R Weimer. 2009. Learning a metric for code readability. IEEE Transactions on software engineering 36, 4 (2009), 546–558
work page 2009
- [35]
-
[36]
Weilin Cai, Juyong Jiang, Fan Wang, Jing Tang, Sunghun Kim, and Jiayi Huang. 2024. A survey on mixture of experts. arXiv preprint arXiv:2407.06204 (2024)
-
[37]
Nicholas Carlini, Florian Tramer, Eric Wallace, Matthew Jagielski, Ariel Herbert-Voss, Katherine Lee, Adam Roberts, Tom Brown, Dawn Song, Ulfar Erlingsson, et al. 2021. Extracting training data from large language models. In 30th USENIX Security Symposium (USENIX Security 21) . 2633–2650
work page 2021
-
[38]
Federico Cassano, John Gouwar, Francesca Lucchetti, Claire Schlesinger, Carolyn Jane Anderson, Michael Greenberg, Abhinav Jangda, and Arjun Guha. 2023. Knowledge Transfer from High-Resource to Low-Resource Programming Languages for Code LLMs. arXiv preprint arXiv:2308.09895 (2023)
-
[39]
Federico Cassano, John Gouwar, Daniel Nguyen, Sydney Nguyen, Luna Phipps-Costin, Donald Pinckney, Ming-Ho Yee, Yangtian Zi, Carolyn Jane Anderson, Molly Q Feldman, et al . 2022. A scalable and extensible approach to benchmarking nl2code for 18 programming languages. arXiv preprint arXiv:2208.08227 (2022)
- [40]
- [41]
-
[42]
Yupeng Chang, Xu Wang, Jindong Wang, Yuan Wu, Linyi Yang, Kaijie Zhu, Hao Chen, Xiaoyuan Yi, Cunxiang Wang, Yidong Wang, et al. 2024. A survey on evaluation of large language models. ACM Transactions on Intelligent Systems and Technology 15, 3 (2024), 1–45
work page 2024
-
[43]
Sahil Chaudhary. 2023. Code Alpaca: An Instruction-following LLaMA model for code generation. https://github.com/sahil280114/codealpaca
work page 2023
-
[44]
Binger Chen and Ziawasch Abedjan. 2023. DUETCS: Code Style Transfer through Generation and Retrieval. In 2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE) . IEEE, 2362–2373
work page 2023
- [45]
-
[46]
Fuxiang Chen, Fatemeh H Fard, David Lo, and Timofey Bryksin. 2022. On the transferability of pre-trained language models for low-resource programming languages. In Proceedings of the 30th IEEE/ACM International Conference on Program Comprehension. 401–412
work page 2022
-
[47]
Jiawei Chen, Hongyu Lin, Xianpei Han, and Le Sun. 2024. Benchmarking large language models in retrieval-augmented generation. In Proceedings of the AAAI Conference on Artificial Intelligence , Vol. 38. 17754–17762
work page 2024
-
[48]
Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. 2021. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374 (2021)
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[49]
Stanley F Chen, Douglas Beeferman, and Roni Rosenfeld. 1998. Evaluation metrics for language models. (1998)
work page 1998
-
[50]
Wenhu Chen, Xueguang Ma, Xinyi Wang, and William W Cohen. 2022. Program of thoughts prompting: Disentangling computation from reasoning for numerical reasoning tasks. arXiv preprint arXiv:2211.12588 (2022)
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[51]
Xinyun Chen, Maxwell Lin, Nathanael Schärli, and Denny Zhou. 2023. Teaching large language models to self-debug. arXiv preprint arXiv:2304.05128 (2023)
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[52]
Xinyun Chen, Chang Liu, and Dawn Song. 2018. Tree-to-tree neural networks for program translation. Advances in neural information processing systems 31 (2018)
work page 2018
-
[53]
Andrew A Chien, Liuzixuan Lin, Hai Nguyen, Varsha Rao, Tristan Sharma, and Rajini Wijayawardana. 2023. Reducing the Carbon Impact of Generative AI Inference (today and in 2035). In Proceedings of the 2nd workshop on sustainable computer systems. 1–7
work page 2023
-
[54]
Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. 2023. Palm: Scaling language modeling with pathways. Journal of Machine Learning Research 24, 240 (2023), 1–113
work page 2023
- [55]
-
[56]
Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Yunxuan Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, et al . 2024. Scaling instruction-finetuned language models. Journal of Machine Learning Research 25, 70 (2024), 1–53
work page 2024
- [57]
-
[58]
Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. 2021. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168 (2021)
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[59]
CodeGemma Team, Ale Jakse Hartman, Andrea Hu, Christopher A. Choquette-Choo, Heri Zhao, Jane Fine, Jeffrey Hui, Jingyue Shen, Joe Kelley, Joshua Howland, Kshitij Bansal, Luke Vilnis, Mateo Wirth, Nam Nguyen, Paul Michel, et al.
work page
-
[60]
Codeium. 2023. Free, ultrafast Copilot alternative for Vim and Neovim. https://github.com/Exafunction/codeium.vim
work page 2023
-
[61]
Cognition. 2024. Introducing Devin, the first AI software engineer. https://www.cognition.ai/introducing-devin
work page 2024
-
[62]
Trevor Cohn, Phil Blunsom, and Sharon Goldwater. 2010. Inducing tree-substitution grammars. The Journal of Machine Learning Research 11 (2010), 3053–3096
work page 2010
-
[63]
Cognitive Computations. 2023. oa_leet10k. https://huggingface.co/datasets/cognitivecomputations/oa_leet10k
work page 2023
-
[64]
Leonardo De Moura and Nikolaj Bjørner. 2008. Z3: An efficient SMT solver. In International conference on Tools and Algorithms for the Construction and Analysis of Systems . Springer, 337–340
work page 2008
-
[65]
Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. 2024. Qlora: Efficient finetuning of quantized llms. Advances in Neural Information Processing Systems 36 (2024)
work page 2024
-
[66]
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018)
work page internal anchor Pith review Pith/arXiv arXiv 2018
- [67]
-
[68]
Yangruibo Ding, Zijian Wang, Wasi Ahmad, Hantian Ding, Ming Tan, Nihal Jain, Murali Krishna Ramanathan, Ramesh Nallapati, Parminder Bhatia, Dan Roth, et al. 2024. Crosscodeeval: A diverse and multilingual benchmark for cross-file code completion. Advances in Neural Information Processing Systems 36 (2024)
work page 2024
- [69]
-
[70]
Qingxiu Dong, Lei Li, Damai Dai, Ce Zheng, Zhiyong Wu, Baobao Chang, Xu Sun, Jingjing Xu, and Zhifang Sui. 2022. A survey on in-context learning. arXiv preprint arXiv:2301.00234 (2022)
work page internal anchor Pith review Pith/arXiv arXiv 2022
- [71]
-
[72]
Xueying Du, Mingwei Liu, Kaixin Wang, Hanlin Wang, Junwei Liu, Yixuan Chen, Jiayi Feng, Chaofeng Sha, Xin Peng, and Yiling Lou. 2024. Evaluating large language models in class-level code generation. In Proceedings of the IEEE/ACM 46th International Conference on Software Engineering . 1–13
work page 2024
-
[73]
Hugging Face. 2023. Training CodeParrot from Scratch. https://github.com/huggingface/blog/blob/main/codeparrot.md
work page 2023
-
[74]
Angela Fan, Beliz Gokkaya, Mark Harman, Mitya Lyubarskiy, Shubho Sengupta, Shin Yoo, and Jie M Zhang. 2023. Large language models for software engineering: Survey and open problems. In 2023 IEEE/ACM International Conference on Software Engineering: Future of Software Engineering (ICSE-FoSE) . IEEE, 31–53
work page 2023
-
[75]
Zhiyu Fan, Xiang Gao, Martin Mirchev, Abhik Roychoudhury, and Shin Hwei Tan. 2023. Automated repair of programs from large language models. In 2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE) . IEEE, 1469–1481
work page 2023
-
[76]
Zhangyin Feng, Daya Guo, Duyu Tang, Nan Duan, Xiaocheng Feng, Ming Gong, Linjun Shou, Bing Qin, Ting Liu, Daxin Jiang, et al. 2020. Codebert: A pre-trained model for programming and natural languages. arXiv preprint arXiv:2002.08155 (2020)
work page internal anchor Pith review arXiv 2020
- [77]
-
[78]
Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshima, et al. 2020. The pile: An 800gb dataset of diverse text for language modeling. arXiv preprint arXiv:2101.00027 (2020)
work page internal anchor Pith review Pith/arXiv arXiv 2020
-
[79]
Luyu Gao, Aman Madaan, Shuyan Zhou, Uri Alon, Pengfei Liu, Yiming Yang, Jamie Callan, and Graham Neubig. 2023. Pal: Program-aided language models. In International Conference on Machine Learning . PMLR, 10764–10799
work page 2023
-
[80]
Yunfan Gao, Yun Xiong, Xinyu Gao, Kangxiang Jia, Jinliu Pan, Yuxi Bi, Yi Dai, Jiawei Sun, and Haofen Wang. 2023. Retrieval-augmented generation for large language models: A survey. arXiv preprint arXiv:2312.10997 (2023)
work page internal anchor Pith review Pith/arXiv arXiv 2023