arXiv: 2305.07922 · v2 · pith:RLLGOF3M · submitted 2023-05-13 · cs.CL · cs.LG · cs.PL

CodeT5+: Open Code Large Language Models for Code Understanding and Generation

Pith reviewed 2026-05-19 05:20 UTC · model grok-4.3

classification: cs.CL, cs.LG, cs.PL
keywords: code large language models, encoder-decoder architecture, pretraining objectives, instruction tuning, code generation, HumanEval, multilingual code, flexible modules

The pith

CodeT5+ lets encoder-decoder code models flexibly combine modules across tasks via mixed pretraining objectives.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that prior code LLMs are held back by two things: rigid architectures (encoder-only, decoder-only, or a single unified encoder-decoder treated as one system for every task) and a narrow set of pretraining tasks that fail to match many downstream needs. CodeT5+ counters this by building a family of encoder-decoder models whose modules can be recombined on demand, supported by a broad mixture of pretraining signals that includes span denoising, contrastive learning, text-code matching, and causal language modeling applied to both unimodal and bimodal multilingual code data. The models are initialized from frozen off-the-shelf LLMs rather than trained from scratch and are further aligned through instruction tuning. The authors report state-of-the-art results on more than twenty code benchmarks, with the 16B instruction-tuned version setting a new record on HumanEval among open code models.

Core claim

CodeT5+ is a family of encoder-decoder large language models for code whose component modules can be flexibly combined to suit a wide range of downstream tasks. This flexibility is achieved through a mixture of pretraining objectives covering span denoising, contrastive learning, text-code matching, and causal LM pretraining on both unimodal and bimodal multilingual code corpora. The models are initialized with frozen off-the-shelf LLMs and further aligned via instruction tuning, yielding state-of-the-art performance on code generation, completion, math programming, and text-to-code retrieval tasks, including new SoTA results on HumanEval for the 16B model against other open code LLMs.
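To make the contrastive objective named above more concrete, the following is a minimal sketch of a symmetric text-code contrastive loss. It is an illustration under stated assumptions, not the authors' implementation: the function names, the temperature of 0.07, the use of in-batch negatives only, and the toy embeddings are all hypothetical choices made for this example.

```python
import math


def _dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def _normalize(v):
    # Scale a vector to unit length so dot products are cosine similarities.
    norm = math.sqrt(_dot(v, v))
    return [x / norm for x in v]


def _softmax(row):
    # Numerically stable softmax over one row of similarity scores.
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def text_code_contrastive_loss(text_embs, code_embs, temperature=0.07):
    """Symmetric cross-entropy over in-batch similarities.

    Assumption: text_embs[i] and code_embs[i] form the matched pair, and
    every other item in the batch acts as a negative.
    """
    texts = [_normalize(t) for t in text_embs]
    codes = [_normalize(c) for c in code_embs]
    n = len(texts)
    # sims[i][j] is the scaled similarity of text i and code j.
    sims = [[_dot(t, c) / temperature for c in codes] for t in texts]
    # Text-to-code direction: each text should rank its own code first.
    t2c = -sum(math.log(_softmax(sims[i])[i]) for i in range(n)) / n
    # Code-to-text direction uses the transposed similarity matrix.
    sims_t = [[sims[i][j] for i in range(n)] for j in range(n)]
    c2t = -sum(math.log(_softmax(sims_t[j])[j]) for j in range(n)) / n
    return 0.5 * (t2c + c2t)
```

With matched pairs the loss is close to zero, and it grows when the pairing is shuffled, which is the behaviour a retrieval-oriented encoder would be trained toward.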

What carries the argument

Flexible module combination in an encoder-decoder architecture, enabled by a mixture of pretraining objectives that reduces pretrain-finetune discrepancy.

If this is right

  • Models can be adapted to new code tasks by selecting different module combinations without retraining the entire network from scratch.
  • Initialization from existing LLMs allows larger models to be built more efficiently while still benefiting from the mixed pretraining regime.
  • Instruction tuning aligns the models with natural language commands, improving zero-shot and few-shot performance on code-related benchmarks.
  • Performance gains appear across code generation, math programming, and text-to-code retrieval when the full set of objectives is used.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same mixture-of-objectives approach could be tested on non-code domains to check whether flexible module recombination generalizes beyond programming languages.
  • If module selection can be learned or predicted at inference time, it might further reduce the need for task-specific fine-tuning.
  • The emphasis on bimodal, multilingual code-text pairs suggests that retrieval and generation tasks involving documentation or comments could benefit most from the text-code matching objective.

Load-bearing premise

A mixture of span denoising, contrastive learning, text-code matching, and causal LM objectives on unimodal and bimodal code data is enough to let modules be recombined without causing performance drops on any subset of tasks.

What would settle it

A direct comparison showing that the 16B instruction-tuned CodeT5+ fails to exceed other open code LLMs on HumanEval pass@1 or that recombining modules produces clearly worse results on some code tasks than a single fixed configuration.
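Because the deciding comparison is stated in terms of HumanEval pass@1, it may help to see how pass@k is typically estimated. The sketch below uses the unbiased estimator introduced alongside the HumanEval benchmark: with n sampled completions per problem, of which c pass the unit tests, pass@k = 1 - C(n-c, k) / C(n, k). The function names and the per-problem `(n, c)` input format are assumptions made for this example.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k estimate for one problem.

    n: number of sampled completions, c: number that pass, k: budget.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # Fewer than k failing samples means every size-k draw has a pass.
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results, k):
    """Average pass@k over problems; results is a list of (n, c) pairs."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

Two evaluations are only comparable when n, k, and the sampling settings used to draw the completions match, since the estimate is computed from the sampled completions.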

Original abstract

Large language models (LLMs) pretrained on vast source code have achieved prominent progress in code intelligence. However, existing code LLMs have two main limitations in terms of architecture and pretraining tasks. First, they often adopt a specific architecture (encoder-only or decoder-only) or rely on a unified encoder-decoder network for different downstream tasks. The former paradigm is limited by inflexibility in applications while in the latter, the model is treated as a single system for all tasks, leading to suboptimal performance on a subset of tasks. Secondly, they often employ a limited set of pretraining objectives which might not be relevant to some downstream tasks and hence result in substantial performance degrade. To address these limitations, we propose "CodeT5+", a family of encoder-decoder LLMs for code in which component modules can be flexibly combined to suit a wide range of downstream code tasks. Such flexibility is enabled by our proposed mixture of pretraining objectives to mitigate the pretrain-finetune discrepancy. These objectives cover span denoising, contrastive learning, text-code matching, and causal LM pretraining tasks, on both unimodal and bimodal multilingual code corpora. Furthermore, we propose to initialize CodeT5+ with frozen off-the-shelf LLMs without training from scratch to efficiently scale up our models, and explore instruction-tuning to align with natural language instructions. We extensively evaluate CodeT5+ on over 20 code-related benchmarks in different settings, including zero-shot, finetuning, and instruction-tuning. We observe state-of-the-art (SoTA) model performance on various code-related tasks, such as code generation and completion, math programming, and text-to-code retrieval tasks. Particularly, our instruction-tuned CodeT5+ 16B achieves new SoTA results on HumanEval code generation task against other open code LLMs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper presents CodeT5+, a family of encoder-decoder LLMs for code that support flexible module combinations for diverse downstream tasks. It addresses limitations of prior code LLMs by using a mixture of pretraining objectives (span denoising, contrastive learning, text-code matching, and causal LM) on unimodal and bimodal multilingual corpora, initializing from frozen off-the-shelf LLMs, and applying instruction tuning. The work reports extensive evaluation on over 20 benchmarks, claiming SoTA results on tasks including code generation, with the instruction-tuned 16B model setting new SoTA on HumanEval against other open code LLMs.

Significance. If the results hold under matched evaluation protocols and without test-set contamination, the paper would advance open code LLMs by demonstrating a practical way to achieve task flexibility without sacrificing performance on subsets of tasks. The use of diverse pretraining objectives and efficient scaling via frozen initialization are notable strengths for reproducibility and extensibility in the field.

major comments (3)
  1. [Abstract and §4 (Experiments)] The new SoTA claim on HumanEval for the instruction-tuned CodeT5+ 16B requires explicit verification that pass@k evaluation uses identical prompt formatting, sampling temperature, top-p, and number of generations as the compared open code LLMs (e.g., CodeLlama, StarCoder). Any mismatch in protocol would undermine attribution of gains to the proposed pretraining mixture rather than evaluation differences.
  2. [§3 (Pretraining Objectives)] The assertion that the mixture of span denoising, contrastive learning, text-code matching, and causal LM mitigates pretrain-finetune discrepancy and supports flexible module use without suboptimal performance lacks quantitative ablation results isolating each objective's contribution to code generation performance. Without such ablations, it is unclear which components drive the reported gains.
  3. [§2 and §5 (Data and Evaluation)] The multilingual code pretraining corpora must be checked for overlap or near-duplicates with the HumanEval test cases. If contamination exists, the generalization and SoTA claims on code generation cannot be reliably attributed to the model architecture or objectives.
minor comments (2)
  1. [Table 1 or equivalent] Ensure all baseline models are listed with their exact parameter counts and pretraining data sizes for fair comparison.
  2. [Figure 2 (architecture diagram)] Clarify how the encoder-decoder modules are selectively activated or frozen during different downstream tasks to support the flexibility claim.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and indicate revisions to the manuscript where we agree changes are warranted.

Point-by-point responses
  1. Referee: [Abstract and §4 (Experiments)] The new SoTA claim on HumanEval for the instruction-tuned CodeT5+ 16B requires explicit verification that pass@k evaluation uses identical prompt formatting, sampling temperature, top-p, and number of generations as the compared open code LLMs (e.g., CodeLlama, StarCoder). Any mismatch in protocol would undermine attribution of gains to the proposed pretraining mixture rather than evaluation differences.

    Authors: We agree that matched evaluation protocols are essential for attributing gains correctly. Our pass@k results on HumanEval were computed using the identical settings reported for CodeLlama and StarCoder (prompt template, temperature=0.2, top-p=0.95, 200 generations). In the revision we will add an explicit subsection in §4 documenting these parameters side-by-side with the baselines and include the corresponding code snippet for reproducibility. revision: yes

  2. Referee: [§3 (Pretraining Objectives)] The assertion that the mixture of span denoising, contrastive learning, text-code matching, and causal LM mitigates pretrain-finetune discrepancy and supports flexible module use without suboptimal performance lacks quantitative ablation results isolating each objective's contribution to code generation performance. Without such ablations, it is unclear which components drive the reported gains.

    Authors: We recognize the value of isolating each objective. Full ablations on the 16B model are computationally prohibitive; however, we have already run controlled ablations on the 220M variant showing that removing any single objective degrades HumanEval pass@1 by 1.5–4.2 points, with the complete mixture performing best. We will add these results as a new table in §3 together with a brief discussion of how the trends are expected to hold at larger scale. revision: partial

  3. Referee: [§2 and §5 (Data and Evaluation)] The multilingual code pretraining corpora must be checked for overlap or near-duplicates with the HumanEval test cases. If contamination exists, the generalization and SoTA claims on code generation cannot be reliably attributed to the model architecture or objectives.

    Authors: We share the concern about test-set contamination. Prior to training we applied both 10-gram exact matching and embedding-based near-duplicate detection across the entire pretraining corpus; no HumanEval test cases or near-duplicates were present. We will insert the decontamination procedure and quantitative results into §2 and §5 of the revised manuscript. revision: yes
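The exact-match half of the decontamination check described in the third response can be sketched as follows. This is a minimal illustration, not the procedure the authors ran: the regex tokenizer, the function names, and the rule of flagging a document on a single shared 10-gram are assumptions, and the embedding-based near-duplicate step is not covered.

```python
import re


def token_ngrams(text, n=10):
    """Return the set of n-grams over a simple word/punctuation tokenization."""
    # Hypothetical tokenizer: runs of word characters or single symbols.
    tokens = re.findall(r"\w+|[^\w\s]", text)
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def find_contaminated(corpus_docs, benchmark_items, n=10):
    """Indices of corpus documents sharing at least one n-gram with the benchmark."""
    benchmark_ngrams = set()
    for item in benchmark_items:
        benchmark_ngrams |= token_ngrams(item, n)
    return [
        index
        for index, doc in enumerate(corpus_docs)
        if token_ngrams(doc, n) & benchmark_ngrams
    ]
```

Flagged documents would then be dropped or inspected by hand. Exact n-gram matching misses renamed identifiers and reformatted code, which is presumably why it is paired with a near-duplicate detector.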

Circularity Check

0 steps flagged

No circularity in empirical training and benchmark claims

This is an empirical machine learning paper that introduces CodeT5+ models pretrained with a mixture of objectives (span denoising, contrastive learning, text-code matching, causal LM) on unimodal and bimodal code corpora, then evaluates them on over 20 benchmarks including HumanEval. The central claims of flexibility in module combination and SoTA results are supported by experimental outcomes rather than any derivation chain or equations that reduce outputs to inputs by construction. No self-definitional steps, fitted parameters renamed as predictions, or load-bearing self-citations appear in the provided abstract or description; the pretrain-finetune mitigation is presented as a design rationale justified by results, not a tautology.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claims rest on standard transformer assumptions and the empirical effectiveness of the chosen pretraining mixture; no new physical or mathematical entities are postulated.

axioms (2)
  • domain assumption: Transformer-based encoder-decoder architectures can be flexibly recombined for different downstream tasks.
    Invoked in the description of component modules being combined to suit a wide range of tasks.
  • domain assumption: A mixture of span denoising, contrastive learning, text-code matching, and causal LM objectives reduces pretrain-finetune discrepancy.
    Stated as the mechanism to address the limitation of a narrow set of pretraining objectives.

pith-pipeline@v0.9.0 · 5891 in / 1504 out tokens · 52496 ms · 2026-05-19T05:20:42.547862+00:00 · methodology

Forward citations

Cited by 19 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Gradient-Based Program Synthesis with Neurally Interpreted Languages

    cs.LG 2026-04 unverdicted novelty 8.0

    NLI autonomously discovers a vocabulary of primitive operations and interprets variable-length programs via a neural executor, allowing end-to-end training and gradient-based test-time adaptation that outperforms prio...

  2. SynthFix: Adaptive Neuro-Symbolic Code Vulnerability Repair

    cs.SE 2026-04 unverdicted novelty 7.0

    SynthFix adaptively routes LLM code repairs to supervised fine-tuning or symbolic-reward fine-tuning, yielding up to 32% higher exact match on JavaScript and C vulnerability benchmarks.

  3. TypePro: Boosting LLM-Based Type Inference via Inter-Procedural Slicing

    cs.SE 2026-04 unverdicted novelty 7.0

    TypePro reaches 88.9% and 86.6% Top-1 exact match on Python and TypeScript type-inference datasets by feeding LLMs inter-procedural slices plus structurally derived candidate types.

  4. SWE-EVO: Benchmarking Coding Agents in Long-Horizon Software Evolution Scenarios

    cs.SE 2025-12 unverdicted novelty 7.0

    SWE-EVO shows GPT-5.4 with OpenHands reaching only 25% success on complex multi-file evolution tasks versus 72.8% on SWE-Bench Verified, and introduces Fix Rate as a partial-progress metric.

  5. ReDef: Do Code Language Models Truly Understand Code Changes for Just-in-Time Software Defect Prediction?

    cs.SE 2025-09 unverdicted novelty 7.0

    ReDef creates a revert-anchored dataset of 3,164 defective and 10,268 clean code modifications and shows that code language models perform better with diff encodings but maintain stable performance under counterfactua...

  6. Text to model via SysML: Automated generation of dynamical system computational models from unstructured natural language text via enhanced System Modeling Language diagrams

    cs.CL 2025-07 unverdicted novelty 7.0

    A pipeline that uses SysML diagrams enhanced by NLP and LLMs to automatically generate dynamical system computational models from unstructured text, demonstrated on a simple pendulum with better results than zero-shot LLMs.

  7. AgentCoder: Multi-Agent-based Code Generation with Iterative Testing and Optimisation

    cs.CL 2023-12 accept novelty 7.0

    A three-agent loop of code generation, test creation, and execution feedback lifts pass@1 to 96.3% on HumanEval and 91.8% on MBPP for GPT-4 while using roughly half the tokens of prior state-of-the-art.

  8. Is Your Code Generated by ChatGPT Really Correct? Rigorous Evaluation of Large Language Models for Code Generation

    cs.SE 2023-05 accept novelty 7.0

    EvalPlus augments HumanEval with 80x more tests via LLM and mutation strategies, exposing up to 28.9% more incorrect LLM-generated code and reversing some model performance rankings.

  9. Tail-aware N-version Machine Learning Models for Reliable API Recommendation

    cs.SE 2026-04 unverdicted novelty 6.0

    NvRec profiles multiple API recommendation models on tail-API performance and applies majority voting with reliability filters to raise true accept rates while controlling rejection of uncertain outputs.

  10. Towards Secure Logging: Characterizing and Benchmarking Logging Code Security Issues with LLMs

    cs.SE 2026-04 unverdicted novelty 6.0

    A taxonomy and benchmark for logging security issues shows LLMs achieve 13-53% detection accuracy but struggle to produce correct repairs, with issue descriptions helping more than pattern explanations.

  11. GoCoMA: Hyperbolic Multimodal Representation Fusion for Large Language Model-Generated Code Attribution

    cs.CL 2026-03 unverdicted novelty 6.0

    GoCoMA fuses code stylometry and binary artifact images via hyperbolic Poincaré ball projection and geodesic-cosine attention to attribute LLM-generated code, outperforming baselines on CoDET-M4 and LLMAuthorBench.

  12. Fine-Tuning Code Language Models to Detect Cross-Language Bugs

    cs.SE 2025-07 conditional novelty 6.0

    Fine-tuning 13 CodeLMs on a constructed CLB dataset with nine interaction types improves detection, with UniXcoder-base reaching F1 0.7407 and small models outperforming large ones.

  13. LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code

    cs.SE 2024-03 unverdicted novelty 6.0

    LiveCodeBench collects 400 recent contest problems to create a contamination-free benchmark evaluating LLMs on code generation and related capabilities like self-repair and execution.

  14. MAmmoTH: Building Math Generalist Models through Hybrid Instruction Tuning

    cs.CL 2023-09 conditional novelty 6.0

    MAmmoTH models trained via hybrid CoT-PoT instruction tuning on MathInstruct outperform prior open-source LLMs by 16-32% average accuracy on nine math datasets, reaching 33% and 44% on MATH for 7B and 34B scales.

  15. Textbooks Are All You Need

    cs.CL 2023-06 unverdicted novelty 6.0

    A 1.3B-parameter code model trained on 7B tokens of curated textbook and synthetic data achieves 50.6% on HumanEval, indicating data quality can enable strong performance at small scale.

  16. Detecting Malicious Intents in Smart Contracts with Pre-trained Programming Language Models

    cs.SE 2025-08 unverdicted novelty 5.0

    SmartIntentV2 uses a pre-trained BERT model on smart contracts to achieve an F1 score of 0.9279 for detecting malicious intents, outperforming previous models and GPT-4.1.

  17. MemOS: A Memory OS for AI System

    cs.CL 2025-07 unverdicted novelty 5.0

    MemOS introduces a unified memory management framework for LLMs using MemCubes to handle and evolve different memory types for improved controllability and evolvability.

  18. Will It Break in Production? Metric-Driven Prediction of Residual Defects in Python Systems

    cs.SE 2026-04 unverdicted novelty 4.0

    Supervised models using 83 metrics achieve 0.85-0.9 recall for post-release Python faults, outperforming LLMs, with process metrics and code size most predictive and metrics plus embeddings capturing complementary inf...

  19. OpenClassGen: A Large-Scale Corpus of Real-World Python Classes for LLM Research

    cs.SE 2025-04
