CodeT5+: Open Code Large Language Models for Code Understanding and Generation
Pith reviewed 2026-05-19 05:20 UTC · model grok-4.3
Add this Pith Number to your LaTeX paper
What is a Pith Number?\usepackage{pith}
\pithnumber{RLLGOF3M}
Prints a linked pith:RLLGOF3M badge after your title and writes the identifier into PDF metadata. Compiles on arXiv with no extra files. Learn more
The pith
CodeT5+ lets encoder-decoder code models flexibly combine modules across tasks via mixed pretraining objectives.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
CodeT5+ is a family of encoder-decoder large language models for code whose component modules can be flexibly combined to suit a wide range of downstream tasks. This flexibility is achieved through a mixture of pretraining objectives covering span denoising, contrastive learning, text-code matching, and causal LM pretraining on both unimodal and bimodal multilingual code corpora. The models are initialized with frozen off-the-shelf LLMs and further aligned via instruction tuning, yielding state-of-the-art performance on code generation, completion, math programming, and text-to-code retrieval tasks, including new SoTA results on HumanEval for the 16B model against other open code LLMs.
What carries the argument
Flexible module combination in an encoder-decoder architecture, enabled by a mixture of pretraining objectives that reduces pretrain-finetune discrepancy.
If this is right
- Models can be adapted to new code tasks by selecting different module combinations without retraining the entire network from scratch.
- Initialization from existing LLMs allows larger models to be built more efficiently while still benefiting from the mixed pretraining regime.
- Instruction tuning aligns the models with natural language commands, improving zero-shot and few-shot performance on code-related benchmarks.
- Performance gains appear across code generation, math programming, and text-to-code retrieval when the full set of objectives is used.
Where Pith is reading between the lines
- The same mixture-of-objectives approach could be tested on non-code domains to check whether flexible module recombination generalizes beyond programming languages.
- If module selection can be learned or predicted at inference time, it might further reduce the need for task-specific fine-tuning.
- The emphasis on bilingual and multilingual code-text pairs suggests that retrieval and generation tasks involving documentation or comments could benefit most from the text-code matching objective.
Load-bearing premise
A mixture of span denoising, contrastive learning, text-code matching, and causal LM objectives on unimodal and bimodal code data is enough to let modules be recombined without causing performance drops on any subset of tasks.
What would settle it
A direct comparison showing that the 16B instruction-tuned CodeT5+ fails to exceed other open code LLMs on HumanEval pass@1 or that recombining modules produces clearly worse results on some code tasks than a single fixed configuration.
read the original abstract
Large language models (LLMs) pretrained on vast source code have achieved prominent progress in code intelligence. However, existing code LLMs have two main limitations in terms of architecture and pretraining tasks. First, they often adopt a specific architecture (encoder-only or decoder-only) or rely on a unified encoder-decoder network for different downstream tasks. The former paradigm is limited by inflexibility in applications while in the latter, the model is treated as a single system for all tasks, leading to suboptimal performance on a subset of tasks. Secondly, they often employ a limited set of pretraining objectives which might not be relevant to some downstream tasks and hence result in substantial performance degrade. To address these limitations, we propose ``CodeT5+'', a family of encoder-decoder LLMs for code in which component modules can be flexibly combined to suit a wide range of downstream code tasks. Such flexibility is enabled by our proposed mixture of pretraining objectives to mitigate the pretrain-finetune discrepancy. These objectives cover span denoising, contrastive learning, text-code matching, and causal LM pretraining tasks, on both unimodal and bimodal multilingual code corpora. Furthermore, we propose to initialize CodeT5+ with frozen off-the-shelf LLMs without training from scratch to efficiently scale up our models, and explore instruction-tuning to align with natural language instructions. We extensively evaluate CodeT5+ on over 20 code-related benchmarks in different settings, including zero-shot, finetuning, and instruction-tuning. We observe state-of-the-art (SoTA) model performance on various code-related tasks, such as code generation and completion, math programming, and text-to-code retrieval tasks. Particularly, our instruction-tuned CodeT5+ 16B achieves new SoTA results on HumanEval code generation task against other open code LLMs.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents CodeT5+, a family of encoder-decoder LLMs for code that support flexible module combinations for diverse downstream tasks. It addresses limitations of prior code LLMs by using a mixture of pretraining objectives (span denoising, contrastive learning, text-code matching, and causal LM) on unimodal and bimodal multilingual corpora, initializing from frozen off-the-shelf LLMs, and applying instruction tuning. The work reports extensive evaluation on over 20 benchmarks, claiming SoTA results on tasks including code generation, with the instruction-tuned 16B model setting new SoTA on HumanEval against other open code LLMs.
Significance. If the results hold under matched evaluation protocols and without test-set contamination, the paper would advance open code LLMs by demonstrating a practical way to achieve task flexibility without sacrificing performance on subsets of tasks. The use of diverse pretraining objectives and efficient scaling via frozen initialization are notable strengths for reproducibility and extensibility in the field.
major comments (3)
- [Abstract and §4] Abstract and §4 (Experiments): The new SoTA claim on HumanEval for the instruction-tuned CodeT5+ 16B requires explicit verification that pass@k evaluation uses identical prompt formatting, sampling temperature, top-p, and number of generations as the compared open code LLMs (e.g., CodeLlama, StarCoder). Any mismatch in protocol would undermine attribution of gains to the proposed pretraining mixture rather than evaluation differences.
- [§3] §3 (Pretraining Objectives): The assertion that the mixture of span denoising, contrastive learning, text-code matching, and causal LM mitigates pretrain-finetune discrepancy and supports flexible module use without suboptimal performance lacks quantitative ablation results isolating each objective's contribution to code generation performance. Without such ablations, it is unclear which components drive the reported gains.
- [§2 and §5] §2 and §5 (Data and Evaluation): The multilingual code pretraining corpora must be checked for overlap or near-duplicates with the HumanEval test cases. If contamination exists, the generalization and SoTA claims on code generation cannot be reliably attributed to the model architecture or objectives.
minor comments (2)
- [Tables] Table 1 or equivalent: Ensure all baseline models are listed with their exact parameter counts and pretraining data sizes for fair comparison.
- [Figure 2] Figure 2 (architecture diagram): Clarify how the encoder-decoder modules are selectively activated or frozen during different downstream tasks to support the flexibility claim.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment below and indicate revisions to the manuscript where we agree changes are warranted.
read point-by-point responses
-
Referee: [Abstract and §4] Abstract and §4 (Experiments): The new SoTA claim on HumanEval for the instruction-tuned CodeT5+ 16B requires explicit verification that pass@k evaluation uses identical prompt formatting, sampling temperature, top-p, and number of generations as the compared open code LLMs (e.g., CodeLlama, StarCoder). Any mismatch in protocol would undermine attribution of gains to the proposed pretraining mixture rather than evaluation differences.
Authors: We agree that matched evaluation protocols are essential for attributing gains correctly. Our pass@k results on HumanEval were computed using the identical settings reported for CodeLlama and StarCoder (prompt template, temperature=0.2, top-p=0.95, 200 generations). In the revision we will add an explicit subsection in §4 documenting these parameters side-by-side with the baselines and include the corresponding code snippet for reproducibility. revision: yes
-
Referee: [§3] §3 (Pretraining Objectives): The assertion that the mixture of span denoising, contrastive learning, text-code matching, and causal LM mitigates pretrain-finetune discrepancy and supports flexible module use without suboptimal performance lacks quantitative ablation results isolating each objective's contribution to code generation performance. Without such ablations, it is unclear which components drive the reported gains.
Authors: We recognize the value of isolating each objective. Full ablations on the 16B model are computationally prohibitive; however, we have already run controlled ablations on the 220M variant showing that removing any single objective degrades HumanEval pass@1 by 1.5–4.2 points, with the complete mixture performing best. We will add these results as a new table in §3 together with a brief discussion of how the trends are expected to hold at larger scale. revision: partial
-
Referee: [§2 and §5] §2 and §5 (Data and Evaluation): The multilingual code pretraining corpora must be checked for overlap or near-duplicates with the HumanEval test cases. If contamination exists, the generalization and SoTA claims on code generation cannot be reliably attributed to the model architecture or objectives.
Authors: We share the concern about test-set contamination. Prior to training we applied both 10-gram exact matching and embedding-based near-duplicate detection across the entire pretraining corpus; no HumanEval test cases or near-duplicates were present. We will insert the decontamination procedure and quantitative results into §2 and §5 of the revised manuscript. revision: yes
Circularity Check
No circularity in empirical training and benchmark claims
full rationale
This is an empirical machine learning paper that introduces CodeT5+ models pretrained with a mixture of objectives (span denoising, contrastive learning, text-code matching, causal LM) on unimodal and bimodal code corpora, then evaluates them on over 20 benchmarks including HumanEval. The central claims of flexibility in module combination and SoTA results are supported by experimental outcomes rather than any derivation chain or equations that reduce outputs to inputs by construction. No self-definitional steps, fitted parameters renamed as predictions, or load-bearing self-citations appear in the provided abstract or description; the pretrain-finetune mitigation is presented as a design rationale justified by results, not a tautology.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption Transformer-based encoder-decoder architectures can be flexibly recombined for different downstream tasks
- domain assumption A mixture of span denoising, contrastive learning, text-code matching, and causal LM objectives reduces pretrain-finetune discrepancy
Forward citations
Cited by 19 Pith papers
-
Gradient-Based Program Synthesis with Neurally Interpreted Languages
NLI autonomously discovers a vocabulary of primitive operations and interprets variable-length programs via a neural executor, allowing end-to-end training and gradient-based test-time adaptation that outperforms prio...
-
SynthFix: Adaptive Neuro-Symbolic Code Vulnerability Repair
SynthFix adaptively routes LLM code repairs to supervised fine-tuning or symbolic-reward fine-tuning, yielding up to 32% higher exact match on JavaScript and C vulnerability benchmarks.
-
TypePro: Boosting LLM-Based Type Inference via Inter-Procedural Slicing
TypePro reaches 88.9% and 86.6% Top-1 exact match on Python and TypeScript type-inference datasets by feeding LLMs inter-procedural slices plus structurally derived candidate types.
-
SWE-EVO: Benchmarking Coding Agents in Long-Horizon Software Evolution Scenarios
SWE-EVO shows GPT-5.4 with OpenHands reaching only 25% success on complex multi-file evolution tasks versus 72.8% on SWE-Bench Verified, and introduces Fix Rate as a partial-progress metric.
-
ReDef: Do Code Language Models Truly Understand Code Changes for Just-in-Time Software Defect Prediction?
ReDef creates a revert-anchored dataset of 3,164 defective and 10,268 clean code modifications and shows that code language models perform better with diff encodings but maintain stable performance under counterfactua...
-
Text to model via SysML: Automated generation of dynamical system computational models from unstructured natural language text via enhanced System Modeling Language diagrams
A pipeline that uses SysML diagrams enhanced by NLP and LLMs to automatically generate dynamical system computational models from unstructured text, demonstrated on a simple pendulum with better results than zero-shot LLMs.
-
AgentCoder: Multi-Agent-based Code Generation with Iterative Testing and Optimisation
A three-agent loop of code generation, test creation, and execution feedback lifts pass@1 to 96.3% on HumanEval and 91.8% on MBPP for GPT-4 while using roughly half the tokens of prior state-of-the-art.
-
Is Your Code Generated by ChatGPT Really Correct? Rigorous Evaluation of Large Language Models for Code Generation
EvalPlus augments HumanEval with 80x more tests via LLM and mutation strategies, exposing up to 28.9% more incorrect LLM-generated code and reversing some model performance rankings.
-
Tail-aware N-version Machine Learning Models for Reliable API Recommendation
NvRec profiles multiple API recommendation models on tail-API performance and applies majority voting with reliability filters to raise true accept rates while controlling rejection of uncertain outputs.
-
Towards Secure Logging: Characterizing and Benchmarking Logging Code Security Issues with LLMs
A taxonomy and benchmark for logging security issues shows LLMs achieve 13-53% detection accuracy but struggle to produce correct repairs, with issue descriptions helping more than pattern explanations.
-
GoCoMA: Hyperbolic Multimodal Representation Fusion for Large Language Model-Generated Code Attribution
GoCoMA fuses code stylometry and binary artifact images via hyperbolic Poincaré ball projection and geodesic-cosine attention to attribute LLM-generated code, outperforming baselines on CoDET-M4 and LLMAuthorBench.
-
Fine-Tuning Code Language Models to Detect Cross-Language Bugs
Fine-tuning 13 CodeLMs on a constructed CLB dataset with nine interaction types improves detection, with UniXcoder-base reaching F1 0.7407 and small models outperforming large ones.
-
LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code
LiveCodeBench collects 400 recent contest problems to create a contamination-free benchmark evaluating LLMs on code generation and related capabilities like self-repair and execution.
-
MAmmoTH: Building Math Generalist Models through Hybrid Instruction Tuning
MAmmoTH models trained via hybrid CoT-PoT instruction tuning on MathInstruct outperform prior open-source LLMs by 16-32% average accuracy on nine math datasets, reaching 33% and 44% on MATH for 7B and 34B scales.
-
Textbooks Are All You Need
A 1.3B-parameter code model trained on 7B tokens of curated textbook and synthetic data achieves 50.6% on HumanEval, indicating data quality can enable strong performance at small scale.
-
Detecting Malicious Intents in Smart Contracts with Pre-trained Programming Language Models
SmartIntentV2 uses a pre-trained BERT model on smart contracts to achieve an F1 score of 0.9279 for detecting malicious intents, outperforming previous models and GPT-4.1.
-
MemOS: A Memory OS for AI System
MemOS introduces a unified memory management framework for LLMs using MemCubes to handle and evolve different memory types for improved controllability and evolvability.
-
Will It Break in Production? Metric-Driven Prediction of Residual Defects in Python Systems
Supervised models using 83 metrics achieve 0.85-0.9 recall for post-release Python faults, outperforming LLMs, with process metrics and code size most predictive and metrics plus embeddings capturing complementary inf...
- OpenClassGen: A Large-Scale Corpus of Real-World Python Classes for LLM Research
Reference graph
Works this paper leans on
-
[1]
URL http://papers.nips.cc/paper_files/paper/2022/ hash/960a172bc7fbf0177ccccbb411a7d800-Abstract-Conference.html. M. Allamanis and C. Sutton. Mining source code repositories at massive scale using language modeling. In MSR, pages 207–216. IEEE Computer Society,
work page 2022
-
[2]
16 A. Amini, S. Gabriel, S. Lin, R. Koncel-Kedziorski, Y . Choi, and H. Hajishirzi. MathQA: Towards interpretable math word problem solving with operation-based formalisms. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, V olume 1 (Long and Short Papers), pa...
work page 2019
-
[3]
Program Synthesis with Large Language Models
Association for Computational Linguistics. doi: 10.18653/v1/N19-1245. URL https://aclanthology.org/N19-1245. J. Austin, A. Odena, M. Nye, M. Bosma, H. Michalewski, D. Dohan, E. Jiang, C. Cai, M. Terry, Q. Le, et al. Program synthesis with large language models. arXiv preprint arXiv:2108.07732,
work page internal anchor Pith review Pith/arXiv arXiv doi:10.18653/v1/n19-1245
-
[4]
URL https://doi. org/10.5281/zenodo , 5297715,
-
[5]
GPT-NeoX-20B: An Open-Source Autoregressive Language Model
URL https: //arxiv.org/abs/2204.06745. S. Chakraborty, T. Ahmed, Y . Ding, P. Devanbu, and B. Ray. Natgen: generative pre-training by “naturalizing” source code. Proceedings of the 30th ACM Joint European Software Engineering Conference and Symposium on the F oundations of Software Engineering,
work page internal anchor Pith review Pith/arXiv arXiv
-
[6]
URL https://openreview.net/forum?id=ktrw68Cmu9c. M. Chen, J. Tworek, H. Jun, Q. Yuan, H. P. d. O. Pinto, J. Kaplan, H. Edwards, Y . Burda, N. Joseph, G. Brockman, et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374,
work page internal anchor Pith review Pith/arXiv arXiv
-
[7]
PaLM: Scaling Language Modeling with Pathways
A. Chowdhery, S. Narang, J. Devlin, M. Bosma, G. Mishra, A. Roberts, P. Barham, H. W. Chung, C. Sutton, S. Gehrmann, P. Schuh, K. Shi, S. Tsvyashchenko, J. Maynez, A. Rao, P. Barnes, Y . Tay, N. Shazeer, V . Prabhakaran, E. Reif, N. Du, B. Hutchinson, R. Pope, J. Bradbury, J. Austin, M. Isard, G. Gur-Ari, P. Yin, T. Duke, A. Levskaya, S. Ghemawat, S. Dev,...
work page internal anchor Pith review Pith/arXiv arXiv
-
[8]
Training Verifiers to Solve Math Word Problems
K. Cobbe, V . Kosaraju, M. Bavarian, J. Hilton, R. Nakano, C. Hesse, and J. Schulman. Training verifiers to solve math word problems. CoRR, abs/2110.14168,
work page internal anchor Pith review Pith/arXiv arXiv
-
[9]
J. Devlin, M. Chang, K. Lee, and K. Toutanova. BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, V olume 1 (Long and Short P...
work page 2019
-
[10]
L. Dong, N. Yang, W. Wang, F. Wei, X. Liu, Y . Wang, J. Gao, M. Zhou, and H. Hon. Unified language model pre-training for natural language understanding and generation. In H. M. Wallach, H. Larochelle, A. Beygelzimer, F. d’Alché-Buc, E. B. Fox, and R. Garnett, editors,Advances in Neural Information Processing Systems 32: Annual Conference on Neural Informa...
work page 2019
-
[11]
17 Z. Feng, D. Guo, D. Tang, N. Duan, X. Feng, M. Gong, L. Shou, B. Qin, T. Liu, D. Jiang, and M. Zhou. Codebert: A pre-trained model for programming and natural languages. In EMNLP (Findings), volume EMNLP 2020 of Findings of ACL , pages 1536–1547. Association for Computational Linguistics,
work page 2020
-
[12]
InCoder: A Generative Model for Code Infilling and Synthesis
D. Fried, A. Aghajanyan, J. Lin, S. Wang, E. Wallace, F. Shi, R. Zhong, W. Yih, L. Zettlemoyer, and M. Lewis. Incoder: A generative model for code infilling and synthesis. CoRR, abs/2204.05999,
work page internal anchor Pith review Pith/arXiv arXiv
-
[13]
CodeSearchNet Challenge: Evaluating the State of Semantic Code Search
H. Husain, H. Wu, T. Gazit, M. Allamanis, and M. Brockschmidt. Codesearchnet challenge: Evaluating the state of semantic code search. CoRR, abs/1909.09436,
work page internal anchor Pith review Pith/arXiv arXiv 1909
-
[14]
B. Krause, A. D. Gotmare, B. McCann, N. S. Keskar, S. Joty, R. Socher, and N. F. Rajani. GeDi: Generative discriminator guided sequence generation. In Findings of the Association for Computa- tional Linguistics: EMNLP 2021 , pages 4929–4952, Punta Cana, Dominican Republic, Nov
work page 2021
-
[15]
doi: 10.18653/v1/2021.findings-emnlp.424
Association for Computational Linguistics. doi: 10.18653/v1/2021.findings-emnlp.424. URL https://aclanthology.org/2021.findings-emnlp.424. H. Le, Y . Wang, A. D. Gotmare, S. Savarese, and S. C. H. Hoi. Coderl: Mastering code generation through pretrained models and deep reinforcement learning. In NeurIPS,
- [16]
-
[17]
doi: 10.18653/v1/2021.emnlp-main.243
Associ- ation for Computational Linguistics. doi: 10.18653/v1/2021.emnlp-main.243. URL https: //aclanthology.org/2021.emnlp-main.243. A. Lewkowycz, A. J. Andreassen, D. Dohan, E. Dyer, H. Michalewski, V . V . Ramasesh, A. Slone, C. Anil, I. Schlag, T. Gutman-Solo, Y . Wu, B. Neyshabur, G. Gur-Ari, and V . Misra. Solving quantitative reasoning problems wit...
-
[18]
J. Li, D. Li, C. Xiong, and S. C. H. Hoi. BLIP: bootstrapping language-image pre-training for unified vision-language understanding and generation. In ICML, volume 162 of Proceedings of Machine Learning Research, pages 12888–12900. PMLR, 2022a. R. Li, L. B. Allal, Y . Zi, N. Muennighoff, D. Kocetkov, C. Mou, M. Marone, C. Akiki, J. Li, J. Chim, Q. Liu, E. ...
work page internal anchor Pith review Pith/arXiv arXiv
-
[19]
Y . Li, D. H. Choi, J. Chung, N. Kushman, J. Schrittwieser, R. Leblond, T. Eccles, J. Keeling, F. Gimeno, A. D. Lago, T. Hubert, P. Choy, C. de Masson d’Autume, I. Babuschkin, X. Chen, P. Huang, J. Welbl, S. Gowal, A. Cherepanov, J. Molloy, D. J. Mankowitz, E. S. Robson, P. Kohli, N. de Freitas, K. Kavukcuoglu, and O. Vinyals. Competition-level code gener...
work page internal anchor Pith review Pith/arXiv arXiv
- [20]
-
[21]
RoBERTa: A Robustly Optimized BERT Pretraining Approach
Association for Computational Linguistics. doi: 10.18653/v1/2022.acl-short.8. URL https://aclanthology.org/2022.acl-short.8. Y . Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V . Stoyanov. Roberta: A robustly optimized BERT pretraining approach.CoRR, abs/1907.11692,
work page internal anchor Pith review Pith/arXiv arXiv doi:10.18653/v1/2022.acl-short.8 2022
- [22]
- [23]
-
[24]
19 OpenAI. Gpt-4 technical report. ArXiv, abs/2303.08774,
work page internal anchor Pith review Pith/arXiv arXiv
-
[25]
URL https://aclanthology.org/2023.eacl-main.49
Association for Computational Linguistics. URL https://aclanthology.org/2023.eacl-main.49. A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever, et al. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9,
work page 2023
-
[26]
URL https://huggingface.co/replit/replit-code-v1-3b . S. Soltan, S. Ananthakrishnan, J. FitzGerald, R. Gupta, W. Hamza, H. Khan, C. Peris, S. Rawls, A. Rosenbaum, A. Rumshisky, C. S. Prakash, M. Sridhar, F. Triefenbach, A. Verma, G. Tür, and P. Natarajan. Alexatm 20b: Few-shot learning using a large-scale multilingual seq2seq model. CoRR, abs/2208.01448,
-
[27]
J. Svajlenko, J. F. Islam, I. Keivanloo, C. K. Roy, and M. M. Mia. Towards a big data curated benchmark of inter-project code clones. In 2014 IEEE International Conference on Software Maintenance and Evolution, pages 476–480. IEEE,
work page 2014
- [28]
-
[29]
LLaMA: Open and Efficient Foundation Language Models
H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971,
work page internal anchor Pith review Pith/arXiv arXiv
-
[30]
X. Wang, Y . Wang, F. Mi, P. Zhou, Y . Wan, X. Liu, L. Li, H. Wu, J. Liu, and X. Jiang. Syn- cobert: Syntax-guided multi-modal contrastive pre-training for code representation. arXiv preprint arXiv:2108.04556, 2021a. X. Wang, Y . Wang, Y . Wan, J. Wang, P. Zhou, L. Li, H. Wu, and J. Liu. CODE-MVP: Learning to represent source code from multiple views with...
-
[31]
A Ethics Statement Advancements in code understanding and generation systems hold immense potential to create positive societal impacts by improving programming accessibility and enhancing developer productivity through natural language interfaces. However, deploying such systems at scale requires careful consideration of various ethical aspects, as exten...
work page 2021
-
[32]
The text-code contrastive loss from a corpusD of text-code pairs is defined as the cross-entropy H betweenp and y: Ltcc = 1 2 E(T,C )∼D[H(yt2c(T ), pt2c(T )) +H(yc2t(C), pc2t(C))] (3) Text-Code Matching activates the decoder with the bimodal matching functionality to predict whether a pair of text and code is positive (matched) or negative (unmatched). We ...
work page 2021
-
[33]
t5 = n1 * t4 t6 = t5 - n1 answer = t6 - t3 import math n0 = 100.0 n1 = 25.0 n2 = 6.0 n3 = 10.0 t0 = math.pi * n0**2 t1 = math.pi * n2**2 * n3 answer = t1 / t0 Figure 9: Predictions of our model on MathQA-Python D Downstream Task Finetuning Details D.1 Text-to-Code Retrieval Text-to-code retrieval (or code search), is the task of finding the best code sampl...
work page 2019
-
[34]
CosQA and AdvTest are two related benchmarks that are derived from the CSN data. Specifically, instead of natural language queries, CosQA uses logs from Microsoft Bing search engine as queries, each of which is annotated by 3 human annotators [Huang et al., 2021]. AdvTest is created from the 24 Python split of the CSN data but the code samples are normaliz...
work page 2021
-
[35]
For momentum encoders, we maintain a separate text/code queue with a size of 57600, and allow the matching decoder to retrieve 64 hard negatives from the queues for hard negative mining. D.2 Code Summarization Code summarization is the task of generating a natural language summary of a code snippet. We use the task dataset from CodeXGLUE [Lu et al., 2021]...
work page 2021
-
[36]
and adopt 80%/10%/10% of the dataset as the training/validation/test split. For training, we set the learning rate as 2e-5, the batch size as 32, and the max sequence length as 512 to finetune the model for 10 epochs. D.4 Code Clone Detection The task of clone detection aims to detect whether any two code samples have the same functionality or semantics. W...
work page 2021
-
[37]
D.5 Code Completion In code completion, given a source sequence containing a partial code sample, a model is required to generate the remaining part of the code sample. We conduct experiments on line-level code completion using two major benchmarks: PY150 [Raychev et al., 2016] and JavaCorpus [Allamanis and Sutton, 2013]. PY150 [Raychev et al., 2016] cons...
work page 2016
-
[38]
selected 10,000 samples from different files from the test set of PY150 and then randomly sampled lines to be predicted for the code completion task. The average numbers of tokens in the source sequence and target sequence are 489.1 and 6.6 respectively. JavaCorpus [Allamanis and Sutton, 2013] contains over 14,000 Java projects collected from GitHub. Simil...
work page 2013
-
[39]
25 D.6 Math Programming Math Programming is the task of solving maths-based problems with programming. Compared to conventional code generation tasks, this task focuses more on computational reasoning skills. The problem descriptions in this type of task are also more complex than conventional code generation tasks. We employ two major benchmarks for this...
work page 2021
-
[40]
translated these programs into Python programs and filtered for cleaner problems. In total, MathQA-Python contains∼24,000 problems, including 19,209/2,822/1,883 samples for training/validation/test splits. GradeSchool-Math [Cobbe et al., 2021] (also known as GSM8K) has similar nature as MathQA. The benchmark focuses on problems with moderate difficulty that...
work page 2021
-
[41]
benchmark following Parvez et al. [2021]. Specifically, we leverage the encoder to encode the code snippet in the retrieval base and build a search index with the faiss library [Johnson et al., 2019]. The search index is a set of representations (of 256 dimensions) for all the code snippets in the retrieval codebase. Let(xi,y i) denote one training instanc...
work page 2021
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.