Recognition: 2 Lean theorem links
CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation
Pith reviewed 2026-05-15 11:17 UTC · model grok-4.3
The pith
CodeT5 is a unified encoder-decoder model that pre-trains by distinguishing and recovering developer-assigned identifiers to handle both code understanding and generation.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
CodeT5 combines a unified encoder-decoder architecture with an identifier-aware pre-training task, which trains the model to recognize identifiers among code tokens and to recover them when they are masked, and with a bimodal dual generation task that pairs code with its comments. Together these allow the same model to excel at understanding tasks such as defect and clone detection and at generation tasks in the PL-NL, NL-PL, and PL-PL directions.
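To make the bimodal dual generation task concrete, here is a minimal sketch of how one (comment, code) pair could be turned into the two seq2seq training examples the claim refers to; the task prefixes and the helper name are illustrative assumptions, not the paper's exact input format.

```python
# Hypothetical sketch: turn one bimodal (comment, code) example into the two
# dual-generation directions (NL -> PL and PL -> NL). The task prefixes are
# assumptions for illustration, not CodeT5's actual formatting.

def dual_generation_examples(comment: str, code: str) -> list[tuple[str, str]]:
    """Return (source, target) pairs for both generation directions."""
    return [
        ("generate code: " + comment, code),   # NL -> PL
        ("summarize code: " + code, comment),  # PL -> NL
    ]

comment = "Return the larger of two integers."
code = "def max2(a, b):\n    return a if a > b else b"
for src, tgt in dual_generation_examples(comment, code):
    print(repr(src), "->", repr(tgt))
```

At the data level, the dual task simply trains on both directions derived from the same comment-code pair.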
What carries the argument
The identifier-aware pre-training task, which teaches the model to distinguish developer-assigned identifiers from other tokens and to recover masked identifiers.
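A rough sketch of the two identifier-aware signals described above, assuming identifier positions are already known (in practice they would come from a parser) and borrowing T5-style sentinel tokens for the masked spans; this is an illustrative simplification of the paper's scheme, not a reproduction of it.

```python
# Minimal sketch of the identifier-aware signals: (1) identifier tagging and
# (2) masked identifier prediction behind sentinel tokens. Identifier
# positions are hand-labelled here; a real pipeline would obtain them from a
# parser. The sentinel convention is an assumption for illustration.

def identifier_tags(tokens: list[str], id_positions: set[int]) -> list[int]:
    """1 if the token at that position is a developer-assigned identifier."""
    return [1 if i in id_positions else 0 for i in range(len(tokens))]

def mask_identifiers(tokens: list[str], id_positions: set[int]) -> tuple[str, str]:
    """Replace identifiers with sentinels; the target lists the originals."""
    source, target = [], []
    sentinel = 0
    for i, tok in enumerate(tokens):
        if i in id_positions:
            source.append(f"<extra_id_{sentinel}>")
            target += [f"<extra_id_{sentinel}>", tok]
            sentinel += 1
        else:
            source.append(tok)
    return " ".join(source), " ".join(target)

tokens = ["def", "max2", "(", "a", ",", "b", ")", ":", "return", "a"]
id_positions = {1, 3, 5, 9}  # positions of max2, a, b, a
print(identifier_tags(tokens, id_positions))
print(mask_identifiers(tokens, id_positions))
```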
Load-bearing premise
The performance gains from the identifier-aware pre-training objective and the bimodal dual generation task will hold on new datasets and different fine-tuning setups.
What would settle it
A controlled experiment that applies the same fine-tuning regime to CodeT5 and prior models on a fresh set of code repositories and finds no accuracy advantage for CodeT5 on defect detection or code summarization.
read the original abstract
Pre-trained models for Natural Languages (NL) like BERT and GPT have been recently shown to transfer well to Programming Languages (PL) and largely benefit a broad set of code-related tasks. Despite their success, most current methods either rely on an encoder-only (or decoder-only) pre-training that is suboptimal for generation (resp. understanding) tasks or process the code snippet in the same way as NL, neglecting the special characteristics of PL such as token types. We present CodeT5, a unified pre-trained encoder-decoder Transformer model that better leverages the code semantics conveyed from the developer-assigned identifiers. Our model employs a unified framework to seamlessly support both code understanding and generation tasks and allows for multi-task learning. Besides, we propose a novel identifier-aware pre-training task that enables the model to distinguish which code tokens are identifiers and to recover them when they are masked. Furthermore, we propose to exploit the user-written code comments with a bimodal dual generation task for better NL-PL alignment. Comprehensive experiments show that CodeT5 significantly outperforms prior methods on understanding tasks such as code defect detection and clone detection, and generation tasks across various directions including PL-NL, NL-PL, and PL-PL. Further analysis reveals that our model can better capture semantic information from code. Our code and pre-trained models are released at https://github.com/salesforce/CodeT5.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces CodeT5, a unified pre-trained encoder-decoder Transformer model for code understanding and generation. It proposes an identifier-aware pre-training objective that distinguishes and recovers masked identifiers, plus a bimodal dual generation task leveraging code comments for NL-PL alignment. The central claim is that CodeT5 significantly outperforms prior encoder-only and encoder-decoder baselines on understanding tasks (defect detection, clone detection) and generation tasks (PL-NL, NL-PL, PL-PL), with analysis indicating better semantic capture from code.
Significance. If the reported gains prove attributable to the identifier-aware and bimodal objectives rather than uncontrolled differences in scale or data, the work would advance unified pre-training for code by demonstrating the value of explicitly modeling identifier semantics and comment-code alignment, providing a reusable framework and released models that could serve as stronger baselines for downstream code tasks.
major comments (2)
- [Experimental Results] Experimental section (results tables on CodeSearchNet-derived tasks): the reported 2-5 point lifts over CodeBERT and PLBART are not accompanied by controlled comparisons that match model parameter count, pre-training corpus size, and fine-tuning schedule exactly to the strongest baselines; without such controls it remains unclear whether the gains are driven by the new identifier-aware and bimodal objectives or by other factors.
- [Ablation Studies] Ablation studies subsection: there is no explicit ablation that isolates the contribution of the identifier-aware masked identifier prediction task from the bimodal dual generation task or from the base encoder-decoder architecture, weakening the attribution of improvements to the proposed pre-training innovations.
minor comments (2)
- [Abstract] The abstract states that 'comprehensive experiments show' outperformance but does not list the exact set of tasks, datasets, or number of runs used to support the significance claims.
- [Pre-training Objectives] Notation for the bimodal generation loss weight and identifier masking rate is introduced without an explicit equation or hyper-parameter table entry, making it harder to reproduce the exact training objective.
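One hypothetical way to write down the combined objective this comment asks for is sketched below, purely to show where the two hyper-parameters would appear; the symbols (a masking rate gamma and a loss weight lambda) and the term labels are illustrative shorthand, not the paper's notation.

```latex
% Hypothetical notation only: gamma (identifier masking rate) and lambda
% (bimodal generation loss weight) are assumed symbols, not the paper's.
\begin{align}
  \mathcal{L}(\theta) &=
      \mathcal{L}_{\mathrm{span}}(\theta)
    + \mathcal{L}_{\mathrm{tag}}(\theta)
    + \mathcal{L}_{\mathrm{id}}^{\gamma}(\theta)
    + \lambda\,\mathcal{L}_{\mathrm{dual}}(\theta), \\
  \mathcal{L}_{\mathrm{id}}^{\gamma}(\theta) &=
    -\sum_{t=1}^{|I|} \log P_{\theta}\!\left(I_t \mid I_{<t},\, x^{\setminus I(\gamma)}\right),
\end{align}
```

where span, tag, id, and dual denote masked span prediction, identifier tagging, masked identifier prediction, and bimodal dual generation, I is the sequence of identifiers masked at rate gamma, and x with those identifiers removed is the encoder input. Whether such terms are summed with a fixed weight or alternated across batches is exactly the detail the comment asks the authors to pin down.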
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our experimental controls and ablation studies. We address each major comment below with clarifications from the manuscript and indicate where revisions will strengthen the presentation.
read point-by-point responses
-
Referee: [Experimental Results] Experimental section (results tables on CodeSearchNet-derived tasks): the reported 2-5 point lifts over CodeBERT and PLBART are not accompanied by controlled comparisons that match model parameter count, pre-training corpus size, and fine-tuning schedule exactly to the strongest baselines; without such controls it remains unclear whether the gains are driven by the new identifier-aware and bimodal objectives or by other factors.
Authors: We agree that perfectly matched controls would strengthen attribution. Our comparisons follow standard practice: baseline results use the model sizes, pre-training corpora, and fine-tuning schedules reported in the original CodeBERT and PLBART papers (CodeT5 has 220M parameters, comparable to PLBART), and Table 1 already lists parameter counts. The gains are consistent across five task families and multiple datasets, and CodeT5 also outperforms larger models on some metrics. In revision we will add an expanded discussion paragraph and a supplementary table that explicitly compares pre-training data sizes and hyper-parameters across all baselines, making these factors transparent. revision: partial
-
Referee: [Ablation Studies] Ablation studies subsection: there is no explicit ablation that isolates the contribution of the identifier-aware masked identifier prediction task from the bimodal dual generation task or from the base encoder-decoder architecture, weakening the attribution of improvements to the proposed pre-training innovations.
Authors: Section 4.3 already reports ablations removing the identifier-aware objective (showing 1.5–3.2 point drops on defect and clone detection) and compares the full model against a T5 encoder-decoder baseline without our tasks. To directly isolate both contributions, we will add a new ablation table in the revision that includes four variants: (1) base encoder-decoder, (2) + identifier-aware only, (3) + bimodal dual generation only, and (4) both objectives. This will make the incremental value of each innovation explicit. revision: yes
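A minimal sketch of the promised four-variant ablation grid, expressed as hypothetical pre-training flags; the flag names are illustrative, not configuration options from the released code.

```python
# Hypothetical enumeration of the four ablation variants described in the
# rebuttal: base encoder-decoder, each objective alone, and both together.
variants = [
    {"identifier_aware": False, "bimodal_dual_gen": False},  # (1) base encoder-decoder
    {"identifier_aware": True,  "bimodal_dual_gen": False},  # (2) + identifier-aware only
    {"identifier_aware": False, "bimodal_dual_gen": True},   # (3) + bimodal dual generation only
    {"identifier_aware": True,  "bimodal_dual_gen": True},   # (4) both objectives
]
for i, cfg in enumerate(variants, 1):
    print(f"variant {i}: {cfg}")
```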
Circularity Check
No circularity: claims rest on external empirical comparisons
full rationale
The paper introduces CodeT5 with identifier-aware pre-training and bimodal generation objectives, then reports performance on CodeSearchNet-derived tasks and others via direct comparison to published baselines such as CodeBERT and PLBART. No equations, fitted parameters, or self-citations are used to derive the central performance claims; the results are presented as measured outcomes on held-out test sets. The derivation chain consists of standard pre-training followed by fine-tuning and evaluation, with no step that reduces a reported gain to an input quantity by construction.
Axiom & Free-Parameter Ledger
free parameters (2)
- identifier masking rate
- bimodal generation loss weight
axioms (2)
- domain assumption: Identifiers chosen by developers carry recoverable semantic information beyond syntactic structure.
- domain assumption: Natural-language comments are sufficiently aligned with code to support bidirectional generation.
Forward citations
Cited by 23 Pith papers
-
HEJ-Robust: A Robustness Benchmark for LLM-Based Automated Program Repair
LLM-based Java program repair models lose over 50% of their bug-fixing success rate when presented with equivalent but syntactically varied buggy code.
-
HEJ-Robust: A Robustness Benchmark for LLM-Based Automated Program Repair
HEJ-Robust benchmark shows LLM-based program repair models drop over 50% in accuracy when buggy code is rewritten with equivalent syntax.
-
InvEvolve: Evolving White-Box Inventory Policies via Large Language Models with Performance Guarantees
InvEvolve evolves white-box inventory policies from LLMs with statistical safety guarantees and outperforms classical and deep learning methods on synthetic and real retail data.
-
RepoDoc: A Knowledge Graph-Based Framework to Automatic Documentation Generation and Incremental Updates
RepoDoc uses a repository knowledge graph with module clustering and semantic impact propagation to generate more complete documentation 3x faster with 85% fewer tokens and handle incremental updates 73% faster than p...
-
Assessing the Impact of Requirement Ambiguity on LLM-based Function-Level Code Generation
Orchid benchmark shows requirement ambiguity degrades LLM code generation performance across all models, with advanced models hit hardest, and LLMs rarely detect or resolve the ambiguity themselves.
-
PlayCoder: Making LLM-Generated GUI Code Playable
PlayCoder raises the rate of LLM-generated GUI apps that can be played end-to-end without logic errors from near zero to 20.3% Play@3 by adding repository-aware generation, agent-driven testing, and iterative repair.
-
SynthFix: Adaptive Neuro-Symbolic Code Vulnerability Repair
SynthFix adaptively routes LLM code repairs to supervised fine-tuning or symbolic-reward fine-tuning, yielding up to 32% higher exact match on JavaScript and C vulnerability benchmarks.
-
TypePro: Boosting LLM-Based Type Inference via Inter-Procedural Slicing
TypePro reaches 88.9% and 86.6% Top-1 exact match on Python and TypeScript type-inference datasets by feeding LLMs inter-procedural slices plus structurally derived candidate types.
-
PEML: Parameter-efficient Multi-Task Learning with Optimized Continuous Prompts
PEML co-optimizes continuous prompts and low-rank adaptations to deliver up to 6.67% average accuracy gains over existing multi-task PEFT methods on GLUE, SuperGLUE, and other benchmarks.
-
MAS-Algorithm: A Workflow for Solving Algorithmic Programming Problems with a Multi-Agent System
MAS-Algorithm is a multi-agent workflow that improves AI acceptance rates on algorithmic problems by 6.48% on average, outperforming parameter-efficient fine-tuning.
-
MAS-Algorithm: A Workflow for Solving Algorithmic Programming Problems with a Multi-Agent System
MAS-Algorithm is a multi-agent workflow that raises acceptance rates on algorithmic problems by 6.48% on average over baseline models.
-
Bridging Generation and Training: A Systematic Review of Quality Issues in LLMs for Code
A review of 114 studies creates taxonomies for code and data quality issues, formalizes 18 propagation mechanisms from training data defects to LLM-generated code defects, and synthesizes detection and mitigation techniques.
-
Root-Cause-Driven Automated Vulnerability Repair
Kumushi improves automated vulnerability repair by focusing LLM edits on root causes via dynamic localization and ranking, yielding more genuine fixes than prior agents on 178 C/C++ vulnerabilities.
-
Mitigating False Positives in Static Memory Safety Analysis of Rust Programs via Reinforcement Learning
Reinforcement learning on MIR features with fuzz testing feedback reduces false positives in Rust static memory safety analysis, raising precision from 25.6% to 59% and accuracy to 65.2% while keeping 74.6% recall.
-
Mitigating False Positives in Static Memory Safety Analysis of Rust Programs via Reinforcement Learning
Reinforcement learning on MIR features combined with cargo-fuzz validation reduces false positives in Rust static memory safety analysis, raising precision from 25.6% to 59.0% and accuracy to 65.2%.
-
InvEvolve: Evolving White-Box Inventory Policies via Large Language Models with Performance Guarantees
InvEvolve uses LLMs and RL to generate certified inventory policies that outperform classical and deep learning methods on synthetic and real data while providing multi-period performance guarantees.
-
On the Effectiveness of Context Compression for Repository-Level Tasks: An Empirical Investigation
Continuous latent-vector compression improves BLEU scores on repository-level code tasks by up to 28.3% at 4x compression while cutting inference latency.
-
PLMGH: What Matters in PLM-GNN Hybrids for Code Classification and Vulnerability Detection
Controlled experiments show PLM-GNN hybrids improve code tasks over GNN-only baselines, with PLM source having larger impact than GNN backbone.
-
Learning Project-wise Subsequent Code Edits via Interleaving Neural-based Induction and Tool-based Deduction
TRACE improves project-wise subsequent code editing by interleaving neural-based induction for semantic edits and tool-based deduction for syntactic edits.
-
EcoAssist: Embedding Sustainability into AI-Assisted Frontend Development
EcoAssist embeds energy estimation and optimization into AI-assisted frontend coding, reducing website energy use by 13-16% in benchmarks while preserving developer productivity.
-
DeepSeek-Coder: When the Large Language Model Meets Programming -- The Rise of Code Intelligence
DeepSeek-Coder open-source models trained on 2T code tokens with fill-in-the-blank pretraining achieve SOTA results among open models and surpass closed-source Codex and GPT-3.5 on code benchmarks.
-
Prompt-Driven Code Summarization: A Systematic Literature Review
A systematic review that categorizes prompting strategies for LLM-based code summarization, assesses their effectiveness, and identifies gaps in research and evaluation practices.
-
Evaluating LLM-Generated Obfuscated XSS Payloads for Machine Learning-Based Detection
Fine-tuned LLMs generate obfuscated XSS payloads with only a 22% runtime behavior match rate, and adding them does not improve machine learning-based XSS detection.
Reference graph
Works this paper leans on
-
[2]
Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harrison Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavar...
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[3]
Kevin Clark, Minh-Thang Luong, Quoc V. Le, and Christopher D. Manning. 2020. https://openreview.net/forum?id=r1xMH1BtvB ELECTRA: pre-training text encoders as discriminators rather than generators. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020. OpenReview.net
work page 2020
-
[5]
Alexis Conneau and Guillaume Lample. 2019. https://proceedings.neurips.cc/paper/2019/hash/c04c19c2c2474dbf5f7ac4372c5b9af1-Abstract.html Cross-lingual language model pretraining . In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, December 8-14, 2019, Vancouver, BC, Canad...
work page 2019
-
[7]
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. https://www.aclweb.org/anthology/N19-1423/ BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT ...
work page 2019
-
[8]
Li Dong, Nan Yang, Wenhui Wang, Furu Wei, Xiaodong Liu, Yu Wang, Jianfeng Gao, Ming Zhou, and Hsiao-Wuen Hon. 2019. https://proceedings.neurips.cc/paper/2019/hash/c20bb2d9a50d5ac1f713f8b34d9aac5a-Abstract.html Unified language model pre-training for natural language understanding and generation. In Advances in Neural Information Processing Systems 32: ...
work page 2019
-
[9]
Ahmed Elnaggar, Wei Ding, Llion Jones, Tom Gibbs, Tamas Feher, Christoph Angerer, Silvia Severini, Florian Matthes, and Burkhard Rost. 2021. http://arxiv.org/abs/2104.02443 Codetrans: Towards cracking the language of silicone's code through self-supervised deep learning and high performance computing . CoRR, abs/2104.02443
-
[11]
Daya Guo, Shuo Ren, Shuai Lu, Zhangyin Feng, Duyu Tang, Shujie Liu, Long Zhou, Nan Duan, Alexey Svyatkovskiy, Shengyu Fu, Michele Tufano, Shao Kun Deng, Colin B. Clement, Dawn Drain, Neel Sundaresan, Jian Yin, Daxin Jiang, and Ming Zhou. 2021. https://openreview.net/forum?id=jLoC4ez43PZ Graphcodebert: Pre-training code representations with data flow . In ...
work page 2021
-
[12]
Hamel Husain, Ho-Hsiang Wu, Tiferet Gazit, Miltiadis Allamanis, and Marc Brockschmidt. 2019. http://arxiv.org/abs/1909.09436 CodeSearchNet challenge: Evaluating the state of semantic code search. CoRR, abs/1909.09436
work page internal anchor Pith review Pith/arXiv arXiv 2019
-
[14]
Aditya Kanade, Petros Maniatis, Gogul Balakrishnan, and Kensen Shi. 2020. http://proceedings.mlr.press/v119/kanade20a.html Learning and evaluating contextual embedding of source code . In Proceedings of the 37th International Conference on Machine Learning, ICML 2020, 13-18 July 2020, Virtual Event , volume 119 of Proceedings of Machine Learning Research,...
work page 2020
-
[16]
Chin-Yew Lin and Franz Josef Och. 2004. https://www.aclweb.org/anthology/C04-1072/ ORANGE: a method for evaluating automatic evaluation metrics for machine translation. In COLING 2004, 20th International Conference on Computational Linguistics, Proceedings of the Conference, 23-27 August 2004, Geneva, Switzerland
work page 2004
-
[19]
Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019b. http://arxiv.org/abs/1907.11692 RoBERTa: A robustly optimized BERT pretraining approach. CoRR, abs/1907.11692
work page internal anchor Pith review Pith/arXiv arXiv 2019
-
[20]
Shuai Lu, Daya Guo, Shuo Ren, Junjie Huang, Alexey Svyatkovskiy, Ambrosio Blanco, Colin B. Clement, Dawn Drain, Daxin Jiang, Duyu Tang, Ge Li, Lidong Zhou, Linjun Shou, Long Zhou, Michele Tufano, Ming Gong, Ming Zhou, Nan Duan, Neel Sundaresan, Shao Kun Deng, Shengyu Fu, and Shujie Liu. 2021. http://arxiv.org/abs/2102.04664 Codexglue: A machine learning b...
work page internal anchor Pith review arXiv 2021
-
[22]
Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. https://d4mucfpksywv.cloudfront.net/better-language-models/language-models.pdf Language models are unsupervised multitask learners . OpenAI blog, 1(8):9
work page 2019
-
[23]
Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. http://jmlr.org/papers/v21/20-074.html Exploring the limits of transfer learning with a unified text-to-text transformer . J. Mach. Learn. Res., 21:140:1--140:67
work page 2020
-
[24]
Shuo Ren, Daya Guo, Shuai Lu, Long Zhou, Shujie Liu, Duyu Tang, Neel Sundaresan, Ming Zhou, Ambrosio Blanco, and Shuai Ma. 2020. http://arxiv.org/abs/2009.10297 Codebleu: a method for automatic evaluation of code synthesis . CoRR, abs/2009.10297
work page internal anchor Pith review Pith/arXiv arXiv 2020
-
[25]
Baptiste Rozière, Marie-Anne Lachaux, Lowik Chanussot, and Guillaume Lample. 2020. https://proceedings.neurips.cc/paper/2020/hash/ed23fbf18c2cd35f8c7f8de44f85c08d-Abstract.html Unsupervised translation of programming languages. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020,...
work page 2020
- [26]
-
[28]
Kaitao Song, Xu Tan, Tao Qin, Jianfeng Lu, and Tie-Yan Liu. 2019. http://proceedings.mlr.press/v97/song19d.html MASS: masked sequence to sequence pre-training for language generation. In Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach, California, USA, volume 97 of Proceedings of Machine Lear...
work page 2019
-
[29]
Yu Sun, Shuohuan Wang, Yu-Kun Li, Shikun Feng, Xuyi Chen, Han Zhang, Xin Tian, Danxiang Zhu, Hao Tian, and Hua Wu. 2019. http://arxiv.org/abs/1904.09223 ERNIE: enhanced representation through knowledge integration. CoRR, abs/1904.09223
work page internal anchor Pith review Pith/arXiv arXiv 2019
-
[32]
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. https://proceedings.neurips.cc/paper/2017/hash/3f5ee243547dee91fbd053c1c4a845aa-Abstract.html Attention is all you need . In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Proces...
work page 2017
-
[34]
Yaqin Zhou, Shangqing Liu, Jing Kai Siow, Xiaoning Du, and Yang Liu. 2019. https://proceedings.neurips.cc/paper/2019/hash/49265d2447bc3bbfe9e76306ce40a31f-Abstract.html Devign: Effective vulnerability identification by learning comprehensive program semantics via graph neural networks . In Advances in Neural Information Processing Systems 32: Annual Confe...
work page 2019
-
[35]
Daniel Zügner, Tobias Kirschstein, Michele Catasta, Jure Leskovec, and Stephan Günnemann. 2021. https://openreview.net/forum?id=Xh5eMZVONGF Language-agnostic representation learning of source code from structure and context. In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. OpenReview.net
work page 2021
-
[36]
Antonio Mastropaolo, Simone Scalabrino, Nathan Cooper, and David Nader. 2021. Studying the Usage of Text-To-Text Transfer Transformer to Support Code-Related Tasks. doi:10.1109/ICSE43902.2021.00041
-
[37]
Wenhan Wang, Ge Li, Bo Ma, Xin Xia, and Zhi Jin. 2020. Detecting Code Clones with Graph Neural Network and Flow-Augmented Abstract Syntax Tree. doi:10.1109/SANER48275.2020.9054857
-
[38]
Yaqin Zhou, Shangqing Liu, Jing Kai Siow, Xiaoning Du, and Yang Liu. 2019. Devign: Effective Vulnerability Identification by Learning Comprehensive Program Semantics via Graph Neural Networks.
work page 2019
-
[39]
Michele Tufano, Cody Watson, Gabriele Bavota, Massimiliano Di Penta, Martin White, and Denys Poshyvanyk. 2019. doi:10.1145/3340544
-
[40]
Daniel Zügner et al. 2021. Language-Agnostic Representation Learning of Source Code from Structure and Context.
work page 2021
-
[41]
Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, et al. 2021. Evaluating Large Language Models Trained on Code.
work page 2021
-
[42]
Daya Guo, Shuo Ren, Shuai Lu, Zhangyin Feng, Duyu Tang, Shujie Liu, Long Zhou, Nan Duan, Alexey Svyatkovskiy, Shengyu Fu, Michele Tufano, Shao Kun Deng, Colin B. Clement, Dawn Drain, Neel Sundaresan, Jian Yin, Daxin Jiang, and Ming Zhou. 2021. GraphCodeBERT: Pre-training Code Representations with Data Flow. In 9th International Conference on Learning Representations (ICLR 2021).
work page 2021
-
[43]
Hamel Husain, Ho-Hsiang Wu, et al. 2019. CodeSearchNet Challenge: Evaluating the State of Semantic Code Search.
work page 2019
-
[44]
Alexis Conneau and Guillaume Lample. 2019. Cross-lingual Language Model Pretraining.
work page 2019
-
[45]
Jacob Devlin, Ming-Wei Chang, et al. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT 2019).
work page 2019
-
[46]
Li Dong, Nan Yang, Wenhui Wang, Furu Wei, Xiaodong Liu, Yu Wang, Jianfeng Gao, Ming Zhou, and Hsiao-Wuen Hon. 2019. Unified Language Model Pre-training for Natural Language Understanding and Generation.
work page 2019
-
[47]
Sergio Cozzetti B. de Souza, Nicolas Anquetil, et al. 2005. A Study of the Documentation Essential to Software Maintenance. doi:10.1145/1085313.1085331
-
[48]
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention Is All You Need.
work page 2017
-
[49]
Thang Luong, Hieu Pham, and Christopher D. Manning. 2015. Effective Approaches to Attention-based Neural Machine Translation. doi:10.18653/v1/d15-1166
-
[50]
Shuo Ren, Daya Guo, Shuai Lu, Long Zhou, Shujie Liu, Duyu Tang, Neel Sundaresan, Ming Zhou, Ambrosio Blanco, and Shuai Ma. 2020. CodeBLEU: a Method for Automatic Evaluation of Code Synthesis. CoRR.
work page 2020
-
[51]
Srinivasan Iyer, Ioannis Konstas, Alvin Cheung, and Luke Zettlemoyer. 2018. Mapping Language to Code in Programmatic Context. doi:10.18653/v1/d18-1192
-
[52]
Shuai Lu, Daya Guo, Shuo Ren, Junjie Huang, Alexey Svyatkovskiy, Ambrosio Blanco, Colin B. Clement, Dawn Drain, Daxin Jiang, Duyu Tang, Ge Li, Lidong Zhou, Linjun Shou, Long Zhou, Michele Tufano, Ming Gong, Ming Zhou, Nan Duan, Neel Sundaresan, Shao Kun Deng, Shengyu Fu, and Shujie Liu. 2021. CodeXGLUE: A Machine Learning Benchmark Dataset for Code Understanding and Generation.
work page 2021
-
[53]
Xiaodong Liu, Pengcheng He, Weizhu Chen, and Jianfeng Gao. 2019. Multi-Task Deep Neural Networks for Natural Language Understanding. doi:10.18653/v1/p19-1441
-
[54]
Bolin Wei, Ge Li, Xin Xia, Zhiyi Fu, and Zhi Jin. 2019. Code Generation as a Dual Task of Code Summarization.
work page 2019
- [55]
-
[56]
Kevin Clark, Minh-Thang Luong, et al. 2020. ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators. In 8th International Conference on Learning Representations (ICLR 2020).
work page 2020
-
[57]
Alexey Svyatkovskiy, Shao Kun Deng, Shengyu Fu, and Neel Sundaresan. 2020. IntelliCode Compose: Code Generation Using Transformer. doi:10.1145/3368089.3417058
-
[58]
Daniel Zügner et al. 2021. Language-Agnostic Representation Learning of Source Code from Structure and Context.
work page 2021
-
[59]
Aditya Kanade, Petros Maniatis, Gogul Balakrishnan, and Kensen Shi. 2020. Learning and Evaluating Contextual Embedding of Source Code. In Proceedings of the 37th International Conference on Machine Learning (ICML 2020).
work page 2020
-
[60]
Luca Buratti, Saurabh Pujar, Mihaela A. Bornea, J. Scott McCarley, Yunhui Zheng, Gaetano Rossiello, Alessandro Morari, Jim Laredo, Veronika Thost, Yufan Zhuang, and Giacomo Domeniconi. 2020. CoRR.
work page 2020
-
[61]
Fang Liu, Ge Li, Yunfei Zhao, and Zhi Jin. 2020. doi:10.1145/3324884.3416591
-
[62]
Ahmed Elnaggar, Wei Ding, Llion Jones, Tom Gibbs, Tamas Feher, Christoph Angerer, Silvia Severini, Florian Matthes, and Burkhard Rost. 2021. CodeTrans: Towards Cracking the Language of Silicone's Code through Self-Supervised Deep Learning and High Performance Computing. CoRR.
work page 2021
- [63]
-
[64]
Colin B. Clement, Dawn Drain, Jonathan Timcheck, Alexey Svyatkovskiy, and Neel Sundaresan. 2020. PyMT5: Multi-mode Translation of Natural Language and Python Code with Transformers. doi:10.18653/v1/2020.emnlp-main.728
-
[65]
Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, et al. 2019. HuggingFace's Transformers: State-of-the-art Natural Language Processing.
work page 2019
-
[66]
Baptiste Rozière et al. 2020. Unsupervised Translation of Programming Languages.
work page 2020
-
[67]
Wasi Uddin Ahmad, Saikat Chakraborty, Baishakhi Ray, and Kai-Wei Chang. 2021. Unified Pre-training for Program Understanding and Generation. doi:10.18653/v1/2021.naacl-main.211
-
[68]
Zhangyin Feng, Daya Guo, Duyu Tang, Nan Duan, Xiaocheng Feng, Ming Gong, Linjun Shou, Bing Qin, Ting Liu, Daxin Jiang, and Ming Zhou. 2020. CodeBERT: A Pre-Trained Model for Programming and Natural Languages. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: Findings. doi:10.18653/v1/2020.findings-emnlp.139
-
[69]
Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language Models are Unsupervised Multitask Learners. OpenAI blog.
work page 2019
-
[70]
Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A Robustly Optimized BERT Pretraining Approach. CoRR.
work page 2019
-
[71]
Taku Kudo and John Richardson. 2018. SentencePiece: A Simple and Language Independent Subword Tokenizer and Detokenizer for Neural Text Processing. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. doi:10.18653/v1/d18-2012
work page internal anchor Pith review doi:10.18653/v1/d18-2012 2018
-
[72]
Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural Machine Translation of Rare Words with Subword Units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics. doi:10.18653/v1/p16-1162
-
[73]
Kaitao Song, Xu Tan, Tao Qin, Jianfeng Lu, and Tie-Yan Liu. 2019. MASS: Masked Sequence to Sequence Pre-training for Language Generation. In Proceedings of the 36th International Conference on Machine Learning (ICML 2019).
work page 2019
-
[74]
Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Veselin Stoyanov, and Luke Zettlemoyer. 2020. BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. doi:10.18653/v1/2020.acl-main.703
-
[75]
Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. J. Mach. Learn. Res.
work page 2020
-
[76]
Daya Guo, Duyu Tang, Nan Duan, Ming Zhou, and Jian Yin. 2019. Coupling Retrieval and Meta-Learning for Context-Dependent Semantic Parsing. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. doi:10.18653/v1/P19-1082
-
[77]
Srinivasan Iyer, Alvin Cheung, and Luke Zettlemoyer. 2019. Learning Programmatic Idioms for Scalable Semantic Parsing. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). doi:10.18653/v1/D19-1545
- [78]
discussion (0)