pith. machine review for the scientific record.

arxiv: 2009.08366 · v4 · submitted 2020-09-17 · 💻 cs.SE · cs.CL

Recognition: 1 theorem link

GraphCodeBERT: Pre-training Code Representations with Data Flow

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 08:41 UTC · model grok-4.3

classification 💻 cs.SE · cs.CL
keywords pre-trained models · code representations · data flow · programming languages · code search · clone detection · code translation · code refinement

The pith

GraphCodeBERT improves code understanding by pre-training on data flow edges that track where variable values come from.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Existing code models treat programs as token sequences and miss the relations that define their meaning. This work replaces deep syntax trees with a flatter data-flow graph that records definition-use links between variables. Two new pre-training tasks teach the model to predict those links and to align token representations with the graph nodes. The resulting Transformer, equipped with graph-guided attention, reaches state-of-the-art accuracy on code search, clone detection, translation, and refinement.
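
To make the structure concrete, here is a hand-worked example, ours rather than the paper's, of the 'where-the-value-comes-from' edges on a three-line snippet; the a@1-style occurrence labels are our notation:

    # Snippet:
    #   x = a + b
    #   y = x * 2
    #   return y
    #
    # Number each variable occurrence in source order and read an edge
    # (src, dst) as "the value at dst comes from src".
    edges = [
        ("a@1", "x@1"), ("b@1", "x@1"),  # x's new value comes from a and b
        ("x@1", "x@2"),                  # the read of x on line 2 sees line 1's definition
        ("x@2", "y@1"),                  # y's value comes from that read of x
        ("y@1", "y@2"),                  # the returned y is the one defined on line 2
    ]
    # Unlike the AST of the same snippet (Module -> Assign -> BinOp -> ...),
    # this graph has no nesting: it is a flat edge set over variable occurrences.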

Core claim

GraphCodeBERT augments masked language modeling with edge prediction on the data-flow graph and cross-modal alignment between source code and the graph. The data-flow graph encodes semantic relations of the form 'where-the-value-comes-from' without the deep nesting of an abstract syntax tree. These structure-aware objectives are realized through an efficient graph-guided masked attention mechanism inside a Transformer, yielding measurable gains on four downstream code tasks.
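
A minimal sketch of how such a graph-guided mask could be assembled. This is our reading of the mechanism, not the authors' code; the [code tokens | graph nodes] sequence layout and the helper signature are assumptions:

    import torch

    NEG_INF = torch.finfo(torch.float32).min

    def graph_guided_mask(n_code, n_node, flow_edges, alignments):
        """Additive attention mask over a [code tokens | graph nodes] sequence.

        flow_edges : (src, dst) node-index pairs from the data-flow graph
        alignments : (node, token) pairs linking a variable node to the
                     code token it was extracted from
        Entries are 0.0 where attention is allowed, NEG_INF where blocked.
        """
        size = n_code + n_node
        mask = torch.full((size, size), NEG_INF)

        # Code tokens attend to one another freely.
        mask[:n_code, :n_code] = 0.0

        # Each node attends to itself and to the nodes its value comes from.
        idx = torch.arange(n_node)
        mask[n_code + idx, n_code + idx] = 0.0
        for src, dst in flow_edges:
            mask[n_code + dst, n_code + src] = 0.0

        # A node and the code token it was identified from attend to each other.
        for node, tok in alignments:
            mask[n_code + node, tok] = 0.0
            mask[tok, n_code + node] = 0.0

        return mask  # added to pre-softmax attention scores

Blocked positions receive effectively zero weight after the softmax, so structure enters at the cost of one additive mask. The ablation contemplated under 'What would settle it' below amounts to dropping the nodes and using an all-zeros mask over the code tokens alone.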

What carries the argument

The data-flow graph, which links variables by their definition-use relations, together with graph-guided masked attention that lets the Transformer attend along those edges.

If this is right

  • Code models can capture semantic relations more efficiently by using flat data-flow graphs rather than deep parse trees.
  • Adding explicit structure-prediction and alignment objectives during pre-training produces measurable gains on search, detection, and repair tasks.
  • The graph-guided attention mechanism allows a standard Transformer to incorporate code structure at modest extra cost.
  • State-of-the-art results on four distinct tasks indicate that semantic structure transfers across code understanding problems.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same data-flow pre-training could be applied to languages beyond those evaluated, testing whether the semantic edges are language-agnostic.
  • Hybrid models that combine data-flow edges with selected AST subtrees might further improve performance on tasks that require deep syntactic awareness.
  • Downstream tools such as automated program repair or code summarization may benefit from the richer variable-relation representations learned here.

Load-bearing premise

Data-flow edges supply enough semantic structure to improve code understanding without needing the full syntactic hierarchy of an abstract syntax tree.

What would settle it

An ablation of the model, trained without any data-flow edges, matching or exceeding the full GraphCodeBERT on the four evaluation tasks; that outcome would refute the load-bearing premise above.

read the original abstract

Pre-trained models for programming language have achieved dramatic empirical improvements on a variety of code-related tasks such as code search, code completion, code summarization, etc. However, existing pre-trained models regard a code snippet as a sequence of tokens, while ignoring the inherent structure of code, which provides crucial code semantics and would enhance the code understanding process. We present GraphCodeBERT, a pre-trained model for programming language that considers the inherent structure of code. Instead of taking syntactic-level structure of code like abstract syntax tree (AST), we use data flow in the pre-training stage, which is a semantic-level structure of code that encodes the relation of "where-the-value-comes-from" between variables. Such a semantic-level structure is neat and does not bring an unnecessarily deep hierarchy of AST, the property of which makes the model more efficient. We develop GraphCodeBERT based on Transformer. In addition to using the task of masked language modeling, we introduce two structure-aware pre-training tasks. One is to predict code structure edges, and the other is to align representations between source code and code structure. We implement the model in an efficient way with a graph-guided masked attention function to incorporate the code structure. We evaluate our model on four tasks, including code search, clone detection, code translation, and code refinement. Results show that code structure and newly introduced pre-training tasks can improve GraphCodeBERT and achieves state-of-the-art performance on the four downstream tasks. We further show that the model prefers structure-level attentions over token-level attentions in the task of code search.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces GraphCodeBERT, a Transformer-based pre-trained model for code that replaces syntactic AST structure with semantic data-flow edges encoding 'where-the-value-comes-from' relations between variables. It augments standard masked language modeling with two new structure-aware objectives (edge prediction and code-structure alignment) and implements them via graph-guided masked attention. The model is evaluated on code search, clone detection, code translation, and code refinement, where it reports state-of-the-art results and shows a preference for structure-level attention heads.

Significance. If the empirical gains are reproducible, the work demonstrates that a lightweight semantic graph (data flow) can deliver measurable improvements over token-only baselines and over deeper syntactic hierarchies while remaining computationally efficient. The explicit attention analysis and the two new pre-training tasks provide concrete, falsifiable evidence that structure-aware objectives transfer to downstream code tasks.

major comments (2)
  1. [§4] §4 (Experimental Setup): the paper must report the exact pre-training corpus size, vocabulary construction, and whether all baselines were re-trained on identical data; without these details the SOTA claim on the four tasks cannot be verified as arising from the proposed structure components rather than data differences.
  2. [§3.3] §3.3 (Graph-guided Masked Attention): the description of how data-flow edges are extracted from source code (e.g., via static analysis or heuristic rules) is insufficiently precise; a concrete algorithm or pseudocode is needed to ensure the structure is reproducible and not post-hoc tuned to the downstream tasks.
minor comments (2)
  1. [Abstract] Abstract: the sentence 'can improve GraphCodeBERT and achieves state-of-the-art' is grammatically awkward and should be rephrased to clarify that the added components improve upon prior models.
  2. [§3.1] Figure 1 or §3.1: the visualization of data-flow edges versus AST would benefit from an explicit side-by-side example on the same code snippet to illustrate the claimed reduction in hierarchy depth.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful and constructive evaluation. The comments focus on reproducibility, which we fully support. We address both major comments below and will revise the manuscript accordingly.

read point-by-point responses
  1. Referee: [§4] §4 (Experimental Setup): the paper must report the exact pre-training corpus size, vocabulary construction, and whether all baselines were re-trained on identical data; without these details the SOTA claim on the four tasks cannot be verified as arising from the proposed structure components rather than data differences.

    Authors: We agree that these experimental details are necessary for verifying that gains come from the proposed structure-aware components. The pre-training corpus, vocabulary construction, and baseline training protocols are identical to those in the CodeBERT paper on which GraphCodeBERT is built; we will add an explicit paragraph (or subsection) in §4 that states the exact corpus size, the BPE vocabulary construction procedure, and confirms that all reported baselines were either retrained or evaluated on the identical data splits and test sets. This revision will be made. revision: yes

  2. Referee: [§3.3] §3.3 (Graph-guided Masked Attention): the description of how data-flow edges are extracted from source code (e.g., via static analysis or heuristic rules) is insufficiently precise; a concrete algorithm or pseudocode is needed to ensure the structure is reproducible and not post-hoc tuned to the downstream tasks.

    Authors: We acknowledge that the current description of edge extraction is high-level and should be made fully reproducible. Data-flow edges are obtained via standard static reaching-definitions analysis on variable assignments and uses (not heuristics tuned to downstream tasks). In the revised manuscript we will insert a short algorithm box with pseudocode in §3.3 that outlines the steps: (1) parse the function, (2) identify variable definition and use sites, (3) compute reaching definitions, and (4) emit an edge from each definition to its reachable uses. The same deterministic procedure is applied uniformly during pre-training and downstream evaluation. This addition will be made. revision: yes
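
A toy rendering of those four steps, in our own code rather than the authors': it assumes straight-line Python, a real reaching-definitions pass would operate on a control-flow graph and handle branches and loops, and the paper's full graph additionally links right-hand-side values into the variable they define:

    import ast

    def dataflow_edges(source: str):
        """Emit definition-to-use edges for straight-line Python code.

        Mirrors the rebuttal's steps: (1) parse, (2) find definition and
        use sites, (3) track the definition reaching each use, (4) emit
        edges. Occurrences are numbered (name, i) in source order.
        """
        tree = ast.parse(source)
        seen, last_def, edges = {}, {}, []

        def occ(name):
            seen[name] = seen.get(name, 0) + 1
            return (name, seen[name])

        def record_uses(expr):
            for n in ast.walk(expr):
                if isinstance(n, ast.Name) and isinstance(n.ctx, ast.Load):
                    u = occ(n.id)
                    if n.id in last_def:            # the reaching definition
                        edges.append((last_def[n.id], u))

        for stmt in tree.body:
            if isinstance(stmt, ast.Assign):
                record_uses(stmt.value)             # reads happen first
                for t in stmt.targets:
                    if isinstance(t, ast.Name):
                        last_def[t.id] = occ(t.id)  # then the write
            elif isinstance(stmt, ast.Expr):
                record_uses(stmt.value)
        return edges

    # dataflow_edges("x = a + b\ny = x * 2\nz = y")
    # -> [(('x', 1), ('x', 2)), (('y', 1), ('y', 2))]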

Circularity Check

0 steps flagged

No significant circularity; claims are empirical

full rationale

The paper introduces GraphCodeBERT by replacing ASTs with data-flow edges and adding two pre-training objectives (edge prediction and code-structure alignment) inside a Transformer with graph-guided masked attention. All central claims are supported by downstream empirical results on code search, clone detection, translation, and refinement rather than any derivation, fitted parameter, or self-citation that reduces the result to its inputs by construction. No equations or steps equate a prediction to a fitted input; performance gains are measured against external baselines.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The model inherits the standard Transformer architecture and masked-language-modeling objective; the only added assumptions are that data-flow edges are a useful semantic proxy and that the two new objectives can be optimized jointly with MLM.

axioms (1)
  • domain assumption Data-flow edges capture the essential semantic relations needed for code understanding tasks.
    Stated in the abstract as the motivation for choosing data flow over AST.

pith-pipeline@v0.9.0 · 5644 in / 1171 out tokens · 21509 ms · 2026-05-15T08:41:29.056846+00:00 · methodology


Forward citations

Cited by 20 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Evaluating Tool Cloning in Agentic-AI Ecosystems

    cs.SE 2026-05 unverdicted novelty 7.0

    Tool cloning is pervasive in agentic AI ecosystems, with 60% of high-Jaccard and 85% of high-ssdeep similar pairs verified as true clones in a study of over 8,800 repositories.

  2. RepoDoc: A Knowledge Graph-Based Framework to Automatic Documentation Generation and Incremental Updates

    cs.SE 2026-04 unverdicted novelty 7.0

    RepoDoc uses a repository knowledge graph with module clustering and semantic impact propagation to generate more complete documentation 3x faster with 85% fewer tokens and handle incremental updates 73% faster than p...

  3. Structural Anchors and Reasoning Fragility: Understanding CoT Robustness in LLM4Code

    cs.SE 2026-04 unverdicted novelty 7.0

    CoT prompting in LLM4Code shows mixed robustness that depends on model family, task structure, and perturbations destabilizing structural anchors, leading to trajectory deformations like lengthening, branching, and si...

  4. Static Program Slicing Using Language Models With Dataflow-Aware Pretraining and Constrained Decoding

    cs.SE 2026-04 unverdicted novelty 7.0

    Sliceformer improves static program slicing accuracy by up to 22% ExactMatch on Java/Python benchmarks through dataflow-preserving pretraining and lexical/syntactic constrained decoding in language models.

  5. AgentCoder: Multi-Agent-based Code Generation with Iterative Testing and Optimisation

    cs.CL 2023-12 accept novelty 7.0

    A three-agent loop of code generation, test creation, and execution feedback lifts pass@1 to 96.3% on HumanEval and 91.8% on MBPP for GPT-4 while using roughly half the tokens of prior state-of-the-art.

  6. CodeBLEU: a Method for Automatic Evaluation of Code Synthesis

    cs.SE 2020-09 conditional novelty 7.0

    CodeBLEU improves correlation with human programmer scores on code synthesis tasks by adding syntactic AST matching and semantic data-flow matching to the standard BLEU n-gram approach.

  7. NeuroFlake: A Neuro-Symbolic LLM Framework for Flaky Test Classification

    cs.SE 2026-05 unverdicted novelty 6.0

    NeuroFlake integrates discriminative token mining into LLMs to classify flaky tests, raising F1-score to 69.34% on FlakeBench while showing greater robustness to semantic-preserving perturbations than prior methods.

  8. VulStyle: A Multi-Modal Pre-Training for Code Stylometry-Augmented Vulnerability Detection

    cs.CR 2026-04 unverdicted novelty 6.0

    VulStyle pre-trains on 4.9M functions using code, non-terminal ASTs, and stylometry features, then fine-tunes to achieve SOTA F1 gains of 4-48% on BigVul and VulDeePecker.

  9. Residual Risk Analysis in Benign Code: How Far Are We? A Multi-Model Semantic and Structural Similarity Approach

    cs.SE 2026-04 unverdicted novelty 6.0

    Patched functions often remain similar to vulnerable ones, and a new multi-model similarity scoring system identifies residual issues like null pointer dereferences in 61% of high-risk cases from the PrimeVul dataset.

  10. On the Effectiveness of Context Compression for Repository-Level Tasks: An Empirical Investigation

    cs.SE 2026-04 unverdicted novelty 6.0

    Continuous latent-vector compression improves BLEU scores on repository-level code tasks by up to 28.3% at 4x compression while cutting inference latency.

  11. Verify Before You Fix: Agentic Execution Grounding for Trustworthy Cross-Language Code Analysis

    cs.SE 2026-04 unverdicted novelty 6.0

    A framework combining universal AST normalization, hybrid graph-LLM embeddings, and strict execution-grounded validation achieves 89-92% intra-language accuracy and 74-80% cross-language F1 while resolving 70% of vuln...

  12. DiffHLS: Differential Learning for High-Level Synthesis QoR Prediction with GNNs and LLM Code Embeddings

    cs.LG 2026-04 unverdicted novelty 6.0

    DiffHLS predicts HLS QoR via differential learning: separate GNN+LLM models for kernel baseline and design delta are composed to yield the final estimate, showing lower MAPE than GNN baselines on PolyBench.

  13. Static Program Slicing Using Language Models With Dataflow-Aware Pretraining and Constrained Decoding

    cs.SE 2026-04 unverdicted novelty 6.0

    Sliceformer reformulates static program slicing as seq2seq using CodeT5+ with dataflow-aware pretraining via DFG permutation and span corruption plus constrained decoding, yielding up to 22% ExactMatch gains on Java a...

  14. AFGNN: API Misuse Detection using Graph Neural Networks and Clustering

    cs.SE 2026-04 unverdicted novelty 6.0

    AFGNN detects API misuses in Java code more effectively than prior methods by representing usage as graphs and clustering learned embeddings from self-supervised training.

  15. On the Role of Fault Localization Context for LLM-Based Program Repair

    cs.SE 2026-04 unverdicted novelty 6.0

    More fault localization context does not consistently improve LLM-based program repair; file-level context gives 15-17x gains, optimal around 6-10 files, while line-level context often degrades performance from noise.

  16. DCVD: Dual-Channel Cross-Modal Fusion for Joint Vulnerability Detection and Localization

    cs.CR 2026-05 unverdicted novelty 5.0

    DCVD performs joint function-level vulnerability detection and statement-level localization by extracting control-dependency and semantic features in parallel branches, fusing them with contrastive alignment and bidir...

  17. Learning Generalizable Multimodal Representations for Software Vulnerability Detection

    cs.SE 2026-04 unverdicted novelty 5.0

    MultiVul uses multimodal contrastive learning to align code and comment representations, yielding up to 27% F1 gains on vulnerability detection benchmarks over prompting and code-only baselines.

  18. PLMGH: What Matters in PLM-GNN Hybrids for Code Classification and Vulnerability Detection

    cs.SE 2026-04 unverdicted novelty 5.0

    Controlled experiments show PLM-GNN hybrids improve code tasks over GNN-only baselines, with PLM source having larger impact than GNN backbone.

  19. Prompt-Driven Code Summarization: A Systematic Literature Review

    cs.SE 2026-04 unverdicted novelty 4.0

    A systematic review that categorizes prompting strategies for LLM-based code summarization, assesses their effectiveness, and identifies gaps in research and evaluation practices.

  20. A Survey on Large Language Models for Code Generation

    cs.CL 2024-06 unverdicted novelty 3.0

    A systematic literature review that organizes recent work on LLMs for code generation into a taxonomy covering data curation, model advances, evaluations, ethics, environmental impact, and applications, with benchmark...

Reference graph

Works this paper leans on

35 extracted references · 35 canonical work pages · cited by 19 Pith papers · 12 internal anchors

  1. [1]

    code2seq: Generating Sequences from Structured Representations of Code

    Uri Alon, Shaked Brody, Omer Levy, and Eran Yahav. code2seq: Generating sequences from structured representations of code. arXiv preprint arXiv:1808.01400,

  2. [2]

    Structural language models of code

    Uri Alon, Roy Sadaka, Omer Levy, and Eran Yahav. Structural language models of code. arXiv preprint, 2019,

  3. [3]

    Generative Code Modeling with Graphs

    Marc Brockschmidt, Miltiadis Allamanis, Alexander L Gaunt, and Oleksandr Polozov. Generative code modeling with graphs. arXiv preprint arXiv:1805.08490,

  4. [4]

    Exploring software naturalness through neural language models

    Luca Buratti, Saurabh Pujar, Mihaela Bornea, Scott McCarley, Yunhui Zheng, Gaetano Rossiello, Alessandro Morari, Jim Laredo, Veronika Thost, Yufan Zhuang, et al. Exploring software naturalness through neural language models. arXiv preprint arXiv:2006.12641,

  5. [5]

    BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

    Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805,

  6. [6]

    CodeBERT: A Pre-Trained Model for Programming and Natural Languages

    Zhangyin Feng, Daya Guo, Duyu Tang, Nan Duan, Xiaocheng Feng, Ming Gong, Linjun Shou, Bing Qin, Ting Liu, Daxin Jiang, et al. Codebert: A pre-trained model for programming and natural languages. arXiv preprint arXiv:2002.08155,

  7. [7]

    Coupling Retrieval and Meta-Learning for Context-Dependent Semantic Parsing

    Daya Guo, Duyu Tang, Nan Duan, M. Zhou, and Jian Yin. Coupling retrieval and meta-learning for context-dependent semantic parsing. ArXiv, abs/1906.07108,

  8. [8]

    CodeSearchNet Challenge: Evaluating the State of Semantic Code Search

    Hamel Husain, Ho-Hsiang Wu, Tiferet Gazit, Miltiadis Allamanis, and Marc Brockschmidt. CodeSearchNet challenge: Evaluating the state of semantic code search. arXiv preprint arXiv:1909.09436,

  9. [9]

    Pre-trained contextual embedding of source code

    Aditya Kanade, Petros Maniatis, Gogul Balakrishnan, and Kensen Shi. Pre-trained contextual embedding of source code. arXiv preprint arXiv:2001.00059,

  10. [10]

    Phrase-based statistical translation of programming languages

    Svetoslav Karaivanov, Veselin Raychev, and Martin Vechev. Phrase-based statistical translation of programming languages. In Proceedings of the 2014 ACM International Symposium on New Ideas, New Paradigms, and Reflections on Programming & Software, pp. 173–184,

  11. [11]

    SCELMo: Source Code Embeddings from Language Models

    Rafael-Michael Karampatsis and Charles Sutton. Scelmo: Source code embeddings from language models. arXiv preprint arXiv:2004.13214,

  12. [12]

    Code prediction by feeding trees to transformers

    Seohyun Kim, Jinman Zhao, Yuchi Tian, and Satish Chandra. Code prediction by feeding trees to transformers. arXiv preprint arXiv:2003.13848,

  13. [13]

    Cross-lingual Language Model Pretraining

    Guillaume Lample and Alexis Conneau. Cross-lingual language model pretraining. arXiv preprint arXiv:1901.07291,

  14. [14]

    Code completion with neural attention and pointer networks

    Jian Li, Yue Wang, Michael R Lyu, and Irwin King. Code completion with neural attention and pointer networks. arXiv preprint arXiv:1711.09573,

  15. [15]

    RoBERTa: A Robustly Optimized BERT Pretraining Approach

    Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692,

  16. [16]

    Graph-based statistical language model for code

    Anh Tuan Nguyen and Tien N Nguyen. Graph-based statistical language model for code. In 2015 IEEE/ACM 37th IEEE International Conference on Software Engineering, volume 1, pp. 858–868. IEEE,

  17. [17]

    Lexical statistical machine translation for language migration

    Anh Tuan Nguyen, Tung Thanh Nguyen, and Tien N Nguyen. Lexical statistical machine translation for language migration. In Proceedings of the 2013 9th Joint Meeting on Foundations of Software Engineering, pp. 651–654,

  18. [18]

    Divide-and-conquer approach for multi-phase statistical migration for source code (t)

    Anh Tuan Nguyen, Tung Thanh Nguyen, and Tien N Nguyen. Divide-and-conquer approach for multi-phase statistical migration for source code (t). In 2015 30th IEEE/ACM International Conference on Automated Software Engineering (ASE), pp. 585–596. IEEE,

  19. [19]

    Deep contextualized word representations

    Matthew E Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. Deep contextualized word representations. arXiv preprint arXiv:1802.05365,

  20. [20]

    Abstract Syntax Networks for Code Generation and Semantic Parsing

    Maxim Rabinovich, Mitchell Stern, and Dan Klein. Abstract syntax networks for code generation and semantic parsing. arXiv preprint arXiv:1704.07535,

  21. [21]

    Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

    Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv preprint arXiv:1910.10683,

  22. [22]

    Towards a big data curated benchmark of inter-project code clones

    Jeffrey Svajlenko, Judith F Islam, Iman Keivanloo, Chanchal K Roy, and Mohammad Mamun Mia. Towards a big data curated benchmark of inter-project code clones. In 2014 IEEE International Conference on Software Maintenance and Evolution, pp. 476–480. IEEE,

  23. [23]

    Intellicode compose: Code generation using transformer

    Alexey Svyatkovskiy, Shao Kun Deng, Shengyu Fu, and Neel Sundaresan. Intellicode compose: Code generation using transformer. arXiv preprint arXiv:2005.08025,

  24. [24]

    Detecting code clones with graph neural network and flow-augmented abstract syntax tree

    Wenhan Wang, Ge Li, Bo Ma, Xin Xia, and Zhi Jin. Detecting code clones with graph neural network and flow-augmented abstract syntax tree. arXiv preprint arXiv:2002.08653,

  25. [25]

    Deep learning code fragments for code clone detection

    Martin White, Michele Tufano, Christopher Vendome, and Denys Poshyvanyk. Deep learning code fragments for code clone detection. In 2016 31st IEEE/ACM International Conference on Automated Software Engineering (ASE), pp. 87–98. IEEE,

  26. [26]

    XLNet: Generalized Autoregressive Pretraining for Language Understanding

    Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, and Quoc V Le. Xlnet: Generalized autoregressive pretraining for language understanding. arXiv preprint arXiv:1906.08237,

  27. [27]

    A Novel Neural Source Code Representation Based on Abstract Syntax Tree

    Jian Zhang, Xu Wang, Hongyu Zhang, Hailong Sun, Kaixuan Wang, and Xudong Liu. A novel neural source code representation based on abstract syntax tree. In 2019 IEEE/ACM 41st International Conference on Software Engineering (ICSE), pp. 783–794. IEEE,

  28. [28]

    The dataset is the CodeSearchNet dataset 3 (Husain et al., 2019), which includes 2.3M functions with document pairs for six programming languages

    to pretrain our model. The dataset is the CodeSearchNet dataset 3 (Husain et al., 2019), which includes 2.3M functions with document pairs for six programming languages. We train the model on two DGX-2 machines, each having 16 NVIDIA Tesla V100 with 32GB memory. We set the max length of sequences and nodes as 512 and 128, respectively. We use the Adam opt...

  29. [29]

    http://

    and follow Husain et al. (2019) to take the first paragraph of the documentation as the query for the corresponding function. However, we observe that some queries contain content unrelated to the code, such as a link “http://...” that refers to external resources. Therefore, we filter following examples to improve the quality of the dataset. (1) Examples w...

  30. [30]

    (4) Examples whose query is empty or not written in English

    (3) Examples whose query contains special tokens such as “http://”. (4) Examples whose query is empty or not written in English. (Footnote 3: https://github.com/github/CodeSearchNet.) Different from the setting of Husain et al. (2019), the answer of each query is retrieved from the whole development and testing code corpu...

  31. [31]

    We also report the results using the same setting of Husain et al

    We use the Adam optimizer to update model parameters and perform early stopping on the development set. We also report the results using the same setting of Husain et al. (2019) in Table

  32. [32]

    The results show that GraphCodeBERT also achieves the state-of-the-art performance

    In this setting, models are required to retrieve an answer for a query from 1000 candidates. The results show that GraphCodeBERT also achieves the state-of-the-art performance. model Ruby Javascript Go Python Java Php Overall NBow 0.429 0.461 0.641 0.581 0.514 0.484 0.518 CNN 0.245 0.352 0.627 0.571 0.527 0.529 0.475 BiRNN 0.084 0.153 0.452 0.321 0.287 0....

  33. [33]

    Therefore, two codes are semantically similar since they output similar results when given the same input

    In this example, two Java source codes both download content from a given URL and convert the type of the content into string type. Therefore, two codes are semantically similar since they output similar results when given the same input. As we can see, our model gives a high score for this case and the pair is classified as true clone pair. 13 Published a...

  34. [34]

    boolean” to “bool

    In this example, the model successfully translates a piece of Java code into its C# version. The differences include the type name (from “boolean” to “bool”) and the usage of getting a string value of a bool variable (from “String.valueOf(b)” to “b.ToString()”). Figure 7: A case of GraphCodeBERT output for the code translation task. 4http://lucene.apache....

  35. [35]

    http://kmttg.googlecode.com/svn/trunk/version

    The first source code is to return the HTML content from a given URL, while the second source code is to return the last line from a fixed URL “http://kmttg.googlecode.com/svn/trunk/version”. Their semantics are not similar due to their different outputs. Data flow could help GraphCodeBERT better understand that the return value “pageHTML” in first source cod...