pith. machine review for the scientific record.

arxiv: 2009.08366 · v4 · submitted 2020-09-17 · 💻 cs.SE · cs.CL

Recognition: 1 theorem link

GraphCodeBERT: Pre-training Code Representations with Data Flow

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 08:41 UTC · model grok-4.3

classification 💻 cs.SE · cs.CL
keywords pre-trained models · code representations · data flow · programming languages · code search · clone detection · code translation · code refinement

The pith

GraphCodeBERT improves code understanding by pre-training on data flow edges that track where variable values come from.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Existing code models treat programs as token sequences and miss the relations that define their meaning. This work replaces deep syntax trees with a flatter data-flow graph that records definition-use links between variables. Two new pre-training tasks teach the model to predict those links and to align token representations with the graph nodes. The resulting Transformer, equipped with graph-guided attention, reaches state-of-the-art accuracy on code search, clone detection, translation, and refinement.
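
To make the structure concrete, here is a hand-worked example, ours rather than the paper's, of the 'where-the-value-comes-from' edges on a three-line snippet; the a@1-style occurrence labels are our notation:

    # Snippet:
    #   x = a + b
    #   y = x * 2
    #   return y
    #
    # Number each variable occurrence in source order and read an edge
    # (src, dst) as "the value at dst comes from src".
    edges = [
        ("a@1", "x@1"), ("b@1", "x@1"),  # x's new value comes from a and b
        ("x@1", "x@2"),                  # the read of x on line 2 sees line 1's definition
        ("x@2", "y@1"),                  # y's value comes from that read of x
        ("y@1", "y@2"),                  # the returned y is the one defined on line 2
    ]
    # Unlike the AST of the same snippet (Module -> Assign -> BinOp -> ...),
    # this graph has no nesting: it is a flat edge set over variable occurrences.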

Core claim

GraphCodeBERT augments masked language modeling with edge prediction on the data-flow graph and cross-modal alignment between source code and the graph. The data-flow graph encodes semantic relations of the form 'where-the-value-comes-from' without the deep nesting of an abstract syntax tree. These structure-aware objectives are realized through an efficient graph-guided masked attention mechanism inside a Transformer, yielding measurable gains on four downstream code tasks.
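
A minimal sketch of how such a graph-guided mask could be assembled. This is our reading of the mechanism, not the authors' code; the [code tokens | graph nodes] sequence layout and the helper signature are assumptions:

    import torch

    NEG_INF = torch.finfo(torch.float32).min

    def graph_guided_mask(n_code, n_node, flow_edges, alignments):
        """Additive attention mask over a [code tokens | graph nodes] sequence.

        flow_edges : (src, dst) node-index pairs from the data-flow graph
        alignments : (node, token) pairs linking a variable node to the
                     code token it was extracted from
        Entries are 0.0 where attention is allowed, NEG_INF where blocked.
        """
        size = n_code + n_node
        mask = torch.full((size, size), NEG_INF)

        # Code tokens attend to one another freely.
        mask[:n_code, :n_code] = 0.0

        # Each node attends to itself and to the nodes its value comes from.
        idx = torch.arange(n_node)
        mask[n_code + idx, n_code + idx] = 0.0
        for src, dst in flow_edges:
            mask[n_code + dst, n_code + src] = 0.0

        # A node and the code token it was identified from attend to each other.
        for node, tok in alignments:
            mask[n_code + node, tok] = 0.0
            mask[tok, n_code + node] = 0.0

        return mask  # added to pre-softmax attention scores

Blocked positions receive effectively zero weight after the softmax, so structure enters at the cost of one additive mask. The ablation contemplated under 'What would settle it' below amounts to dropping the nodes and using an all-zeros mask over the code tokens alone.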

What carries the argument

The data-flow graph, which links variables by their definition-use relations, together with graph-guided masked attention that lets the Transformer attend along those edges.

If this is right

  • Code models can capture semantic relations more efficiently by using flat data-flow graphs rather than deep parse trees.
  • Adding explicit structure-prediction and alignment objectives during pre-training produces measurable gains on search, detection, and repair tasks.
  • The graph-guided attention mechanism allows a standard Transformer to incorporate code structure at modest extra cost.
  • State-of-the-art results on four distinct tasks indicate that semantic structure transfers across code understanding problems.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same data-flow pre-training could be applied to languages beyond those evaluated, testing whether the semantic edges are language-agnostic.
  • Hybrid models that combine data-flow edges with selected AST subtrees might further improve performance on tasks that require deep syntactic awareness.
  • Downstream tools such as automated program repair or code summarization may benefit from the richer variable-relation representations learned here.

Load-bearing premise

Data-flow edges supply enough semantic structure to improve code understanding without needing the full syntactic hierarchy of an abstract syntax tree.

What would settle it

An ablation of the model, trained without any data-flow edges, matching or exceeding the full GraphCodeBERT on the four evaluation tasks; that outcome would refute the load-bearing premise above.

read the original abstract

Pre-trained models for programming language have achieved dramatic empirical improvements on a variety of code-related tasks such as code search, code completion, code summarization, etc. However, existing pre-trained models regard a code snippet as a sequence of tokens, while ignoring the inherent structure of code, which provides crucial code semantics and would enhance the code understanding process. We present GraphCodeBERT, a pre-trained model for programming language that considers the inherent structure of code. Instead of taking syntactic-level structure of code like abstract syntax tree (AST), we use data flow in the pre-training stage, which is a semantic-level structure of code that encodes the relation of "where-the-value-comes-from" between variables. Such a semantic-level structure is neat and does not bring an unnecessarily deep hierarchy of AST, the property of which makes the model more efficient. We develop GraphCodeBERT based on Transformer. In addition to using the task of masked language modeling, we introduce two structure-aware pre-training tasks. One is to predict code structure edges, and the other is to align representations between source code and code structure. We implement the model in an efficient way with a graph-guided masked attention function to incorporate the code structure. We evaluate our model on four tasks, including code search, clone detection, code translation, and code refinement. Results show that code structure and newly introduced pre-training tasks can improve GraphCodeBERT and achieves state-of-the-art performance on the four downstream tasks. We further show that the model prefers structure-level attentions over token-level attentions in the task of code search.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces GraphCodeBERT, a Transformer-based pre-trained model for code that replaces syntactic AST structure with semantic data-flow edges encoding 'where-the-value-comes-from' relations between variables. It augments standard masked language modeling with two new structure-aware objectives (edge prediction and code-structure alignment) and implements them via graph-guided masked attention. The model is evaluated on code search, clone detection, code translation, and code refinement, where it reports state-of-the-art results and shows a preference for structure-level attention heads.

Significance. If the empirical gains are reproducible, the work demonstrates that a lightweight semantic graph (data flow) can deliver measurable improvements over token-only baselines and over deeper syntactic hierarchies while remaining computationally efficient. The explicit attention analysis and the two new pre-training tasks provide concrete, falsifiable evidence that structure-aware objectives transfer to downstream code tasks.

major comments (2)
  1. [§4] §4 (Experimental Setup): the paper must report the exact pre-training corpus size, vocabulary construction, and whether all baselines were re-trained on identical data; without these details the SOTA claim on the four tasks cannot be verified as arising from the proposed structure components rather than data differences.
  2. [§3.3] §3.3 (Graph-guided Masked Attention): the description of how data-flow edges are extracted from source code (e.g., via static analysis or heuristic rules) is insufficiently precise; a concrete algorithm or pseudocode is needed to ensure the structure is reproducible and not post-hoc tuned to the downstream tasks.
minor comments (2)
  1. [Abstract] Abstract: the sentence 'can improve GraphCodeBERT and achieves state-of-the-art' is grammatically awkward and should be rephrased to clarify that the added components improve upon prior models.
  2. [§3.1] Figure 1 or §3.1: the visualization of data-flow edges versus AST would benefit from an explicit side-by-side example on the same code snippet to illustrate the claimed reduction in hierarchy depth.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful and constructive evaluation. The comments focus on reproducibility, which we fully support. We address both major comments below and will revise the manuscript accordingly.

read point-by-point responses
  1. Referee: [§4] §4 (Experimental Setup): the paper must report the exact pre-training corpus size, vocabulary construction, and whether all baselines were re-trained on identical data; without these details the SOTA claim on the four tasks cannot be verified as arising from the proposed structure components rather than data differences.

    Authors: We agree that these experimental details are necessary for verifying that gains come from the proposed structure-aware components. The pre-training corpus, vocabulary construction, and baseline training protocols are identical to those in the CodeBERT paper on which GraphCodeBERT is built; we will add an explicit paragraph (or subsection) in §4 that states the exact corpus size, the BPE vocabulary construction procedure, and confirms that all reported baselines were either retrained or evaluated on the identical data splits and test sets. This revision will be made. revision: yes

  2. Referee: [§3.3] §3.3 (Graph-guided Masked Attention): the description of how data-flow edges are extracted from source code (e.g., via static analysis or heuristic rules) is insufficiently precise; a concrete algorithm or pseudocode is needed to ensure the structure is reproducible and not post-hoc tuned to the downstream tasks.

    Authors: We acknowledge that the current description of edge extraction is high-level and should be made fully reproducible. Data-flow edges are obtained via standard static reaching-definitions analysis on variable assignments and uses (not heuristics tuned to downstream tasks). In the revised manuscript we will insert a short algorithm box with pseudocode in §3.3 that outlines the steps: (1) parse the function, (2) identify variable definition and use sites, (3) compute reaching definitions, and (4) emit an edge from each definition to its reachable uses. The same deterministic procedure is applied uniformly during pre-training and downstream evaluation. This addition will be made. revision: yes
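
A toy rendering of those four steps, in our own code rather than the authors': it assumes straight-line Python, a real reaching-definitions pass would operate on a control-flow graph and handle branches and loops, and the paper's full graph additionally links right-hand-side values into the variable they define:

    import ast

    def dataflow_edges(source: str):
        """Emit definition-to-use edges for straight-line Python code.

        Mirrors the rebuttal's steps: (1) parse, (2) find definition and
        use sites, (3) track the definition reaching each use, (4) emit
        edges. Occurrences are numbered (name, i) in source order.
        """
        tree = ast.parse(source)
        seen, last_def, edges = {}, {}, []

        def occ(name):
            seen[name] = seen.get(name, 0) + 1
            return (name, seen[name])

        def record_uses(expr):
            for n in ast.walk(expr):
                if isinstance(n, ast.Name) and isinstance(n.ctx, ast.Load):
                    u = occ(n.id)
                    if n.id in last_def:            # the reaching definition
                        edges.append((last_def[n.id], u))

        for stmt in tree.body:
            if isinstance(stmt, ast.Assign):
                record_uses(stmt.value)             # reads happen first
                for t in stmt.targets:
                    if isinstance(t, ast.Name):
                        last_def[t.id] = occ(t.id)  # then the write
            elif isinstance(stmt, ast.Expr):
                record_uses(stmt.value)
        return edges

    # dataflow_edges("x = a + b\ny = x * 2\nz = y")
    # -> [(('x', 1), ('x', 2)), (('y', 1), ('y', 2))]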

Circularity Check

0 steps flagged

No significant circularity; claims are empirical

full rationale

The paper introduces GraphCodeBERT by replacing ASTs with data-flow edges and adding two pre-training objectives (edge prediction and code-structure alignment) inside a Transformer with graph-guided masked attention. All central claims are supported by downstream empirical results on code search, clone detection, translation, and refinement rather than any derivation, fitted parameter, or self-citation that reduces the result to its inputs by construction. No equations or steps equate a prediction to a fitted input; performance gains are measured against external baselines.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The model inherits the standard Transformer architecture and masked-language-modeling objective; the only added assumptions are that data-flow edges are a useful semantic proxy and that the two new objectives can be optimized jointly with MLM.

axioms (1)
  • domain assumption Data-flow edges capture the essential semantic relations needed for code understanding tasks.
    Stated in the abstract as the motivation for choosing data flow over AST.

pith-pipeline@v0.9.0 · 5644 in / 1171 out tokens · 21509 ms · 2026-05-15T08:41:29.056846+00:00 · methodology


Forward citations

Cited by 20 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Evaluating Tool Cloning in Agentic-AI Ecosystems

    cs.SE 2026-05 unverdicted novelty 7.0

    Tool cloning is pervasive in agentic AI ecosystems, with 60% of high-Jaccard and 85% of high-ssdeep similar pairs verified as true clones in a study of over 8,800 repositories.

  2. RepoDoc: A Knowledge Graph-Based Framework to Automatic Documentation Generation and Incremental Updates

    cs.SE 2026-04 unverdicted novelty 7.0

    RepoDoc uses a repository knowledge graph with module clustering and semantic impact propagation to generate more complete documentation 3x faster with 85% fewer tokens and handle incremental updates 73% faster than p...

  3. Structural Anchors and Reasoning Fragility: Understanding CoT Robustness in LLM4Code

    cs.SE 2026-04 unverdicted novelty 7.0

    CoT prompting in LLM4Code shows mixed robustness that depends on model family, task structure, and perturbations destabilizing structural anchors, leading to trajectory deformations like lengthening, branching, and si...

  4. Static Program Slicing Using Language Models With Dataflow-Aware Pretraining and Constrained Decoding

    cs.SE 2026-04 unverdicted novelty 7.0

    Sliceformer improves static program slicing accuracy by up to 22% ExactMatch on Java/Python benchmarks through dataflow-preserving pretraining and lexical/syntactic constrained decoding in language models.

  5. AgentCoder: Multi-Agent-based Code Generation with Iterative Testing and Optimisation

    cs.CL 2023-12 accept novelty 7.0

    A three-agent loop of code generation, test creation, and execution feedback lifts pass@1 to 96.3% on HumanEval and 91.8% on MBPP for GPT-4 while using roughly half the tokens of prior state-of-the-art.

  6. CodeBLEU: a Method for Automatic Evaluation of Code Synthesis

    cs.SE 2020-09 conditional novelty 7.0

    CodeBLEU improves correlation with human programmer scores on code synthesis tasks by adding syntactic AST matching and semantic data-flow matching to the standard BLEU n-gram approach.

  7. NeuroFlake: A Neuro-Symbolic LLM Framework for Flaky Test Classification

    cs.SE 2026-05 unverdicted novelty 6.0

    NeuroFlake integrates discriminative token mining into LLMs to classify flaky tests, raising F1-score to 69.34% on FlakeBench while showing greater robustness to semantic-preserving perturbations than prior methods.

  8. VulStyle: A Multi-Modal Pre-Training for Code Stylometry-Augmented Vulnerability Detection

    cs.CR 2026-04 unverdicted novelty 6.0

    VulStyle pre-trains on 4.9M functions using code, non-terminal ASTs, and stylometry features, then fine-tunes to achieve SOTA F1 gains of 4-48% on BigVul and VulDeePecker.

  9. Residual Risk Analysis in Benign Code: How Far Are We? A Multi-Model Semantic and Structural Similarity Approach

    cs.SE 2026-04 unverdicted novelty 6.0

    Patched functions often remain similar to vulnerable ones, and a new multi-model similarity scoring system identifies residual issues like null pointer dereferences in 61% of high-risk cases from the PrimeVul dataset.

  10. On the Effectiveness of Context Compression for Repository-Level Tasks: An Empirical Investigation

    cs.SE 2026-04 unverdicted novelty 6.0

    Continuous latent-vector compression improves BLEU scores on repository-level code tasks by up to 28.3% at 4x compression while cutting inference latency.

  11. Verify Before You Fix: Agentic Execution Grounding for Trustworthy Cross-Language Code Analysis

    cs.SE 2026-04 unverdicted novelty 6.0

    A framework combining universal AST normalization, hybrid graph-LLM embeddings, and strict execution-grounded validation achieves 89-92% intra-language accuracy and 74-80% cross-language F1 while resolving 70% of vuln...

  12. DiffHLS: Differential Learning for High-Level Synthesis QoR Prediction with GNNs and LLM Code Embeddings

    cs.LG 2026-04 unverdicted novelty 6.0

    DiffHLS predicts HLS QoR via differential learning: separate GNN+LLM models for kernel baseline and design delta are composed to yield the final estimate, showing lower MAPE than GNN baselines on PolyBench.

  13. Static Program Slicing Using Language Models With Dataflow-Aware Pretraining and Constrained Decoding

    cs.SE 2026-04 unverdicted novelty 6.0

    Sliceformer reformulates static program slicing as seq2seq using CodeT5+ with dataflow-aware pretraining via DFG permutation and span corruption plus constrained decoding, yielding up to 22% ExactMatch gains on Java a...

  14. AFGNN: API Misuse Detection using Graph Neural Networks and Clustering

    cs.SE 2026-04 unverdicted novelty 6.0

    AFGNN detects API misuses in Java code more effectively than prior methods by representing usage as graphs and clustering learned embeddings from self-supervised training.

  15. On the Role of Fault Localization Context for LLM-Based Program Repair

    cs.SE 2026-04 unverdicted novelty 6.0

    More fault localization context does not consistently improve LLM-based program repair; file-level context gives 15-17x gains, optimal around 6-10 files, while line-level context often degrades performance from noise.

  16. DCVD: Dual-Channel Cross-Modal Fusion for Joint Vulnerability Detection and Localization

    cs.CR 2026-05 unverdicted novelty 5.0

    DCVD performs joint function-level vulnerability detection and statement-level localization by extracting control-dependency and semantic features in parallel branches, fusing them with contrastive alignment and bidir...

  17. Learning Generalizable Multimodal Representations for Software Vulnerability Detection

    cs.SE 2026-04 unverdicted novelty 5.0

    MultiVul uses multimodal contrastive learning to align code and comment representations, yielding up to 27% F1 gains on vulnerability detection benchmarks over prompting and code-only baselines.

  18. PLMGH: What Matters in PLM-GNN Hybrids for Code Classification and Vulnerability Detection

    cs.SE 2026-04 unverdicted novelty 5.0

    Controlled experiments show PLM-GNN hybrids improve code tasks over GNN-only baselines, with PLM source having larger impact than GNN backbone.

  19. Prompt-Driven Code Summarization: A Systematic Literature Review

    cs.SE 2026-04 unverdicted novelty 4.0

    A systematic review that categorizes prompting strategies for LLM-based code summarization, assesses their effectiveness, and identifies gaps in research and evaluation practices.

  20. A Survey on Large Language Models for Code Generation

    cs.CL 2024-06 unverdicted novelty 3.0

    A systematic literature review that organizes recent work on LLMs for code generation into a taxonomy covering data curation, model advances, evaluations, ethics, environmental impact, and applications, with benchmark...

Reference graph

Works this paper leans on

35 extracted references · 35 canonical work pages · cited by 19 Pith papers · 12 internal anchors

  1. [1]

    code2seq: Generating Sequences from Structured Representations of Code

    Uri Alon, Shaked Brody, Omer Levy, and Eran Yahav. code2seq: Generating sequences from structured representations of code. arXiv preprint arXiv:1808.01400,

  2. [2]

    Structural language models of code

    Uri Alon, Roy Sadaka, Omer Levy, and Eran Yahav. Structural language models of code. arXiv preprint, 2019,

  3. [3]

    Generative Code Modeling with Graphs

    Marc Brockschmidt, Miltiadis Allamanis, Alexander L Gaunt, and Oleksandr Polozov. Generative code modeling with graphs. arXiv preprint arXiv:1805.08490,

  4. [4]

    Exploring software naturalness through neural language models

    Luca Buratti, Saurabh Pujar, Mihaela Bornea, Scott McCarley, Yunhui Zheng, Gaetano Rossiello, Alessandro Morari, Jim Laredo, Veronika Thost, Yufan Zhuang, et al. Exploring software naturalness through neural language models. arXiv preprint arXiv:2006.12641,

  5. [5]

    BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

    Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805,

  6. [6]

    CodeBERT: A Pre-Trained Model for Programming and Natural Languages

    Zhangyin Feng, Daya Guo, Duyu Tang, Nan Duan, Xiaocheng Feng, Ming Gong, Linjun Shou, Bing Qin, Ting Liu, Daxin Jiang, et al. Codebert: A pre-trained model for programming and natural languages. arXiv preprint arXiv:2002.08155,

  7. [7]

    Coupling Retrieval and Meta-Learning for Context-Dependent Semantic Parsing

    Daya Guo, Duyu Tang, Nan Duan, M. Zhou, and Jian Yin. Coupling retrieval and meta-learning for context-dependent semantic parsing. ArXiv, abs/1906.07108,

  8. [8]

    CodeSearchNet Challenge: Evaluating the State of Semantic Code Search

    Hamel Husain, Ho-Hsiang Wu, Tiferet Gazit, Miltiadis Allamanis, and Marc Brockschmidt. CodeSearchNet challenge: Evaluating the state of semantic code search. arXiv preprint arXiv:1909.09436,

  9. [9]

    Pre-trained contextual embedding of source code

    Aditya Kanade, Petros Maniatis, Gogul Balakrishnan, and Kensen Shi. Pre-trained contextual embedding of source code. arXiv preprint arXiv:2001.00059,

  10. [10]

    Phrase-based statistical translation of programming languages

    Svetoslav Karaivanov, Veselin Raychev, and Martin Vechev. Phrase-based statistical translation of programming languages. In Proceedings of the 2014 ACM International Symposium on New Ideas, New Paradigms, and Reflections on Programming & Software, pp. 173–184,

  11. [11]

    SCELMo: Source Code Embeddings from Language Models

    Rafael-Michael Karampatsis and Charles Sutton. Scelmo: Source code embeddings from language models. arXiv preprint arXiv:2004.13214,

  12. [12]

    Code prediction by feeding trees to transformers

    Seohyun Kim, Jinman Zhao, Yuchi Tian, and Satish Chandra. Code prediction by feeding trees to transformers. arXiv preprint arXiv:2003.13848,

  13. [13]

    Cross-lingual Language Model Pretraining

    Guillaume Lample and Alexis Conneau. Cross-lingual language model pretraining. arXiv preprint arXiv:1901.07291,

  14. [14]

    Code completion with neural attention and pointer networks

    Jian Li, Yue Wang, Michael R Lyu, and Irwin King. Code completion with neural attention and pointer networks. arXiv preprint arXiv:1711.09573,

  15. [15]

    RoBERTa: A Robustly Optimized BERT Pretraining Approach

    Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692,

  16. [16]

    Graph-based statistical language model for code

    Anh Tuan Nguyen and Tien N Nguyen. Graph-based statistical language model for code. In 2015 IEEE/ACM 37th IEEE International Conference on Software Engineering, volume 1, pp. 858–868. IEEE,

  17. [17]

    Lexical statistical machine translation for language migration

    Anh Tuan Nguyen, Tung Thanh Nguyen, and Tien N Nguyen. Lexical statistical machine translation for language migration. In Proceedings of the 2013 9th Joint Meeting on Foundations of Software Engineering, pp. 651–654,

  18. [18]

    Divide-and-conquer approach for multi-phase statistical migration for source code (t)

    Anh Tuan Nguyen, Tung Thanh Nguyen, and Tien N Nguyen. Divide-and-conquer approach for multi-phase statistical migration for source code (t). In 2015 30th IEEE/ACM International Conference on Automated Software Engineering (ASE), pp. 585–596. IEEE,

  19. [19]

    Deep contextualized word representations

    Matthew E Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. Deep contextualized word representations. arXiv preprint arXiv:1802.05365,

  20. [20]

    Abstract Syntax Networks for Code Generation and Semantic Parsing

    Maxim Rabinovich, Mitchell Stern, and Dan Klein. Abstract syntax networks for code generation and semantic parsing. arXiv preprint arXiv:1704.07535,

  21. [21]

    Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

    Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv preprint arXiv:1910.10683,

  22. [22]

    Towards a big data curated benchmark of inter-project code clones

    Jeffrey Svajlenko, Judith F Islam, Iman Keivanloo, Chanchal K Roy, and Mohammad Mamun Mia. Towards a big data curated benchmark of inter-project code clones. In 2014 IEEE International Conference on Software Maintenance and Evolution, pp. 476–480. IEEE,

  23. [23]

    Intellicode compose: Code generation using transformer

    Alexey Svyatkovskiy, Shao Kun Deng, Shengyu Fu, and Neel Sundaresan. Intellicode compose: Code generation using transformer. arXiv preprint arXiv:2005.08025,

  24. [24]

    Detecting code clones with graph neural network and flow-augmented abstract syntax tree

    Wenhan Wang, Ge Li, Bo Ma, Xin Xia, and Zhi Jin. Detecting code clones with graph neural network and flow-augmented abstract syntax tree. arXiv preprint arXiv:2002.08653,

  25. [25]

    Deep learning code fragments for code clone detection

    Martin White, Michele Tufano, Christopher Vendome, and Denys Poshyvanyk. Deep learning code fragments for code clone detection. In 2016 31st IEEE/ACM International Conference on Automated Software Engineering (ASE), pp. 87–98. IEEE,

  26. [26]

    XLNet: Generalized Autoregressive Pretraining for Language Understanding

    Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, and Quoc V Le. Xlnet: Generalized autoregressive pretraining for language understanding. arXiv preprint arXiv:1906.08237,

  27. [27]

    A Novel Neural Source Code Representation Based on Abstract Syntax Tree

    Jian Zhang, Xu Wang, Hongyu Zhang, Hailong Sun, Kaixuan Wang, and Xudong Liu. A novel neural source code representation based on abstract syntax tree. In 2019 IEEE/ACM 41st International Conference on Software Engineering (ICSE), pp. 783–794. IEEE,

  28. [28]

    The dataset is the CodeSearchNet dataset 3 (Husain et al., 2019), which includes 2.3M functions with document pairs for six programming languages

    to pretrain our model. The dataset is the CodeSearchNet dataset 3 (Husain et al., 2019), which includes 2.3M functions with document pairs for six programming languages. We train the model on two DGX-2 machines, each having 16 NVIDIA Tesla V100 with 32GB memory. We set the max length of sequences and nodes as 512 and 128, respectively. We use the Adam opt...

  29. [29]

    http://

    and follow Husain et al. (2019) to take the first paragraph of the documentation as the query for the corresponding function. However, we observe that some queries contain content unrelated to the code, such as a link “http://...” that refers to external resources. Therefore, we filter following examples to improve the quality of the dataset. (1) Examples w...

  30. [30]

    (4) Examples whose query is empty or not written in English

    (3) Examples whose query contains special tokens such as “http://”. (4) Examples whose query is empty or not written in English. (Footnote 3: https://github.com/github/CodeSearchNet.) Different from the setting of Husain et al. (2019), the answer of each query is retrieved from the whole development and testing code corpu...

  31. [31]

    We also report the results using the same setting of Husain et al

    We use the Adam optimizer to update model parameters and perform early stopping on the development set. We also report the results using the same setting of Husain et al. (2019) in Table

  32. [32]

    The results show that GraphCodeBERT also achieves the state-of-the-art performance

    In this setting, models are required to retrieve an answer for a query from 1000 candidates. The results show that GraphCodeBERT also achieves the state-of-the-art performance. model Ruby Javascript Go Python Java Php Overall NBow 0.429 0.461 0.641 0.581 0.514 0.484 0.518 CNN 0.245 0.352 0.627 0.571 0.527 0.529 0.475 BiRNN 0.084 0.153 0.452 0.321 0.287 0....

  33. [33]

    Therefore, two codes are semantically similar since they output similar results when given the same input

    In this example, two Java source codes both download content from a given URL and convert the type of the content into string type. Therefore, two codes are semantically similar since they output similar results when given the same input. As we can see, our model gives a high score for this case and the pair is classified as true clone pair. 13 Published a...

  34. [34]

    boolean” to “bool

    In this example, the model successfully translates a piece of Java code into its C# version. The differences include the type name (from “boolean” to “bool”) and the usage of getting a string value of a bool variable (from “String.valueOf(b)” to “b.ToString()”). Figure 7: A case of GraphCodeBERT output for the code translation task. 4http://lucene.apache....

  35. [35]

    http://kmttg.googlecode.com/svn/trunk/version

    The first source code is to return the HTML content from a given URL, while the second source code is to return the last line from a fixed URL “http://kmttg.googlecode.com/svn/trunk/version”. Their semantics are not similar due to their different outputs. Data flow could help GraphCodeBERT better understand that the return value “pageHTML” in first source cod...