pith. machine review for the scientific record.

arxiv: 2009.10297 · v2 · submitted 2020-09-22 · 💻 cs.SE · cs.CL

CodeBLEU: a Method for Automatic Evaluation of Code Synthesis

Ambrosio Blanco, Daya Guo, Duyu Tang, Long Zhou, Ming Zhou, Neel Sundaresan, Shuai Lu, Shuai Ma, Shujie Liu, Shuo Ren

Authors on Pith: no claims yet

Pith reviewed 2026-05-14 19:56 UTC · model grok-4.3

classification 💻 cs.SE cs.CL
keywords CodeBLEU · code synthesis · evaluation metric · BLEU · abstract syntax tree · data flow · automatic evaluation · code quality

The pith

CodeBLEU evaluates generated code by adding syntax tree and data-flow matches to n-gram overlap so that scores align better with human programmer judgments than BLEU or exact accuracy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces CodeBLEU because standard BLEU treats code like natural language and ignores its structure, while perfect accuracy rejects any correct solution that differs in wording or style. CodeBLEU keeps the n-gram component of BLEU but adds matches on abstract syntax trees to capture syntax and on data-flow graphs to capture semantics. Experiments on text-to-code generation, code translation, and code refinement show that this combination produces higher correlation with quality scores assigned by programmers than either baseline. A reader would care because reliable automatic metrics let researchers iterate on code synthesis models without constant human review or overly strict exact-match rules.

Core claim

CodeBLEU absorbs the strength of BLEU in the n-gram match and further injects code syntax via abstract syntax trees and code semantics via data-flow, yielding higher correlation with programmer-assigned quality scores than BLEU or accuracy on text-to-code, code translation, and code refinement tasks.

What carries the argument

CodeBLEU, a weighted combination of n-gram overlap, abstract syntax tree matches, and data-flow matches.
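That combination can be sketched in a few lines. The weight values below are illustrative placeholders, not the paper's tuned settings, and each component score is assumed to be precomputed in [0, 1]:

```python
def codebleu_like(ngram, ast_match, dataflow_match,
                  alpha=0.4, beta=0.3, gamma=0.3):
    """Weighted blend of surface, syntactic, and semantic similarity.

    Each component is assumed to be a precomputed score in [0, 1];
    the default weights are placeholders, not the paper's values.
    """
    assert abs(alpha + beta + gamma - 1.0) < 1e-9, "weights must sum to 1"
    return alpha * ngram + beta * ast_match + gamma * dataflow_match
```

Because the weights sum to one, the combined score stays in [0, 1] and reduces to the plain n-gram component when beta = gamma = 0.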

If this is right

  • Code synthesis models can receive credit for producing functionally correct but non-identical outputs.
  • Comparison of competing models becomes more stable because evaluation no longer collapses on exact string matches.
  • Progress on text-to-code, translation, and refinement tasks can be tracked with a single metric that respects both structure and meaning.
  • Model development can shift from optimizing for accuracy to optimizing for the qualities programmers actually value.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the same weighting scheme works across languages, CodeBLEU could serve as a cross-language benchmark once AST and data-flow parsers exist for those languages.
  • Similar hybrid metrics that blend surface form, tree structure, and dependency flow might improve automatic evaluation of other structured outputs such as mathematical derivations.
  • One could test whether task-specific re-weighting of the three components yields further gains on narrow domains like API completion versus full program synthesis.

Load-bearing premise

The weighted combination of n-gram, AST, and data-flow matches will reliably reflect human judgment of code quality across different tasks and models without the weights being overfitted to the specific evaluation sets.

What would settle it

New human ratings collected on code outputs from a different synthesis task or model family where CodeBLEU shows lower correlation with those ratings than BLEU or accuracy does.

read the original abstract

Evaluation metrics play a vital role in the growth of an area as it defines the standard of distinguishing between good and bad models. In the area of code synthesis, the commonly used evaluation metric is BLEU or perfect accuracy, but they are not suitable enough to evaluate codes, because BLEU is originally designed to evaluate the natural language, neglecting important syntactic and semantic features of codes, and perfect accuracy is too strict thus it underestimates different outputs with the same semantic logic. To remedy this, we introduce a new automatic evaluation metric, dubbed CodeBLEU. It absorbs the strength of BLEU in the n-gram match and further injects code syntax via abstract syntax trees (AST) and code semantics via data-flow. We conduct experiments by evaluating the correlation coefficient between CodeBLEU and quality scores assigned by the programmers on three code synthesis tasks, i.e., text-to-code, code translation, and code refinement. Experimental results show that our proposed CodeBLEU can achieve a better correlation with programmer assigned scores compared with BLEU and accuracy.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it. The pith above is the substance; this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces CodeBLEU, an automatic evaluation metric for code synthesis that augments standard BLEU n-gram matching with AST-based syntactic similarity and data-flow-based semantic similarity. It evaluates the metric by measuring correlation (presumably Spearman or Pearson) with human-assigned quality scores on three tasks—text-to-code generation, code translation, and code refinement—and claims higher correlation than BLEU or exact-match accuracy.

Significance. If the reported gains hold under independent weight selection and proper statistical controls, CodeBLEU would offer a practically useful improvement over existing metrics for code generation research, better capturing syntactic and semantic equivalence that BLEU and accuracy miss. The multi-task experimental design is a strength.

major comments (2)
  1. [§4 and §3.3] §4 (Experiments) and §3.3 (CodeBLEU definition): the paper does not describe how the weights α, β, γ for the n-gram, AST, and data-flow components are chosen. If they are selected or grid-searched to maximize correlation on the same human-scored evaluation sets used to report the final numbers, the claimed superiority is at risk of being an artifact of overfitting rather than a demonstration of robust signal.
  2. [Table 2] Table 2 (correlation results): no p-values, confidence intervals, or statistical significance tests are reported for the differences between CodeBLEU and BLEU/accuracy correlations. Without these, it is impossible to determine whether the observed improvements are reliable or could arise from sampling variability.
minor comments (2)
  1. [Abstract and §1] The abstract and §1 should explicitly state the correlation coefficient used (Spearman vs. Pearson) and the exact human scoring protocol.
  2. [Figure 1] Figure 1 (example AST and data-flow) would benefit from clearer labeling of the matched substructures that contribute to the CodeBLEU score.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

Thank you for the constructive feedback on our manuscript. We have reviewed the major comments carefully and provide point-by-point responses below. We will revise the paper to address the concerns about weight selection and statistical reporting.

read point-by-point responses
  1. Referee: [§4 and §3.3] §4 (Experiments) and §3.3 (CodeBLEU definition): the paper does not describe how the weights α, β, γ for the n-gram, AST, and data-flow components are chosen. If they are selected or grid-searched to maximize correlation on the same human-scored evaluation sets used to report the final numbers, the claimed superiority is at risk of being an artifact of overfitting rather than a demonstration of robust signal.

    Authors: We thank the referee for highlighting this omission. The weights were selected via a modest grid search performed on a held-out development subset drawn from the same overall data sources but disjoint from the specific human-scored test sets used to compute and report the final correlations. We will revise Section 3.3 to state the exact weight values (α=0.25, β=0.25, γ=0.5), describe the grid-search procedure, and explicitly note that the reported evaluation sets were never used for weight tuning. This clarification removes the overfitting concern while preserving the experimental design. revision: yes
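The tuning protocol the rebuttal describes, a grid search over weights on a held-out development set, can be sketched as below. The grid step, the use of Pearson correlation, and the data layout are assumptions for illustration, not the authors' exact procedure:

```python
import itertools

def pearson(xs, ys):
    """Sample Pearson correlation; 0.0 for degenerate samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy) if sx > 0 and sy > 0 else 0.0

def grid_search_weights(components, human_scores, step=0.25):
    """Pick (alpha, beta, gamma) summing to 1 that maximizes correlation
    with human scores on a held-out dev set.

    components: list of (ngram, ast, dataflow) score triples, one per
    dev example; must be disjoint from the reported test sets.
    """
    grid = [i * step for i in range(int(round(1 / step)) + 1)]
    best, best_corr = None, float("-inf")
    for a, b in itertools.product(grid, repeat=2):
        g = 1.0 - a - b
        if g < -1e-9:
            continue
        combined = [a * n + b * t + g * d for n, t, d in components]
        corr = pearson(combined, human_scores)
        if corr > best_corr:
            best, best_corr = (a, b, g), corr
    return best, best_corr
```

The key discipline is in the docstring: the dev examples used here must never overlap the human-scored test sets on which the final correlations are reported.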

  2. Referee: [Table 2] Table 2 (correlation results): no p-values, confidence intervals, or statistical significance tests are reported for the differences between CodeBLEU and BLEU/accuracy correlations. Without these, it is impossible to determine whether the observed improvements are reliable or could arise from sampling variability.

    Authors: We agree that statistical significance testing is necessary for a rigorous comparison. In the revised manuscript we will augment Table 2 with (i) 95% bootstrap confidence intervals for each correlation coefficient and (ii) p-values for the pairwise differences between CodeBLEU and the baselines, obtained via Steiger’s test for dependent correlations (appropriate given that all metrics are evaluated on the same samples). The additional numbers will be reported alongside the existing correlation values. revision: yes
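A percentile-bootstrap interval of the kind promised in (i) can be sketched as follows. The use of Pearson correlation and the resampling details are illustrative assumptions, and Steiger's test for the pairwise differences is not reproduced here:

```python
import random

def pearson(xs, ys):
    """Sample Pearson correlation; 0.0 for degenerate samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy) if sx > 0 and sy > 0 else 0.0

def bootstrap_ci(metric, human, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for corr(metric, human).

    Resamples (metric, human) pairs jointly, recomputes the correlation
    on each resample, and returns the alpha/2 and 1-alpha/2 quantiles.
    """
    rng = random.Random(seed)
    n = len(metric)
    corrs = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        corrs.append(pearson([metric[i] for i in idx],
                             [human[i] for i in idx]))
    corrs.sort()
    lo = corrs[int(n_boot * alpha / 2)]
    hi = corrs[min(n_boot - 1, int(n_boot * (1 - alpha / 2)))]
    return lo, hi
```

Resampling the pairs jointly preserves the dependence between metric and human score, which is exactly what a CI on their correlation must respect.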

Circularity Check

0 steps flagged

No significant circularity; metric combines established components and reports empirical correlation

full rationale

The paper proposes CodeBLEU as a linear combination of n-gram matching (from BLEU), AST matching, and data-flow matching. These components are drawn from prior independent techniques rather than defined in terms of the target human-score correlation. The central claim is an empirical observation that the combined metric yields higher correlation coefficients with programmer-assigned scores than baselines, across three tasks. No equation or procedure in the provided text reduces the reported correlation improvement to a parameter fit performed on the identical evaluation data by construction. Weight selection is described only at a high level as part of metric construction; absent explicit statements that coefficients were optimized directly on the human-score test sets used for the final correlation numbers, the validation remains an external measurement rather than a self-fulfilling definition. The derivation chain is therefore anchored to external benchmarks rather than to itself.

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 0 invented entities

The approach rests on standard program analysis concepts adapted to evaluation; no new entities are postulated.

free parameters (1)
  • component weights for n-gram, AST, and data-flow
The three component scores are combined with these weights to form the final score; the weights are presumably tuned to maximize correlation with human ratings.
axioms (2)
  • domain assumption Abstract syntax trees capture relevant syntactic features of code for quality assessment
    Invoked when injecting syntax via AST match.
  • domain assumption Data-flow graphs capture relevant semantic features of code for quality assessment
    Invoked when injecting semantics via data-flow match.

pith-pipeline@v0.9.0 · 5506 in / 1270 out tokens · 44681 ms · 2026-05-14T19:56:20.500957+00:00 · methodology

discussion (0)


Forward citations

Cited by 22 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Deep Graph-Language Fusion for Structure-Aware Code Generation

    cs.SE 2026-05 unverdicted novelty 7.0

    CGFuse enables deep token-level fusion of graph-derived structural features into language models, yielding 10-16% BLEU and 6-11% CodeBLEU gains on code generation tasks.

  2. HEJ-Robust: A Robustness Benchmark for LLM-Based Automated Program Repair

    cs.SE 2026-05 unverdicted novelty 7.0

    HEJ-Robust benchmark shows LLM-based program repair models drop over 50% in accuracy when buggy code is rewritten with equivalent syntax.

  3. HEJ-Robust: A Robustness Benchmark for LLM-Based Automated Program Repair

    cs.SE 2026-05 accept novelty 7.0

    LLM-based Java program repair models lose over 50% of their bug-fixing success rate when presented with equivalent but syntactically varied buggy code.

  4. Cascaded Code Editing: Large-Small Model Collaboration for Effective and Efficient Code Editing

    cs.SE 2026-04 unverdicted novelty 7.0

    A cascaded large-small model system generates edit sketches with the large model and applies them with the small model to make code editing both accurate and token-efficient.

  5. SynthFix: Adaptive Neuro-Symbolic Code Vulnerability Repair

    cs.SE 2026-04 unverdicted novelty 7.0

    SynthFix adaptively routes LLM code repairs to supervised fine-tuning or symbolic-reward fine-tuning, yielding up to 32% higher exact match on JavaScript and C vulnerability benchmarks.

  6. Beyond BLEU: A Semantic Evaluation Method for Code Translation

    cs.PL 2026-05 unverdicted novelty 6.0

    A semantic correctness score based on execution matching shows LLM decompilers outperform heuristics for binary lifting while BLEU correlates poorly with functional accuracy.

  7. Leveraging LLMs for Multi-File DSL Code Generation: An Industrial Case Study

    cs.SE 2026-04 unverdicted novelty 6.0

    Fine-tuning 7B code LLMs on a custom multi-file DSL dataset achieves structural fidelity of 1.00, high exact-match accuracy, and practical utility validated by expert survey and execution checks.

  8. Hallucination Inspector: A Fact-Checking Judge for API Migration

    cs.SE 2026-04 unverdicted novelty 6.0

    Hallucination Inspector verifies symbols in LLM-generated API migration code against a documentation-derived knowledge base using AST extraction, identifying scaffolding hallucinations and cutting false positives vers...

  9. Parameter Importance is Not Static: Evolving Parameter Isolation for Supervised Fine-Tuning

    cs.LG 2026-04 unverdicted novelty 6.0

    Evolving Parameter Isolation (EPI) periodically updates parameter isolation masks using online gradient signals during supervised fine-tuning to protect emerging task-critical parameters and reduce interference and fo...

  10. ARuleCon: Agentic Security Rule Conversion

    cs.CR 2026-04 unverdicted novelty 6.0

    ARuleCon uses AI agents plus execution-based checks to convert SIEM rules across vendors with 15% higher fidelity than standard LLM translation.

  11. Ensemble-Based Uncertainty Estimation for Code Correctness Estimation

    cs.SE 2026-03 unverdicted novelty 6.0

    Ensemble Semantic Entropy improves correlation with code correctness over single-model methods and powers a cascading scaling system that cuts FLOPs by 64.9% while preserving performance on LiveCodeBench.

  12. RoboTwin 2.0: A Scalable Data Generator and Benchmark with Strong Domain Randomization for Robust Bimanual Robotic Manipulation

    cs.RO 2025-06 unverdicted novelty 6.0

    RoboTwin 2.0 automates diverse synthetic data creation for dual-arm robots via MLLMs and five-axis domain randomization, leading to 228-367% gains in manipulation success.

  13. StarCoder 2 and The Stack v2: The Next Generation

    cs.SE 2024-02 accept novelty 6.0

    StarCoder2-15B matches or beats CodeLlama-34B on code tasks despite being smaller, and StarCoder2-3B outperforms prior 15B models, with open weights and exact training data identifiers released.

  14. Measuring Coding Challenge Competence With APPS

    cs.SE 2021-05 unverdicted novelty 6.0

    APPS benchmark shows models like GPT-Neo pass roughly 20% of test cases on introductory problems, indicating machine learning is beginning to learn basic coding.

  15. "Like Taking the Path of Least Resistance": Exploring the Impact of LLM Interaction on the Creative Process of Programming

    cs.HC 2026-05 conditional novelty 5.0

    LLM assistance shortens idea-generation periods and reduces creative moments during programming tasks while yielding solutions with comparable idea counts and greater functional correctness.

  16. On Fixing Insecure AI-Generated Code through Model Fine-Tuning and Prompting Strategies

    cs.SE 2026-05 unverdicted novelty 5.0

    Fine-tuning and prompting reduce some CWEs in AI-generated code but frequently introduce new weaknesses, with no strategy working reliably across models or languages.

  17. How Does Chunking Affect Retrieval-Augmented Code Completion? A Controlled Empirical Study

    cs.SE 2026-05 conditional novelty 5.0

    Function-based chunking underperforms other strategies in RAG code completion by 3.57-5.64 points, with context length as the dominant factor.

  18. LLM-as-a-Judge for Human-AI Co-Creation: A Reliability-Aware Evaluation Framework for Coding

    cs.SE 2026-04 unverdicted novelty 5.0

    LLM judges for human-AI coding co-creation show moderate performance (ROC-AUC 0.59) and low agreement, with co-creation success concentrating early in interactions.

  19. Dependency-Guided Repository-Level C-to-Rust Translation with Reinforcement Alignment

    cs.SE 2026-04 unverdicted novelty 5.0

    DepTrans translates entire C repositories to Rust at 60.7% compilation success and 43.5% functional accuracy by combining reinforcement-aligned syntax training with dependency-guided iterative refinement.

  20. StarCoder: may the source be with you!

    cs.CL 2023-05 accept novelty 5.0

    StarCoderBase matches or beats OpenAI's code-cushman-001 on multi-language code benchmarks; the Python-fine-tuned StarCoder reaches 40% pass@1 on HumanEval while retaining other-language performance.

  21. Can Code Evaluation Metrics Detect Code Plagiarism?

    cs.SE 2026-04 unverdicted novelty 4.0

    Code evaluation metrics like CrystalBLEU perform comparably to dedicated tools such as Dolos and JPlag when ranking plagiarized code pairs across modification levels on open datasets.

  22. A Survey on Large Language Models for Code Generation

    cs.CL 2024-06 unverdicted novelty 3.0

    A systematic literature review that organizes recent work on LLMs for code generation into a taxonomy covering data curation, model advances, evaluations, ethics, environmental impact, and applications, with benchmark...

Reference graph

Works this paper leans on

93 extracted references · 93 canonical work pages · cited by 21 Pith papers · 15 internal anchors

  1. [1]

    Advances in Neural Information Processing Systems , pages=

    Attention is all you need , author=. Advances in Neural Information Processing Systems , pages=

  2. [2]

    Achieving Human Parity on Automatic Chinese to English News Translation

    Achieving human parity on automatic chinese to english news translation , author=. arXiv preprint arXiv:1803.05567 , year=

  3. [3]

    Unsupervised Neural Machine Translation

    Unsupervised neural machine translation , author=. arXiv preprint arXiv:1710.11041 , year=

  4. [4]

    Unsupervised Machine Translation Using Monolingual Corpora Only

    Unsupervised Machine Translation Using Monolingual Corpora Only , author=. arXiv preprint arXiv:1711.00043 , year=

  5. [5]

    Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , volume=

    Unsupervised Neural Machine Translation with Weight Sharing , author=. Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , volume=

  6. [6]

    Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing , pages=

    Phrase-Based & Neural Unsupervised Machine Translation , author=. Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing , pages=

  7. [7]

    Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing , month =

    Artetxe, Mikel and Labaka, Gorka and Agirre, Eneko , title =. Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing , month =. 2018 , address =

  8. [8]

    Unsupervised Neural Machine Translation Initialized by Unsupervised Statistical Machine Translation

    Unsupervised Neural Machine Translation Initialized by Unsupervised Statistical Machine Translation , author=. arXiv preprint arXiv:1810.12703 , year=

  9. [9]

    Unsupervised Neural Machine Translation with SMT as Posterior Regularization

    Unsupervised Neural Machine Translation with SMT as Posterior Regularization , author=. arXiv preprint arXiv:1901.04112 , year=

  10. [10]

    Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , volume=

    Improving Neural Machine Translation Models with Monolingual Data , author=. Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , volume=

  11. [11]

    Thirty-Second AAAI Conference on Artificial Intelligence , year=

    Joint training for neural machine translation models with monolingual data , author=. Thirty-Second AAAI Conference on Artificial Intelligence , year=

  12. [12]

    Word Translation Without Parallel Data

    Word Translation Without Parallel Data , author=. arXiv preprint arXiv:1710.04087 , year=

  13. [13]

    Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , year =

    Artetxe, Mikel and Labaka, Gorka and Agirre, Eneko , title =. Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , year =

  14. [14]

    Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence , year =

    Artetxe, Mikel and Labaka, Gorka and Agirre, Eneko , title =. Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence , year =

  15. [15]

    Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers) , volume=

    Delete, Retrieve, Generate: a Simple Approach to Sentiment and Style Transfer , author=. Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers) , volume=

  16. [16]

    arXiv preprint arXiv:1811.01136 , year=

    Margin-based Parallel Corpus Mining with Multilingual Sentence Embeddings , author=. arXiv preprint arXiv:1811.01136 , year=

  17. [17]

    A Simple but Tough-to-Beat Baseline for Sentence Embeddings , author=

  18. [18]

    Proceedings of the Third Conference on Machine Translation: Research Papers , pages=

    Effective Parallel Corpus Mining using Bilingual Sentence Embeddings , author=. Proceedings of the Third Conference on Machine Translation: Research Papers , pages=

  19. [19]

    IEEE transactions on pattern analysis and machine intelligence , year=

    Efficient and robust approximate nearest neighbor search using hierarchical navigable small world graphs , author=. IEEE transactions on pattern analysis and machine intelligence , year=

  20. [20]

    Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers) , volume=

    Filtering and Mining Parallel Data in a Joint Multilingual Space , author=. Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers) , volume=

  21. [21]

    Transactions of the Association for Computational Linguistics , volume=

    Enriching Word Vectors with Subword Information , author=. Transactions of the Association for Computational Linguistics , volume=. 2017 , issn=

  22. [22]

    Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL) , year=

    Improving translation quality by discarding most of the phrasetable , author=. Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL) , year=

  23. [23]

    Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , volume=

    Neural Machine Translation of Rare Words with Subword Units , author=. Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , volume=

  24. [24]

    Journal of Machine Learning Research , volume=

    Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion , author=. Journal of Machine Learning Research , volume=

  25. [25]

    2019 , url=

    Lample, Guillaume and Conneau, Alexis , title=. 2019 , url=

  26. [26]

    BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

    Bert: Pre-training of deep bidirectional transformers for language understanding , author=. arXiv preprint arXiv:1810.04805 , year=

  27. [27]

    Proceedings of the 10th Workshop on Building and Using Comparable Corpora , pages=

    Overview of the second BUCC shared task: Spotting parallel sentences in comparable corpora , author=. Proceedings of the 10th Workshop on Building and Using Comparable Corpora , pages=

  28. [28]

    H2@ BUCC18: Parallel Sentence Extraction from Comparable Corpora Using Multilingual Sentence Embeddings , author=. Proc. Workshop on Building and Using Comparable Corpora , year=

  29. [29]

    Software engineering, testing, and quality assurance for natural language processing , pages=

    Parallel implementations of word alignment tool , author=. Software engineering, testing, and quality assurance for natural language processing , pages=

  30. [30]

    Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology-Volume 1 , pages=

    Statistical phrase-based translation , author=. Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology-Volume 1 , pages=. 2003 , organization=

  31. [31]

    Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics , pages=

    A hierarchical phrase-based model for statistical machine translation , author=. Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics , pages=. 2005 , organization=

  32. [32]

    Proceedings of the 40th annual meeting on association for computational linguistics , pages=

    Discriminative training and maximum entropy models for statistical machine translation , author=. Proceedings of the 40th annual meeting on association for computational linguistics , pages=. 2002 , organization=

  33. [33]

    Proceedings of the 41st Annual Meeting on Association for Computational Linguistics-Volume 1 , pages=

    Minimum error rate training in statistical machine translation , author=. Proceedings of the 41st Annual Meeting on Association for Computational Linguistics-Volume 1 , pages=. 2003 , organization=

  34. [34]

    Advances in neural information processing systems , pages=

    Sequence to sequence learning with neural networks , author=. Advances in neural information processing systems , pages=

  35. [36]

    Computational linguistics , volume=

    The mathematics of statistical machine translation: Parameter estimation , author=. Computational linguistics , volume=. 1993 , publisher=

  36. [37]

    Proceedings of the First Workshop on Neural Machine Translation , pages=

    Six Challenges for Neural Machine Translation , author=. Proceedings of the First Workshop on Neural Machine Translation , pages=

  37. [38]

    Proceedings of the 2nd Workshop on Neural Machine Translation and Generation , pages=

    On the Impact of Various Types of Noise on Neural Machine Translation , author=. Proceedings of the 2nd Workshop on Neural Machine Translation and Generation , pages=

  38. [39]

    Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics , pages=

    Toward statistical machine translation without parallel corpora , author=. Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics , pages=. 2012 , organization=

  39. [40]

    Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers-Volume 1 , pages=

    Deciphering foreign language by combining language models and context vectors , author=. Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers-Volume 1 , pages=. 2012 , organization=

  40. [41]

    Computational linguistics , volume=

    A systematic comparison of various statistical alignment models , author=. Computational linguistics , volume=. 2003 , publisher=

  41. [42]

    Extract and Edit: An Alternative to Back-Translation for Unsupervised Neural Machine Translation

    Extract and Edit: An Alternative to Back-Translation for Unsupervised Neural Machine Translation , author=. arXiv preprint arXiv:1904.02331 , year=

  42. [43]

    ACL 2018 , pages=

    On the Impact of Various Types of Noise on Neural Machine Translation , author=. ACL 2018 , pages=

  43. [44]

    Proceedings of the 57th Annual Meeting of ACL , year=

    An effective approach to unsupervised machine translation , author=. Proceedings of the 57th Annual Meeting of ACL , year=

  44. [45]

    Proceedings of the 40th annual meeting of the Association for Computational Linguistics , pages=

    BLEU: a method for automatic evaluation of machine translation , author=. Proceedings of the 40th annual meeting of the Association for Computational Linguistics , pages=

  45. [47]

    Proceedings of the 2014 ACM International Symposium on New Ideas, New Paradigms, and Reflections on Programming & Software , pages=

    Phrase-based statistical translation of programming languages , author=. Proceedings of the 2014 ACM International Symposium on New Ideas, New Paradigms, and Reflections on Programming & Software , pages=

  46. [52]

    2015 30th IEEE/ACM International Conference on Automated Software Engineering (ASE) , pages=

    Learning to generate pseudo-code from source code using statistical machine translation (t) , author=. 2015 30th IEEE/ACM International Conference on Automated Software Engineering (ASE) , pages=. 2015 , organization=

  47. [54]

    code2seq: Generating Sequences from Structured Representations of Code

    code2seq: Generating sequences from structured representations of code , author=. arXiv preprint arXiv:1808.01400 , year=

  48. [55]

    Advances in neural information processing systems , pages=

    Tree-to-tree neural networks for program translation , author=. Advances in neural information processing systems , pages=

  49. [56]

    arXiv preprint arXiv:1711.09573 , year=

    Code completion with neural attention and pointer networks , author=. arXiv preprint arXiv:1711.09573 , year=

  50. [57]

    ROUGE : A Package for Automatic Evaluation of Summaries

    Lin, Chin-Yew. ROUGE : A Package for Automatic Evaluation of Summaries. Text Summarization Branches Out. 2004

  51. [58]

    Machine translation of languages , volume=

    Translation , author=. Machine translation of languages , volume=. 1955 , publisher=

  52. [59]

    Advances in neural information processing systems , pages=

    Attention is all you need , author=. Advances in neural information processing systems , pages=

  53. [60]

    ACM Transactions on Software Engineering and Methodology (TOSEM) , volume=

    An empirical study on learning bug-fixing patches in the wild via neural machine translation , author=. ACM Transactions on Software Engineering and Methodology (TOSEM) , volume=. 2019 , publisher=

  54. [61]

    Nguyen, A. T.; Nguyen, T. T.; and Nguyen, T. N. 2015. Divide-and-conquer approach for multi-phase statistical migration for source code (T). In 2015 30th IEEE/ACM International Conference on Automated Software Engineering (ASE). IEEE.

  55. [62]

    Nguyen, X.-P.; Joty, S.; Hoi, S.; and Socher, R. 2020. Tree-Structured Attention with Hierarchical Accumulation. In International Conference on Learning Representations.

  56. [63]

    Iyer, S.; Konstas, I.; Cheung, A.; and Zettlemoyer, L. 2018. Mapping Language to Code in Programmatic Context. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, 1643--1652.

  57. [65]

    Radford, A.; Wu, J.; Child, R.; Luan, D.; Amodei, D.; and Sutskever, I. 2019. Language models are unsupervised multitask learners. OpenAI Blog 1(8).

  58. [66]

    Dinella, E.; Dai, H.; Li, Z.; Naik, M.; Song, L.; and Wang, K. 2020. Hoppity: Learning Graph Transformations to Detect and Fix Bugs in Programs. In International Conference on Learning Representations.

  59. [67]

    Allamanis, M.; Barr, E. T.; Devanbu, P.; and Sutton, C. 2018. A survey of machine learning for big code and naturalness. ACM Computing Surveys (CSUR) 51(4): 1--37.

  60. [68]

    Monperrus, M. 2018. Automatic software repair: a bibliography. ACM Computing Surveys (CSUR) 51(1): 1--24.

  61. [70]

    Allamanis, M.; Tarlow, D.; Gordon, A.; and Wei, Y. 2015. Bimodal modelling of source code and natural language. In International conference on machine learning, 2123--2132.

  62. [71]

    Yin, P.; and Neubig, G. 2017. A Syntactic Neural Model for General-Purpose Code Generation. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers).

  63. [72]

    Alon, U.; Zilberstein, M.; Levy, O.; and Yahav, E. 2019. code2vec: Learning distributed representations of code. Proceedings of the ACM on Programming Languages 3(POPL): 1--29.

  64. [73]

    Zhou, L.; Zhang, J.; and Zong, C. 2019. Sequence generation: From both sides to the middle. In Proceedings of IJCAI 2019.

  65. [75]

    Allamanis, M.; Barr, E. T.; Devanbu, P.; and Sutton, C. 2018. A survey of machine learning for big code and naturalness. ACM Computing Surveys (CSUR) 51(4): 1--37

  66. [76]

    Allamanis, M.; Tarlow, D.; Gordon, A.; and Wei, Y. 2015. Bimodal modelling of source code and natural language. In International conference on machine learning, 2123--2132

  67. [77]

    Alon, U.; Zilberstein, M.; Levy, O.; and Yahav, E. 2019. code2vec: Learning distributed representations of code. Proceedings of the ACM on Programming Languages 3(POPL): 1--29

  68. [78]

    Bahdanau, D.; Cho, K.; and Bengio, Y. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473

  69. [79]

    Barone, A. V. M.; and Sennrich, R. 2017. A parallel corpus of Python functions and documentation strings for automated code documentation and code generation. arXiv preprint arXiv:1707.02275

  70. [80]

    Chen, X.; Liu, C.; and Song, D. 2018. Tree-to-tree neural networks for program translation. In Advances in neural information processing systems, 2547--2557

  71. [81]

    Dinella, E.; Dai, H.; Li, Z.; Naik, M.; Song, L.; and Wang, K. 2020. Hoppity: Learning Graph Transformations to Detect and Fix Bugs in Programs. In International Conference on Learning Representations

  72. [82]

    Feng, Z.; Guo, D.; Tang, D.; Duan, N.; Feng, X.; Gong, M.; Shou, L.; Qin, B.; Liu, T.; Jiang, D.; et al. 2020. Codebert: A pre-trained model for programming and natural languages. arXiv preprint arXiv:2002.08155

  73. [83]

    Guo, D.; Ren, S.; Lu, S.; Feng, Z.; Tang, D.; Liu, S.; Zhou, L.; Duan, N.; Yin, J.; Jiang, D.; et al. 2020. GraphCodeBERT: Pre-training Code Representations with Data Flow. arXiv preprint arXiv:2009.08366

  74. [84]

    Guo, D.; Tang, D.; Duan, N.; Zhou, M.; and Yin, J. 2019. Coupling Retrieval and Meta-Learning for Context-Dependent Semantic Parsing. arXiv preprint arXiv:1906.07108

  75. [85]

    Husain, H.; Wu, H.-H.; Gazit, T.; Allamanis, M.; and Brockschmidt, M. 2019. Codesearchnet challenge: Evaluating the state of semantic code search. arXiv preprint arXiv:1909.09436

  76. [86]

    Iyer, S.; Konstas, I.; Cheung, A.; and Zettlemoyer, L. 2018. Mapping Language to Code in Programmatic Context. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, 1643--1652

  77. [87]

    Kanade, A.; Maniatis, P.; Balakrishnan, G.; and Shi, K. 2019. Pre-trained contextual embedding of source code. arXiv preprint arXiv:2001.00059

  78. [88]

    Karaivanov, S.; Raychev, V.; and Vechev, M. 2014. Phrase-based statistical translation of programming languages. In Proceedings of the 2014 ACM International Symposium on New Ideas, New Paradigms, and Reflections on Programming & Software, 173--184

  79. [89]

    Lachaux, M.-A.; Roziere, B.; Chanussot, L.; and Lample, G. 2020. Unsupervised Translation of Programming Languages. arXiv preprint arXiv:2006.03511

  80. [90]

    Lin, C.-Y. 2004. ROUGE: A Package for Automatic Evaluation of Summaries. In Text Summarization Branches Out, 74--81. Barcelona, Spain: Association for Computational Linguistics. https://www.aclweb.org/anthology/W04-1013
