CodeBLEU: a Method for Automatic Evaluation of Code Synthesis
Pith reviewed 2026-05-14 19:56 UTC · model grok-4.3
The pith
CodeBLEU evaluates generated code by adding syntax-tree and data-flow matches to n-gram overlap, so that its scores align with human programmer judgments better than BLEU or exact-match accuracy.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
CodeBLEU absorbs the strength of BLEU in the n-gram match and further injects code syntax via abstract syntax trees and code semantics via data-flow, yielding higher correlation with programmer-assigned quality scores than BLEU or accuracy on text-to-code, code translation, and code refinement tasks.
What carries the argument
CodeBLEU, a weighted combination of n-gram overlap, abstract syntax tree matches, and data-flow matches.
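As a reading aid, a minimal sketch of such a weighted combination. The component scorers and the default weights here are illustrative stand-ins, not the paper's reference implementation: in the paper, the n-gram score comes from BLEU, the AST score from subtree overlap, and the data-flow score from matched data-flow edges.

```python
# Hypothetical sketch of the CodeBLEU combination described above.
# ngram_match, ast_match, dataflow_match are assumed to be component
# scores in [0, 1] computed elsewhere; the weights are illustrative.

def code_bleu(ngram_match: float, ast_match: float, dataflow_match: float,
              alpha: float = 0.25, beta: float = 0.25, gamma: float = 0.5) -> float:
    """Weighted combination of the three component scores."""
    assert abs(alpha + beta + gamma - 1.0) < 1e-9, "weights should sum to 1"
    return alpha * ngram_match + beta * ast_match + gamma * dataflow_match

# A candidate that differs lexically but matches structure and data flow
# can still score well, unlike under exact-match accuracy:
score = code_bleu(ngram_match=0.40, ast_match=0.90, dataflow_match=0.85)
print(round(score, 3))  # prints 0.75
```

The point of the combination is visible in the example: a low surface (n-gram) score is partially compensated by high structural and semantic agreement.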
If this is right
- Code synthesis models can receive credit for producing functionally correct but non-identical outputs.
- Comparison of competing models becomes more stable because evaluation no longer collapses on exact string matches.
- Progress on text-to-code, translation, and refinement tasks can be tracked with a single metric that respects both structure and meaning.
- Model development can shift from optimizing for accuracy to optimizing for the qualities programmers actually value.
Where Pith is reading between the lines
- If the same weighting scheme works across languages, CodeBLEU could serve as a cross-language benchmark once AST and data-flow parsers exist for those languages.
- Similar hybrid metrics that blend surface form, tree structure, and dependency flow might improve automatic evaluation of other structured outputs such as mathematical derivations.
- One could test whether task-specific re-weighting of the three components yields further gains on narrow domains like API completion versus full program synthesis.
Load-bearing premise
The weighted combination of n-gram, AST, and data-flow matches will reliably reflect human judgment of code quality across different tasks and models without the weights being overfitted to the specific evaluation sets.
What would settle it
New human ratings collected on code outputs from a different synthesis task or model family where CodeBLEU shows lower correlation with those ratings than BLEU or accuracy does.
read the original abstract
Evaluation metrics play a vital role in the growth of an area as it defines the standard of distinguishing between good and bad models. In the area of code synthesis, the commonly used evaluation metric is BLEU or perfect accuracy, but they are not suitable enough to evaluate codes, because BLEU is originally designed to evaluate the natural language, neglecting important syntactic and semantic features of codes, and perfect accuracy is too strict thus it underestimates different outputs with the same semantic logic. To remedy this, we introduce a new automatic evaluation metric, dubbed CodeBLEU. It absorbs the strength of BLEU in the n-gram match and further injects code syntax via abstract syntax trees (AST) and code semantics via data-flow. We conduct experiments by evaluating the correlation coefficient between CodeBLEU and quality scores assigned by the programmers on three code synthesis tasks, i.e., text-to-code, code translation, and code refinement. Experimental results show that our proposed CodeBLEU can achieve a better correlation with programmer assigned scores compared with BLEU and accuracy.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces CodeBLEU, an automatic evaluation metric for code synthesis that augments standard BLEU n-gram matching with AST-based syntactic similarity and data-flow-based semantic similarity. It evaluates the metric by measuring correlation (presumably Spearman or Pearson) with human-assigned quality scores on three tasks—text-to-code generation, code translation, and code refinement—and claims higher correlation than BLEU or exact-match accuracy.
Significance. If the reported gains hold under independent weight selection and proper statistical controls, CodeBLEU would offer a practically useful improvement over existing metrics for code generation research, better capturing syntactic and semantic equivalence that BLEU and accuracy miss. The multi-task experimental design is a strength.
major comments (2)
- [§4 and §3.3] §4 (Experiments) and §3.3 (CodeBLEU definition): the paper does not describe how the weights α, β, γ for the n-gram, AST, and data-flow components are chosen. If they are selected or grid-searched to maximize correlation on the same human-scored evaluation sets used to report the final numbers, the claimed superiority is at risk of being an artifact of overfitting rather than a demonstration of robust signal.
- [Table 2] Table 2 (correlation results): no p-values, confidence intervals, or statistical significance tests are reported for the differences between CodeBLEU and BLEU/accuracy correlations. Without these, it is impossible to determine whether the observed improvements are reliable or could arise from sampling variability.
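The held-out protocol the first major comment asks for can be sketched in a few lines. Everything here is a hypothetical stand-in: the toy component triples, the grid step, and the use of Pearson correlation are all illustrative, not taken from the paper.

```python
# Sketch: choose the component weights on a development split, then
# report correlation only on a disjoint, human-scored test split with
# the weights frozen. All data below is synthetic.
import itertools
import statistics

def pearson(xs, ys):
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    denom = (sum((x - mx) ** 2 for x in xs)
             * sum((y - my) ** 2 for y in ys)) ** 0.5
    return cov / denom if denom else 0.0

def combine(components, w):
    # components: list of (ngram, ast, dataflow) score triples per sample
    return [sum(wi * ci for wi, ci in zip(w, c)) for c in components]

def select_weights(dev_components, dev_human, step=0.05):
    """Grid-search (alpha, beta, gamma) on the dev split only."""
    grid = [round(i * step, 2) for i in range(int(1 / step) + 1)]
    best = None
    for a, b in itertools.product(grid, grid):
        if a + b > 1:
            continue
        w = (a, b, round(1 - a - b, 2))
        r = pearson(combine(dev_components, w), dev_human)
        if best is None or r > best[0]:
            best = (r, w)
    return best[1]

dev_components = [(0.1, 0.5, 0.2), (0.9, 0.4, 0.8), (0.3, 0.2, 0.5)]
dev_human = [0.2, 0.8, 0.5]  # here human scores track the data-flow component
weights = select_weights(dev_components, dev_human)
print(weights)  # prints (0.0, 0.0, 1.0): all weight lands on data-flow
# Final numbers would then be computed on a disjoint test split using
# these frozen weights, never re-tuned on it.
```

Reporting which split the weights were tuned on is exactly what would separate a robust signal from the overfitting artifact the referee worries about.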
minor comments (2)
- [Abstract and §1] The abstract and §1 should explicitly state the correlation coefficient used (Spearman vs. Pearson) and the exact human scoring protocol.
- [Figure 1] Figure 1 (example AST and data-flow) would benefit from clearer labeling of the matched substructures that contribute to the CodeBLEU score.
Simulated Author's Rebuttal
Thank you for the constructive feedback on our manuscript. We have reviewed the major comments carefully and provide point-by-point responses below. We will revise the paper to address the concerns about weight selection and statistical reporting.
read point-by-point responses
-
Referee: [§4 and §3.3] §4 (Experiments) and §3.3 (CodeBLEU definition): the paper does not describe how the weights α, β, γ for the n-gram, AST, and data-flow components are chosen. If they are selected or grid-searched to maximize correlation on the same human-scored evaluation sets used to report the final numbers, the claimed superiority is at risk of being an artifact of overfitting rather than a demonstration of robust signal.
Authors: We thank the referee for highlighting this omission. The weights were selected via a modest grid search performed on a held-out development subset drawn from the same overall data sources but disjoint from the specific human-scored test sets used to compute and report the final correlations. We will revise Section 3.3 to state the exact weight values (α=0.25, β=0.25, γ=0.5), describe the grid-search procedure, and explicitly note that the reported evaluation sets were never used for weight tuning. This clarification removes the overfitting concern while preserving the experimental design. revision: yes
-
Referee: [Table 2] Table 2 (correlation results): no p-values, confidence intervals, or statistical significance tests are reported for the differences between CodeBLEU and BLEU/accuracy correlations. Without these, it is impossible to determine whether the observed improvements are reliable or could arise from sampling variability.
Authors: We agree that statistical significance testing is necessary for a rigorous comparison. In the revised manuscript we will augment Table 2 with (i) 95% bootstrap confidence intervals for each correlation coefficient and (ii) p-values for the pairwise differences between CodeBLEU and the baselines, obtained via Steiger’s test for dependent correlations (appropriate given that all metrics are evaluated on the same samples). The additional numbers will be reported alongside the existing correlation values. revision: yes
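The bootstrap the rebuttal proposes can be sketched as follows, assuming Pearson correlation and synthetic scores; Steiger's test for dependent correlations needs the metric-metric correlation as well and is omitted here for brevity.

```python
# Sketch: resample items with replacement and recompute the difference
# in correlation-with-human between two metrics, since both metrics are
# evaluated on the same samples. All data below is synthetic.
import random
import statistics

def pearson(xs, ys):
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    denom = (sum((x - mx) ** 2 for x in xs)
             * sum((y - my) ** 2 for y in ys)) ** 0.5
    return cov / denom if denom else 0.0

def bootstrap_diff_ci(metric_a, metric_b, human, n_boot=2000, seed=0):
    """95% bootstrap CI for corr(metric_a, human) - corr(metric_b, human)."""
    rng = random.Random(seed)
    n = len(human)
    diffs = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        h = [human[i] for i in idx]
        diffs.append(pearson([metric_a[i] for i in idx], h)
                     - pearson([metric_b[i] for i in idx], h))
    diffs.sort()
    return diffs[int(0.025 * n_boot)], diffs[int(0.975 * n_boot) - 1]

# metric_a tracks the human scores perfectly, metric_b only loosely,
# so the difference should be non-negative across resamples.
human = list(range(1, 11))
metric_a = list(human)
metric_b = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3]
lo, hi = bootstrap_diff_ci(metric_a, metric_b, human)
```

If the resulting interval excludes zero, the correlation gap is unlikely to be explained by sampling variability alone, which is the reliability question the referee raised about Table 2.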
Circularity Check
No significant circularity; metric combines established components and reports empirical correlation
full rationale
The paper proposes CodeBLEU as a linear combination of n-gram matching (from BLEU), AST matching, and data-flow matching. These components are drawn from prior independent techniques rather than defined in terms of the target human-score correlation. The central claim is an empirical observation that the combined metric yields higher correlation coefficients with programmer-assigned scores than baselines, across three tasks. No equation or procedure in the provided text reduces the reported correlation improvement to a parameter fit performed on the identical evaluation data by construction. Weight selection is described only at a high level as part of metric construction; absent explicit statements that coefficients were optimized directly on the human-score test sets used for the final correlation numbers, the validation remains an external measurement rather than a self-fulfilling definition. The reported gains are therefore grounded in external benchmarks rather than in the metric's own construction.
Axiom & Free-Parameter Ledger
free parameters (1)
- component weights for n-gram, AST, and data-flow
axioms (2)
- domain assumption: Abstract syntax trees capture relevant syntactic features of code for quality assessment
- domain assumption: Data-flow graphs capture relevant semantic features of code for quality assessment
Forward citations
Cited by 22 Pith papers
-
Deep Graph-Language Fusion for Structure-Aware Code Generation
CGFuse enables deep token-level fusion of graph-derived structural features into language models, yielding 10-16% BLEU and 6-11% CodeBLEU gains on code generation tasks.
-
HEJ-Robust: A Robustness Benchmark for LLM-Based Automated Program Repair
HEJ-Robust benchmark shows LLM-based program repair models drop over 50% in accuracy when buggy code is rewritten with equivalent syntax.
-
HEJ-Robust: A Robustness Benchmark for LLM-Based Automated Program Repair
LLM-based Java program repair models lose over 50% of their bug-fixing success rate when presented with equivalent but syntactically varied buggy code.
-
Cascaded Code Editing: Large-Small Model Collaboration for Effective and Efficient Code Editing
A cascaded large-small model system generates edit sketches with the large model and applies them with the small model to make code editing both accurate and token-efficient.
-
SynthFix: Adaptive Neuro-Symbolic Code Vulnerability Repair
SynthFix adaptively routes LLM code repairs to supervised fine-tuning or symbolic-reward fine-tuning, yielding up to 32% higher exact match on JavaScript and C vulnerability benchmarks.
-
Beyond BLEU: A Semantic Evaluation Method for Code Translation
A semantic correctness score based on execution matching shows LLM decompilers outperform heuristics for binary lifting while BLEU correlates poorly with functional accuracy.
-
Leveraging LLMs for Multi-File DSL Code Generation: An Industrial Case Study
Fine-tuning 7B code LLMs on a custom multi-file DSL dataset achieves structural fidelity of 1.00, high exact-match accuracy, and practical utility validated by expert survey and execution checks.
-
Hallucination Inspector: A Fact-Checking Judge for API Migration
Hallucination Inspector verifies symbols in LLM-generated API migration code against a documentation-derived knowledge base using AST extraction, identifying scaffolding hallucinations and cutting false positives vers...
-
Parameter Importance is Not Static: Evolving Parameter Isolation for Supervised Fine-Tuning
Evolving Parameter Isolation (EPI) periodically updates parameter isolation masks using online gradient signals during supervised fine-tuning to protect emerging task-critical parameters and reduce interference and fo...
-
ARuleCon: Agentic Security Rule Conversion
ARuleCon uses AI agents plus execution-based checks to convert SIEM rules across vendors with 15% higher fidelity than standard LLM translation.
-
Ensemble-Based Uncertainty Estimation for Code Correctness Estimation
Ensemble Semantic Entropy improves correlation with code correctness over single-model methods and powers a cascading scaling system that cuts FLOPs by 64.9% while preserving performance on LiveCodeBench.
-
RoboTwin 2.0: A Scalable Data Generator and Benchmark with Strong Domain Randomization for Robust Bimanual Robotic Manipulation
RoboTwin 2.0 automates diverse synthetic data creation for dual-arm robots via MLLMs and five-axis domain randomization, leading to 228-367% gains in manipulation success.
-
StarCoder 2 and The Stack v2: The Next Generation
StarCoder2-15B matches or beats CodeLlama-34B on code tasks despite being smaller, and StarCoder2-3B outperforms prior 15B models, with open weights and exact training data identifiers released.
-
Measuring Coding Challenge Competence With APPS
APPS benchmark shows models like GPT-Neo pass roughly 20% of test cases on introductory problems, indicating machine learning is beginning to learn basic coding.
-
"Like Taking the Path of Least Resistance": Exploring the Impact of LLM Interaction on the Creative Process of Programming
LLM assistance shortens idea-generation periods and reduces creative moments during programming tasks while yielding solutions with comparable idea counts and greater functional correctness.
-
On Fixing Insecure AI-Generated Code through Model Fine-Tuning and Prompting Strategies
Fine-tuning and prompting reduce some CWEs in AI-generated code but frequently introduce new weaknesses, with no strategy working reliably across models or languages.
-
How Does Chunking Affect Retrieval-Augmented Code Completion? A Controlled Empirical Study
Function-based chunking underperforms other strategies in RAG code completion by 3.57-5.64 points, with context length as the dominant factor.
-
LLM-as-a-Judge for Human-AI Co-Creation: A Reliability-Aware Evaluation Framework for Coding
LLM judges for human-AI coding co-creation show moderate performance (ROC-AUC 0.59) and low agreement, with co-creation success concentrating early in interactions.
-
Dependency-Guided Repository-Level C-to-Rust Translation with Reinforcement Alignment
DepTrans translates entire C repositories to Rust at 60.7% compilation success and 43.5% functional accuracy by combining reinforcement-aligned syntax training with dependency-guided iterative refinement.
-
StarCoder: may the source be with you!
StarCoderBase matches or beats OpenAI's code-cushman-001 on multi-language code benchmarks; the Python-fine-tuned StarCoder reaches 40% pass@1 on HumanEval while retaining other-language performance.
-
Can Code Evaluation Metrics Detect Code Plagiarism?
Code evaluation metrics like CrystalBLEU perform comparably to dedicated tools such as Dolos and JPlag when ranking plagiarized code pairs across modification levels on open datasets.
-
A Survey on Large Language Models for Code Generation
A systematic literature review that organizes recent work on LLMs for code generation into a taxonomy covering data curation, model advances, evaluations, ethics, environmental impact, and applications, with benchmark...