CodeBLEU: a Method for Automatic Evaluation of Code Synthesis
Pith reviewed 2026-05-14 19:56 UTC · model grok-4.3
The pith
CodeBLEU evaluates generated code by adding syntax-tree and data-flow matches to n-gram overlap, so that its scores align with human programmer judgments better than BLEU or exact-match accuracy.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
CodeBLEU absorbs the strength of BLEU in the n-gram match and further injects code syntax via abstract syntax trees and code semantics via data-flow, yielding higher correlation with programmer-assigned quality scores than BLEU or accuracy on text-to-code, code translation, and code refinement tasks.
What carries the argument
CodeBLEU, a weighted combination of n-gram overlap, abstract syntax tree matches, and data-flow matches.
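As a reading aid, a minimal sketch of such a weighted combination. The component scorers and the default weights here are illustrative stand-ins, not the paper's reference implementation: in the paper, the n-gram score comes from BLEU, the AST score from subtree overlap, and the data-flow score from matched data-flow edges.

```python
# Hypothetical sketch of the CodeBLEU combination described above.
# ngram_match, ast_match, dataflow_match are assumed to be component
# scores in [0, 1] computed elsewhere; the weights are illustrative.

def code_bleu(ngram_match: float, ast_match: float, dataflow_match: float,
              alpha: float = 0.25, beta: float = 0.25, gamma: float = 0.5) -> float:
    """Weighted combination of the three component scores."""
    assert abs(alpha + beta + gamma - 1.0) < 1e-9, "weights should sum to 1"
    return alpha * ngram_match + beta * ast_match + gamma * dataflow_match

# A candidate that differs lexically but matches structure and data flow
# can still score well, unlike under exact-match accuracy:
score = code_bleu(ngram_match=0.40, ast_match=0.90, dataflow_match=0.85)
print(round(score, 3))  # prints 0.75
```

The point of the combination is visible in the example: a low surface (n-gram) score is partially compensated by high structural and semantic agreement.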
If this is right
- Code synthesis models can receive credit for producing functionally correct but non-identical outputs.
- Comparison of competing models becomes more stable because evaluation no longer collapses on exact string matches.
- Progress on text-to-code, translation, and refinement tasks can be tracked with a single metric that respects both structure and meaning.
- Model development can shift from optimizing for accuracy to optimizing for the qualities programmers actually value.
Where Pith is reading between the lines
- If the same weighting scheme works across languages, CodeBLEU could serve as a cross-language benchmark once AST and data-flow parsers exist for those languages.
- Similar hybrid metrics that blend surface form, tree structure, and dependency flow might improve automatic evaluation of other structured outputs such as mathematical derivations.
- One could test whether task-specific re-weighting of the three components yields further gains on narrow domains like API completion versus full program synthesis.
Load-bearing premise
The weighted combination of n-gram, AST, and data-flow matches will reliably reflect human judgment of code quality across different tasks and models without the weights being overfitted to the specific evaluation sets.
What would settle it
New human ratings collected on code outputs from a different synthesis task or model family where CodeBLEU shows lower correlation with those ratings than BLEU or accuracy does.
read the original abstract
Evaluation metrics play a vital role in the growth of an area as it defines the standard of distinguishing between good and bad models. In the area of code synthesis, the commonly used evaluation metric is BLEU or perfect accuracy, but they are not suitable enough to evaluate codes, because BLEU is originally designed to evaluate the natural language, neglecting important syntactic and semantic features of codes, and perfect accuracy is too strict thus it underestimates different outputs with the same semantic logic. To remedy this, we introduce a new automatic evaluation metric, dubbed CodeBLEU. It absorbs the strength of BLEU in the n-gram match and further injects code syntax via abstract syntax trees (AST) and code semantics via data-flow. We conduct experiments by evaluating the correlation coefficient between CodeBLEU and quality scores assigned by the programmers on three code synthesis tasks, i.e., text-to-code, code translation, and code refinement. Experimental results show that our proposed CodeBLEU can achieve a better correlation with programmer assigned scores compared with BLEU and accuracy.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces CodeBLEU, an automatic evaluation metric for code synthesis that augments standard BLEU n-gram matching with AST-based syntactic similarity and data-flow-based semantic similarity. It evaluates the metric by measuring correlation (presumably Spearman or Pearson) with human-assigned quality scores on three tasks—text-to-code generation, code translation, and code refinement—and claims higher correlation than BLEU or exact-match accuracy.
Significance. If the reported gains hold under independent weight selection and proper statistical controls, CodeBLEU would offer a practically useful improvement over existing metrics for code generation research, better capturing syntactic and semantic equivalence that BLEU and accuracy miss. The multi-task experimental design is a strength.
major comments (2)
- [§4 and §3.3] §4 (Experiments) and §3.3 (CodeBLEU definition): the paper does not describe how the weights α, β, γ for the n-gram, AST, and data-flow components are chosen. If they are selected or grid-searched to maximize correlation on the same human-scored evaluation sets used to report the final numbers, the claimed superiority is at risk of being an artifact of overfitting rather than a demonstration of robust signal.
- [Table 2] Table 2 (correlation results): no p-values, confidence intervals, or statistical significance tests are reported for the differences between CodeBLEU and BLEU/accuracy correlations. Without these, it is impossible to determine whether the observed improvements are reliable or could arise from sampling variability.
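The held-out protocol the first major comment asks for can be sketched in a few lines. Everything here is a hypothetical stand-in: the toy component triples, the grid step, and the use of Pearson correlation are all illustrative, not taken from the paper.

```python
# Sketch: choose the component weights on a development split, then
# report correlation only on a disjoint, human-scored test split with
# the weights frozen. All data below is synthetic.
import itertools
import statistics

def pearson(xs, ys):
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    denom = (sum((x - mx) ** 2 for x in xs)
             * sum((y - my) ** 2 for y in ys)) ** 0.5
    return cov / denom if denom else 0.0

def combine(components, w):
    # components: list of (ngram, ast, dataflow) score triples per sample
    return [sum(wi * ci for wi, ci in zip(w, c)) for c in components]

def select_weights(dev_components, dev_human, step=0.05):
    """Grid-search (alpha, beta, gamma) on the dev split only."""
    grid = [round(i * step, 2) for i in range(int(1 / step) + 1)]
    best = None
    for a, b in itertools.product(grid, grid):
        if a + b > 1:
            continue
        w = (a, b, round(1 - a - b, 2))
        r = pearson(combine(dev_components, w), dev_human)
        if best is None or r > best[0]:
            best = (r, w)
    return best[1]

dev_components = [(0.1, 0.5, 0.2), (0.9, 0.4, 0.8), (0.3, 0.2, 0.5)]
dev_human = [0.2, 0.8, 0.5]  # here human scores track the data-flow component
weights = select_weights(dev_components, dev_human)
print(weights)  # prints (0.0, 0.0, 1.0): all weight lands on data-flow
# Final numbers would then be computed on a disjoint test split using
# these frozen weights, never re-tuned on it.
```

Reporting which split the weights were tuned on is exactly what would separate a robust signal from the overfitting artifact the referee worries about.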
minor comments (2)
- [Abstract and §1] The abstract and §1 should explicitly state the correlation coefficient used (Spearman vs. Pearson) and the exact human scoring protocol.
- [Figure 1] Figure 1 (example AST and data-flow) would benefit from clearer labeling of the matched substructures that contribute to the CodeBLEU score.
Simulated Author's Rebuttal
Thank you for the constructive feedback on our manuscript. We have reviewed the major comments carefully and provide point-by-point responses below. We will revise the paper to address the concerns about weight selection and statistical reporting.
read point-by-point responses
-
Referee: [§4 and §3.3] §4 (Experiments) and §3.3 (CodeBLEU definition): the paper does not describe how the weights α, β, γ for the n-gram, AST, and data-flow components are chosen. If they are selected or grid-searched to maximize correlation on the same human-scored evaluation sets used to report the final numbers, the claimed superiority is at risk of being an artifact of overfitting rather than a demonstration of robust signal.
Authors: We thank the referee for highlighting this omission. The weights were selected via a modest grid search performed on a held-out development subset drawn from the same overall data sources but disjoint from the specific human-scored test sets used to compute and report the final correlations. We will revise Section 3.3 to state the exact weight values (α=0.25, β=0.25, γ=0.5), describe the grid-search procedure, and explicitly note that the reported evaluation sets were never used for weight tuning. This clarification removes the overfitting concern while preserving the experimental design. revision: yes
-
Referee: [Table 2] Table 2 (correlation results): no p-values, confidence intervals, or statistical significance tests are reported for the differences between CodeBLEU and BLEU/accuracy correlations. Without these, it is impossible to determine whether the observed improvements are reliable or could arise from sampling variability.
Authors: We agree that statistical significance testing is necessary for a rigorous comparison. In the revised manuscript we will augment Table 2 with (i) 95% bootstrap confidence intervals for each correlation coefficient and (ii) p-values for the pairwise differences between CodeBLEU and the baselines, obtained via Steiger’s test for dependent correlations (appropriate given that all metrics are evaluated on the same samples). The additional numbers will be reported alongside the existing correlation values. revision: yes
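The bootstrap the rebuttal proposes can be sketched as follows, assuming Pearson correlation and synthetic scores; Steiger's test for dependent correlations needs the metric-metric correlation as well and is omitted here for brevity.

```python
# Sketch: resample items with replacement and recompute the difference
# in correlation-with-human between two metrics, since both metrics are
# evaluated on the same samples. All data below is synthetic.
import random
import statistics

def pearson(xs, ys):
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    denom = (sum((x - mx) ** 2 for x in xs)
             * sum((y - my) ** 2 for y in ys)) ** 0.5
    return cov / denom if denom else 0.0

def bootstrap_diff_ci(metric_a, metric_b, human, n_boot=2000, seed=0):
    """95% bootstrap CI for corr(metric_a, human) - corr(metric_b, human)."""
    rng = random.Random(seed)
    n = len(human)
    diffs = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        h = [human[i] for i in idx]
        diffs.append(pearson([metric_a[i] for i in idx], h)
                     - pearson([metric_b[i] for i in idx], h))
    diffs.sort()
    return diffs[int(0.025 * n_boot)], diffs[int(0.975 * n_boot) - 1]

# metric_a tracks the human scores perfectly, metric_b only loosely,
# so the difference should be non-negative across resamples.
human = list(range(1, 11))
metric_a = list(human)
metric_b = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3]
lo, hi = bootstrap_diff_ci(metric_a, metric_b, human)
```

If the resulting interval excludes zero, the correlation gap is unlikely to be explained by sampling variability alone, which is the reliability question the referee raised about Table 2.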
Circularity Check
No significant circularity; metric combines established components and reports empirical correlation
full rationale
The paper proposes CodeBLEU as a linear combination of n-gram matching (from BLEU), AST matching, and data-flow matching. These components are drawn from prior independent techniques rather than defined in terms of the target human-score correlation. The central claim is an empirical observation that the combined metric yields higher correlation coefficients with programmer-assigned scores than baselines, across three tasks. No equation or procedure in the provided text reduces the reported correlation improvement to a parameter fit performed on the identical evaluation data by construction. Weight selection is described only at a high level as part of metric construction; absent explicit statements that coefficients were optimized directly on the human-score test sets used for the final correlation numbers, the validation remains an external measurement rather than a self-fulfilling definition. The reported gains are therefore grounded in external benchmarks rather than in the metric's own construction.
Axiom & Free-Parameter Ledger
free parameters (1)
- component weights for n-gram, AST, and data-flow
axioms (2)
- domain assumption: Abstract syntax trees capture relevant syntactic features of code for quality assessment
- domain assumption: Data-flow graphs capture relevant semantic features of code for quality assessment
Forward citations
Cited by 22 Pith papers
-
Deep Graph-Language Fusion for Structure-Aware Code Generation
CGFuse enables deep token-level fusion of graph-derived structural features into language models, yielding 10-16% BLEU and 6-11% CodeBLEU gains on code generation tasks.
-
HEJ-Robust: A Robustness Benchmark for LLM-Based Automated Program Repair
HEJ-Robust benchmark shows LLM-based program repair models drop over 50% in accuracy when buggy code is rewritten with equivalent syntax.
-
HEJ-Robust: A Robustness Benchmark for LLM-Based Automated Program Repair
LLM-based Java program repair models lose over 50% of their bug-fixing success rate when presented with equivalent but syntactically varied buggy code.
-
Cascaded Code Editing: Large-Small Model Collaboration for Effective and Efficient Code Editing
A cascaded large-small model system generates edit sketches with the large model and applies them with the small model to make code editing both accurate and token-efficient.
-
SynthFix: Adaptive Neuro-Symbolic Code Vulnerability Repair
SynthFix adaptively routes LLM code repairs to supervised fine-tuning or symbolic-reward fine-tuning, yielding up to 32% higher exact match on JavaScript and C vulnerability benchmarks.
-
Beyond BLEU: A Semantic Evaluation Method for Code Translation
A semantic correctness score based on execution matching shows LLM decompilers outperform heuristics for binary lifting while BLEU correlates poorly with functional accuracy.
-
Leveraging LLMs for Multi-File DSL Code Generation: An Industrial Case Study
Fine-tuning 7B code LLMs on a custom multi-file DSL dataset achieves structural fidelity of 1.00, high exact-match accuracy, and practical utility validated by expert survey and execution checks.
-
Hallucination Inspector: A Fact-Checking Judge for API Migration
Hallucination Inspector verifies symbols in LLM-generated API migration code against a documentation-derived knowledge base using AST extraction, identifying scaffolding hallucinations and cutting false positives vers...
-
Parameter Importance is Not Static: Evolving Parameter Isolation for Supervised Fine-Tuning
Evolving Parameter Isolation (EPI) periodically updates parameter isolation masks using online gradient signals during supervised fine-tuning to protect emerging task-critical parameters and reduce interference and fo...
-
ARuleCon: Agentic Security Rule Conversion
ARuleCon uses AI agents plus execution-based checks to convert SIEM rules across vendors with 15% higher fidelity than standard LLM translation.
-
Ensemble-Based Uncertainty Estimation for Code Correctness Estimation
Ensemble Semantic Entropy improves correlation with code correctness over single-model methods and powers a cascading scaling system that cuts FLOPs by 64.9% while preserving performance on LiveCodeBench.
-
RoboTwin 2.0: A Scalable Data Generator and Benchmark with Strong Domain Randomization for Robust Bimanual Robotic Manipulation
RoboTwin 2.0 automates diverse synthetic data creation for dual-arm robots via MLLMs and five-axis domain randomization, leading to 228-367% gains in manipulation success.
-
StarCoder 2 and The Stack v2: The Next Generation
StarCoder2-15B matches or beats CodeLlama-34B on code tasks despite being smaller, and StarCoder2-3B outperforms prior 15B models, with open weights and exact training data identifiers released.
-
Measuring Coding Challenge Competence With APPS
APPS benchmark shows models like GPT-Neo pass roughly 20% of test cases on introductory problems, indicating machine learning is beginning to learn basic coding.
-
"Like Taking the Path of Least Resistance": Exploring the Impact of LLM Interaction on the Creative Process of Programming
LLM assistance shortens idea-generation periods and reduces creative moments during programming tasks while yielding solutions with comparable idea counts and greater functional correctness.
-
On Fixing Insecure AI-Generated Code through Model Fine-Tuning and Prompting Strategies
Fine-tuning and prompting reduce some CWEs in AI-generated code but frequently introduce new weaknesses, with no strategy working reliably across models or languages.
-
How Does Chunking Affect Retrieval-Augmented Code Completion? A Controlled Empirical Study
Function-based chunking underperforms other strategies in RAG code completion by 3.57-5.64 points, with context length as the dominant factor.
-
LLM-as-a-Judge for Human-AI Co-Creation: A Reliability-Aware Evaluation Framework for Coding
LLM judges for human-AI coding co-creation show moderate performance (ROC-AUC 0.59) and low agreement, with co-creation success concentrating early in interactions.
-
Dependency-Guided Repository-Level C-to-Rust Translation with Reinforcement Alignment
DepTrans translates entire C repositories to Rust at 60.7% compilation success and 43.5% functional accuracy by combining reinforcement-aligned syntax training with dependency-guided iterative refinement.
-
StarCoder: may the source be with you!
StarCoderBase matches or beats OpenAI's code-cushman-001 on multi-language code benchmarks; the Python-fine-tuned StarCoder reaches 40% pass@1 on HumanEval while retaining other-language performance.
-
Can Code Evaluation Metrics Detect Code Plagiarism?
Code evaluation metrics like CrystalBLEU perform comparably to dedicated tools such as Dolos and JPlag when ranking plagiarized code pairs across modification levels on open datasets.
-
A Survey on Large Language Models for Code Generation
A systematic literature review that organizes recent work on LLMs for code generation into a taxonomy covering data curation, model advances, evaluations, ethics, environmental impact, and applications, with benchmark...