arXiv: 1909.09436 · v3 · submitted 2019-09-20 · 💻 cs.LG · cs.IR · cs.SE · stat.ML

CodeSearchNet Challenge: Evaluating the State of Semantic Code Search

Hamel Husain, Ho-Hsiang Wu, Marc Brockschmidt, Miltiadis Allamanis, Tiferet Gazit

Pith reviewed 2026-05-12 16:00 UTC · model grok-4.3

classification 💻 cs.LG · cs.IR · cs.SE · stat.ML

keywords semantic code search · code retrieval · natural language queries · code corpus · benchmark · programming languages · information retrieval

The pith

Releasing the CodeSearchNet Corpus of 6 million functions and a challenge with 99 annotated queries enables evaluation of semantic code search across six languages.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes a new resource for semantic code search by collecting millions of functions from open-source projects in Go, Java, JavaScript, PHP, Python, and Ruby. It pairs many of those functions with automatically generated natural-language text scraped from documentation and adds a set of expert-labeled queries to create a concrete benchmark. This setup lets researchers train models that retrieve code snippets matching vague natural-language descriptions and compare results on the same test cases. Simple baselines are included to show initial performance and to lower the barrier for new participants. The authors intend the release to support ongoing competitions and future expansion to additional queries and languages.
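The pairing step can be pictured with a minimal sketch for Python sources only. It assumes the query-like text is the first paragraph of a function's docstring and that very short docstrings are discarded; the helper name `docstring_pairs` and the `min_tokens` threshold are hypothetical illustrations, not the authors' pipeline.

```python
import ast


def docstring_pairs(source, min_tokens=3):
    """Return (function_name, summary) pairs from Python source text.

    Assumption: the summary is the first paragraph of the docstring,
    collapsed to one line. min_tokens is an illustrative threshold.
    """
    tree = ast.parse(source)
    pairs = []
    for node in ast.walk(tree):
        if not isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            continue
        doc = ast.get_docstring(node)  # None when the function is undocumented
        if not doc:
            continue
        # Keep only the first paragraph and normalize whitespace.
        summary = " ".join(doc.split("\n\n", 1)[0].split())
        if len(summary.split()) < min_tokens:
            continue  # too short to act as a query-like description
        pairs.append((node.name, summary))
    return pairs
```

Undocumented functions stay in the corpus but contribute no pair, which is consistent with only about 2 million of the roughly 6 million functions carrying text.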

Core claim

By releasing the CodeSearchNet Corpus containing approximately 6 million functions across six programming languages together with automatically generated natural-language descriptions for two million of them, and by creating the CodeSearchNet Challenge of 99 natural-language queries annotated with roughly four thousand expert relevance judgments, the work supplies a standardized corpus and evaluation set for semantic code search.

What carries the argument

The CodeSearchNet Corpus of functions paired with scraped documentation text, plus the expert-annotated query set used to score retrieval relevance.
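How graded relevance judgments turn into a retrieval score can be sketched as follows, assuming a metric such as NDCG (normalized discounted cumulative gain). The 0 to 3 relevance scale, the exponential gain, and the function ids are illustrative assumptions; this is one common formulation, not necessarily the challenge's exact protocol.

```python
import math


def dcg(relevances):
    """Discounted cumulative gain of graded relevances in ranked order."""
    return sum(
        (2 ** rel - 1) / math.log2(rank + 2)
        for rank, rel in enumerate(relevances)
    )


def ndcg(ranked_ids, judgments):
    """NDCG for a single query.

    ranked_ids: function ids in the order the system returned them.
    judgments:  dict mapping function id -> graded relevance
                (hypothetical 0-3 scale); unjudged ids count as 0.
    """
    gains = [judgments.get(fid, 0) for fid in ranked_ids]
    ideal_dcg = dcg(sorted(judgments.values(), reverse=True))
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0
```

Averaging `ndcg` over all annotated queries would give a single leaderboard-style number; treating unjudged results as irrelevant is a design choice that penalizes systems for returning functions the experts never saw.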

Load-bearing premise

Mechanically scraped and preprocessed function documentation yields sufficiently accurate and representative natural-language queries, and the expert annotations are consistent and unbiased measures of relevance.

What would settle it

An independent check finding low agreement among experts on the same query-code pairs or showing that the scraped documentation text rarely matches how developers actually phrase searches would undermine the benchmark.

Original abstract

Semantic code search is the task of retrieving relevant code given a natural language query. While related to other information retrieval tasks, it requires bridging the gap between the language used in code (often abbreviated and highly technical) and natural language more suitable to describe vague concepts and ideas. To enable evaluation of progress on code search, we are releasing the CodeSearchNet Corpus and are presenting the CodeSearchNet Challenge, which consists of 99 natural language queries with about 4k expert relevance annotations of likely results from CodeSearchNet Corpus. The corpus contains about 6 million functions from open-source code spanning six programming languages (Go, Java, JavaScript, PHP, Python, and Ruby). The CodeSearchNet Corpus also contains automatically generated query-like natural language for 2 million functions, obtained from mechanically scraping and preprocessing associated function documentation. In this article, we describe the methodology used to obtain the corpus and expert labels, as well as a number of simple baseline solutions for the task. We hope that CodeSearchNet Challenge encourages researchers and practitioners to study this interesting task further and will host a competition and leaderboard to track the progress on the challenge. We are also keen on extending CodeSearchNet Challenge to more queries and programming languages in the future.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

This is a data release paper that gives the field a bigger multi-language code corpus and challenge set, but the lack of validation numbers on the auto-generated queries and expert labels is a real gap.

The main thing here is the release of the CodeSearchNet Corpus with roughly 6 million functions across six languages and a challenge built around 99 natural language queries plus about 4k expert relevance annotations. They also scraped documentation to create query-like text for 2 million functions and ran a few simple baselines. That combination of scale and multi-language coverage is new enough to be useful for training and evaluating semantic code search models, and the paper lays out the collection process clearly enough that others can build on it or extend it to more languages. The baselines give a starting point without overclaiming. The soft spot is exactly what the stress-test note flags: no reported metrics on inter-annotator agreement, query diversity, or how well the mechanically scraped docs match real user queries. Without those checks the 99-query test set could be noisier than it looks, which weakens how much progress the leaderboard can actually track. This is for researchers working on code retrieval, IR for software engineering, or neural code models who need a larger shared testbed. A reader can get immediate value by downloading the corpus and trying their own models against the expert labels. It deserves a serious referee because the data contribution is concrete and the field benefits from public benchmarks, even if the current version needs more evidence on label quality before it becomes the standard reference.

Referee Report

2 major / 2 minor

Summary. The paper claims to release the CodeSearchNet Corpus containing about 6 million functions from six programming languages (Go, Java, JavaScript, PHP, Python, and Ruby), with automatically generated query-like natural language descriptions for 2 million functions obtained by mechanically scraping and preprocessing associated documentation. It also introduces the CodeSearchNet Challenge consisting of 99 natural language queries with about 4k expert relevance annotations of likely results from the corpus, describes the methodology for corpus construction and labeling, and provides a number of simple baseline solutions for semantic code search, with the aim of enabling evaluation and tracking of progress via a future competition and leaderboard.

Significance. If the generated queries and expert annotations prove reliable and representative, the release of this large-scale multi-language corpus and annotated challenge would be a significant contribution to semantic code search research. It provides a standardized benchmark at a scale (6M functions, 99 queries, 4k annotations) that could facilitate model development and comparison in bridging natural language and code, similar to other information retrieval benchmarks, and the inclusion of baselines supports immediate usability.

major comments (2)

[Methodology for corpus and query generation] The methodology section describing corpus construction and query generation states that the 2 million query-like descriptions are obtained by mechanically scraping and preprocessing function documentation, but provides no quantitative validation (e.g., accuracy against human queries, diversity metrics, or fidelity checks) to support that these yield sufficiently accurate and representative natural language queries for the challenge.
[Challenge construction and expert annotation process] The section on the CodeSearchNet Challenge and expert labels describes the 99 queries and ~4k annotations but reports no inter-annotator agreement statistics, annotation consistency measures, or bias validation, which is load-bearing for the central claim that the challenge enables reliable evaluation of semantic code search progress.

minor comments (2)

[Abstract and conclusion] The abstract and conclusion mention plans to host a competition and leaderboard but do not specify the platform, timeline, or evaluation protocol details.
[Baseline solutions] The baselines are introduced as 'simple' solutions; adding implementation details or pseudocode would improve reproducibility without altering the core contribution.
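To make the reproducibility point concrete, here is a minimal sketch of what a keyword-style retrieval baseline could look like: TF-IDF vectors over sub-tokenized code, ranked by cosine similarity to the query. It is a stand-in written for illustration, not one of the paper's baselines; `tokenize`, `rank`, and the smoothed IDF formula are assumptions.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split identifiers such as read_fileName into sub-tokens."""
    spaced = re.sub(r"([a-z0-9])([A-Z])", r"\1 \2", text)
    return re.findall(r"[a-z0-9]+", spaced.lower())


def rank(query, functions):
    """Return function names sorted by TF-IDF cosine similarity to the query.

    functions: dict mapping a function name to its source text.
    """
    docs = {name: Counter(tokenize(code)) for name, code in functions.items()}
    n = len(docs)
    # Document frequency: number of functions containing each token.
    df = Counter(tok for counts in docs.values() for tok in counts)
    idf = {tok: math.log((1 + n) / (1 + d)) + 1.0 for tok, d in df.items()}

    def vec(counts):
        # Tokens unseen in the corpus get weight 0.
        return {t: c * idf.get(t, 0.0) for t, c in counts.items()}

    def cosine(u, v):
        dot = sum(w * v.get(t, 0.0) for t, w in u.items())
        nu = math.sqrt(sum(w * w for w in u.values()))
        nv = math.sqrt(sum(w * w for w in v.values()))
        return dot / (nu * nv) if nu and nv else 0.0

    q = vec(Counter(tokenize(query)))
    scores = {name: cosine(q, vec(counts)) for name, counts in docs.items()}
    return sorted(scores, key=scores.get, reverse=True)
```

A baseline of this kind only matches shared sub-tokens, which is exactly the vocabulary gap between code and natural language that the abstract describes.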

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive review. We appreciate the opportunity to clarify the distinction between the corpus construction and the challenge, and to address the points on validation and annotation reliability.

Referee: [Methodology for corpus and query generation] The methodology section describing corpus construction and query generation states that the 2 million query-like descriptions are obtained by mechanically scraping and preprocessing function documentation, but provides no quantitative validation (e.g., accuracy against human queries, diversity metrics, or fidelity checks) to support that these yield sufficiently accurate and representative natural language queries for the challenge.

Authors: We thank the referee for raising this. We wish to clarify an important distinction: the 2 million automatically generated descriptions are part of the released CodeSearchNet Corpus and serve as weak supervision for training models; they are obtained by scraping and preprocessing existing function documentation from open-source repositories. These descriptions are not the queries used in the CodeSearchNet Challenge. The challenge instead uses a separate set of 99 expert-written natural language queries, each paired with expert relevance annotations over candidate functions from the corpus. Because the corpus descriptions are mechanically derived from documentation that already exists in the source code, their representativeness is bounded by the quality of that documentation, which we describe in the methodology section. We did not include additional quantitative validation (such as diversity metrics or fidelity checks against human queries) because the primary contribution is the release of the large-scale resource itself rather than a claim that the derived descriptions perfectly match human queries. We can, however, add basic descriptive statistics on the generated descriptions (e.g., length distributions and language-specific characteristics) in a revision to improve transparency.

revision: partial

Referee: [Challenge construction and expert annotation process] The section on the CodeSearchNet Challenge and expert labels describes the 99 queries and ~4k annotations but reports no inter-annotator agreement statistics, annotation consistency measures, or bias validation, which is load-bearing for the central claim that the challenge enables reliable evaluation of semantic code search progress.

Authors: We acknowledge that reporting inter-annotator agreement would strengthen the perceived reliability of the annotations. The 99 queries were authored by the paper authors and a small group of domain experts, and the approximately 4k relevance judgments were performed by the same expert annotators following a written annotation protocol that specified relevance criteria, handling of edge cases, and tie-breaking rules. Due to the expert-only nature of the task and resource limitations, we collected only a single annotation per query–function pair and therefore do not have the data required to compute standard IAA metrics such as Cohen’s kappa or Fleiss’ kappa. We attempted to mitigate bias and inconsistency through careful query curation, pilot annotation rounds, and discussion of difficult cases among annotators. We agree this is a limitation of the current release. In the revised manuscript we will expand the description of the annotation protocol, explicitly note the absence of multiple annotations, and discuss the implications for benchmark reliability.

revision: yes
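
For reference, the agreement statistic named in this exchange is cheap to compute once a second annotator has relabeled even a subset of pairs. The sketch below implements unweighted Cohen's kappa for two annotators; the function name and the input layout (two equal-length label lists over the same query-function pairs) are assumptions for illustration, and no such doubly annotated data exists in the current release.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Unweighted Cohen's kappa for two annotators over the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items given identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1.0 - expected)
```

Because relevance grades are ordinal, a weighted kappa that penalizes near-misses less than distant disagreements may be the better fit; the unweighted form is shown only because it is the simplest instance.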

Circularity Check

0 steps flagged

No circularity: dataset release paper with no derivations or self-referential reductions

The paper's core contribution is the release of the CodeSearchNet Corpus (6M functions across 6 languages) and Challenge (99 NL queries with ~4k expert annotations), obtained by scraping documentation for 2M functions and expert labeling. No equations, fitted parameters, predictions, or derivation chains appear in the abstract or described methodology. Claims about enabling evaluation of semantic code search progress rest on the data collection process itself rather than any mathematical reduction to prior inputs. No self-citations or ansatzes are invoked as load-bearing steps. This matches the expected non-circular outcome for a pure data/benchmark paper.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical data-release paper with no mathematical derivations. It introduces no free parameters, axioms, or invented entities; all contributions rest on the described collection and annotation procedures.

Forward citations

Cited by 29 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Architecture Determines Observability of Transformers
cs.LG 2026-04 unverdicted novelty 8.0

Certain transformer architectures lose internal linear signals for decision quality during training, making observability an architecture-dependent property rather than a universal one.
Test-Time Speculation
cs.CL 2026-05 unverdicted novelty 7.0

Test-Time Speculation adapts draft models online via target-model verifications to sustain high acceptance lengths during long LLM generations.
Sink vs. diagonal patterns as mechanisms for attention switch and oversmoothing prevention
cs.LG 2026-05 unverdicted novelty 7.0

Sinks are equivalent to hard attention switches that zero out outputs and are cheaper than diagonal patterns when self-communication is allowed, closing the gap between oversmoothing prevention needs and what sinks provide.
Evaluating Non-English Developer Support in Machine Learning for Software Engineering
cs.SE 2026-05 unverdicted novelty 7.0

Code LLMs generate substantially worse comments outside English, and no tested automatic metric or LLM judge reliably matches human assessment of those outputs.
An Empirical Study of Proactive Coding Assistants in Real-World Software Development
cs.SE 2026-05 unverdicted novelty 7.0

Real developer IDE traces differ substantially from LLM simulations in behavior and structure; current proactive assistants are unreliable on real traces, and simulated data cannot substitute for real data in training.
POSTCONDBENCH: Benchmarking Correctness and Completeness in Formal Postcondition Inference
cs.SE 2026-05 unverdicted novelty 7.0

POSTCONDBENCH is a new multilingual benchmark that evaluates LLM postcondition generation on real code using defect discrimination to assess completeness beyond surface matching.
PuzzleMark: Implicit Jigsaw Learning for Robust Code Dataset Watermarking in Neural Code Completion Models
cs.SE 2026-04 unverdicted novelty 7.0

PuzzleMark provides a robust and imperceptible watermarking method for code datasets using adaptive variable name concatenation and statistical verification, achieving perfect detection rates with minimal performance impact.
RepoDoc: A Knowledge Graph-Based Framework to Automatic Documentation Generation and Incremental Updates
cs.SE 2026-04 unverdicted novelty 7.0

RepoDoc uses a repository knowledge graph with module clustering and semantic impact propagation to generate more complete documentation 3x faster with 85% fewer tokens and handle incremental updates 73% faster than p...
CodeMMR: Bridging Natural Language, Code, and Image for Unified Retrieval
cs.SE 2026-04 unverdicted novelty 7.0

CodeMMR creates a unified embedding space for text, code, and images, outperforming baselines by 10 nDCG@10 points and boosting RAG code generation quality.
Can LLMs Deobfuscate Binary Code? A Systematic Analysis of Large Language Models into Pseudocode Deobfuscation
cs.SE 2026-04 unverdicted novelty 7.0

LLM deobfuscation of binaries to pseudocode depends more on reasoning ability and task-specific fine-tuning than on model size, with reasoning models showing robustness across ISAs and obfuscation levels on the new Bi...
CodeBLEU: a Method for Automatic Evaluation of Code Synthesis
cs.SE 2020-09 conditional novelty 7.0

CodeBLEU improves correlation with human programmer scores on code synthesis tasks by adding syntactic AST matching and semantic data-flow matching to the standard BLEU n-gram approach.
Do not copy and paste! Rewriting strategies for code retrieval
cs.SE 2026-05 conditional novelty 6.0

Full natural-language rewriting of code and queries boosts retrieval on code benchmarks while corpus-only rewriting often hurts, with token entropy difference serving as a cheap predictor of gains.
A Survey of Reasoning-Intensive Retrieval: Progress and Challenges
cs.IR 2026-04 unverdicted novelty 6.0

A survey that categorizes RIR benchmarks by domain and modality, proposes a taxonomy for integrating reasoning into retrieval pipelines, and outlines key challenges.
VulStyle: A Multi-Modal Pre-Training for Code Stylometry-Augmented Vulnerability Detection
cs.CR 2026-04 unverdicted novelty 6.0

VulStyle pre-trains on 4.9M functions using code, non-terminal ASTs, and stylometry features, then fine-tunes to achieve SOTA F1 gains of 4-48% on BigVul and VulDeePecker.
Architecture Determines Observability of Transformers
cs.LG 2026-04 unverdicted novelty 6.0

Architecture and training determine whether transformers retain a readable internal signal that lets activation monitors catch errors missed by output confidence.
Less Is More: Measuring How LLM Involvement affects Chatbot Accuracy in Static Analysis
cs.SE 2026-04 unverdicted novelty 6.0

A structured JSON intermediate representation for LLM-generated static analysis queries outperforms both direct generation and agentic tool use, with gains of 15-25 percentage points on large models.
DuCodeMark: Dual-Purpose Code Dataset Watermarking via Style-Aware Watermark-Poison Design
cs.CR 2026-04 unverdicted novelty 6.0

DuCodeMark watermarks code datasets using AST style transformations and repressible poisons for both source-code and decompilation tasks, verified by t-test, with high stealth and a 28.6% performance drop if removed.
On the Role of Fault Localization Context for LLM-Based Program Repair
cs.SE 2026-04 unverdicted novelty 6.0

More fault localization context does not consistently improve LLM-based program repair; file-level context gives 15-17x gains, optimal around 6-10 files, while line-level context often degrades performance from noise.
LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code
cs.SE 2024-03 unverdicted novelty 6.0

LiveCodeBench collects 400 recent contest problems to create a contamination-free benchmark evaluating LLMs on code generation and related capabilities like self-repair and execution.
CRUXEval: A Benchmark for Code Reasoning, Understanding and Execution
cs.SE 2024-01 accept novelty 6.0

CRUXEval benchmark shows current code models including GPT-4 achieve at most 81% on input and output prediction for short Python functions, exposing gaps not captured by HumanEval.
CodeBERT: A Pre-Trained Model for Programming and Natural Languages
cs.CL 2020-02 unverdicted novelty 6.0

CodeBERT pre-trains a bimodal model on code and text pairs plus unimodal data to achieve state-of-the-art results on natural language code search and code documentation generation.
How Does Chunking Affect Retrieval-Augmented Code Completion? A Controlled Empirical Study
cs.SE 2026-05 conditional novelty 5.0

Function-based chunking underperforms other strategies in RAG code completion by 3.57-5.64 points, with context length as the dominant factor.
Towards General Text Embeddings with Multi-stage Contrastive Learning
cs.CL 2023-08 unverdicted novelty 5.0

GTE_base is a compact text embedding model using multi-stage contrastive learning on diverse data that outperforms OpenAI's API and 10x larger models on massive benchmarks and works for code as text.
StarCoder: may the source be with you!
cs.CL 2023-05 accept novelty 5.0

StarCoderBase matches or beats OpenAI's code-cushman-001 on multi-language code benchmarks; the Python-fine-tuned StarCoder reaches 40% pass@1 on HumanEval while retaining other-language performance.
Carbon-Taxed Transformers: A Green Compression Pipeline for Overgrown Language Models
cs.SE 2026-04 unverdicted novelty 4.0

CTT is a compression pipeline for LLMs that achieves up to 49x memory reduction, 10x faster inference, 81% lower CO2 emissions, and retains 68-98% accuracy on code clone detection, summarization, and generation tasks.
LLMSniffer: Detecting LLM-Generated Code via GraphCodeBERT and Supervised Contrastive Learning
cs.SE 2026-04 unverdicted novelty 4.0

LLMSniffer improves detection of LLM-generated code on GPTSniffer and Whodunit benchmarks by fine-tuning GraphCodeBERT via two-stage supervised contrastive learning plus preprocessing and MLP classification.
FLeX: Fourier-based Low-rank EXpansion for multilingual transfer
cs.LG 2026-04 unverdicted novelty 4.0

LoRA fine-tuning of Code Llama with Fourier regularization raises Java pass@1 from 34.2% to 42.1% while using a small high-quality dataset.
A Survey on Large Language Models for Code Generation
cs.CL 2024-06 unverdicted novelty 3.0

A systematic literature review that organizes recent work on LLMs for code generation into a taxonomy covering data curation, model advances, evaluations, ethics, environmental impact, and applications, with benchmark...
OpenClassGen: A Large-Scale Corpus of Real-World Python Classes for LLM Research
cs.SE 2025-04

Reference graph

Works this paper leans on

26 extracted references · 26 canonical work pages · cited by 28 Pith papers · 1 internal anchor

[1]

Miltiadis Allamanis. 2018. The Adverse Effects of Code Duplication in Machine Learning Models of Code. arXiv preprint arXiv:1812.06469 (2018)

work page arXiv 2018
[2]

Miltiadis Allamanis, Earl T Barr, Premkumar Devanbu, and Charles Sutton. 2018. A survey of machine learning for big code and naturalness. ACM Computing Surveys (CSUR) 51, 4 (2018), 81

work page 2018
[3]

Miltiadis Allamanis, Hao Peng, and Charles Sutton. 2016. A Convolutional Attention Network for Extreme Summarization of Source Code. In Proceedings of the International Conference on Machine Learning (ICML)

work page 2016
[4]

Uri Alon, Omer Levy, and Eran Yahav. 2018. code2seq: Generating sequences from structured representations of code. arXiv preprint arXiv:1808.01400 (2018)

work page Pith review arXiv 2018
[5]

Antonio Valerio Miceli Barone and Rico Sennrich. 2017. A parallel corpus of Python functions and documentation strings for automated code documentation and code generation. arXiv preprint arXiv:1707.02275 (2017)

work page Pith review arXiv 2017
[6]

Jose Cambronero, Hongyu Li, Seohyun Kim, Koushik Sen, and Satish Chandra. 2019. When Deep Learning Met Code Search. arXiv preprint arXiv:1905.03813 (2019)

work page arXiv 2019
[8]

Kyunghyun Cho, Bart van Merriënboer, Dzmitry Bahdanau, and Yoshua Bengio. 2014. On the Properties of Neural Machine Translation: Encoder–Decoder Approaches. Syntax, Semantics and Structure in Statistical Translation (2014)

work page 2014
[9]

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint (2018)

work page internal anchor Pith review Pith/arXiv arXiv 2018
[10]

Patrick Fernandes, Miltiadis Allamanis, and Marc Brockschmidt. 2018. Structured Neural Summarization. arXiv preprint arXiv:1811.01824 (2018)

work page arXiv 2018
[11]

Philip Gage. 1994. A new algorithm for data compression. The C Users Journal 12, 2 (1994), 23–38

work page 1994
[12]

Xiaodong Gu, Hongyu Zhang, and Sunghun Kim. 2018. Deep code search. In 2018 IEEE/ACM 40th International Conference on Software Engineering (ICSE). IEEE, 933–944

work page 2018
[13]

Xiaodong Gu, Hongyu Zhang, Dongmei Zhang, and Sunghun Kim. 2016. Deep API Learning. In Proceedings of the International Symposium on Foundations of Software Engineering (FSE)

work page 2016
[14]

Tatsunori B Hashimoto, Kelvin Guu, Yonatan Oren, and Percy S Liang. 2018. A retrieve-and-edit framework for predicting structured outputs. In Advances in Neural Information Processing Systems. 10073–10083

work page 2018
[15]

Srinivasan Iyer, Ioannis Konstas, Alvin Cheung, and Luke Zettlemoyer. 2018. Mapping language to code in programmatic context. arXiv preprint arXiv:1808.09588 (2018)

work page arXiv 2018
[16]

Yoon Kim. 2014. Convolutional neural networks for sentence classification. arXiv preprint arXiv:1408.5882 (2014)

work page arXiv 2014
[17]

Xi Victoria Lin, Chenglong Wang, Luke Zettlemoyer, and Michael D. Ernst. 2018. NL2Bash: A Corpus and Semantic Parser for Natural Language Interface to the Linux Operating System. In International Conference on Language Resources and Evaluation

work page 2018
[18]

Wang Ling, Edward Grefenstette, Karl Moritz Hermann, Tomas Kocisky, Andrew Senior, Fumin Wang, and Phil Blunsom. 2016. Latent Predictor Networks for Code Generation. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL)

work page 2016
[19]

Cristina V Lopes, Petr Maj, Pedro Martins, Vaibhav Saini, Di Yang, Jakub Zitny, Hitesh Sajnani, and Jan Vitek. 2017. DéjàVu: a map of code duplicates on GitHub. Proceedings of the ACM on Programming Languages 1, OOPSLA (2017), 84

work page 2017
[20]

Introduction to Information Retrieval

Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze. 2008. Introduction to Information Retrieval. Cambridge University Press

work page 2008
[21]

Bhaskar Mitra, Nick Craswell, et al. 2018. An introduction to neural information retrieval. Foundations and Trends® in Information Retrieval 13, 1 (2018), 1–126

work page 2018
[22]

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural machine translation of rare words with subword units. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL)

work page 2016
[23]

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems. 5998–6008

work page 2017
[24]

Ziyu Yao, Jayavardhan Reddy Peddamail, and Huan Sun. 2019. CoaCor: Code Annotation for Code Retrieval with Reinforcement Learning. (2019)

work page 2019
[25]

Ziyu Yao, Daniel S Weld, Wei-Peng Chen, and Huan Sun. 2018. StaQC: A Systematically Mined Question-Code Dataset from Stack Overflow. In Proceedings of the 2018 World Wide Web Conference on World Wide Web. International World Wide Web Conferences Steering Committee, 1693–1703

work page 2018
[26]

Pengcheng Yin and Graham Neubig. 2017. A Syntactic Neural Model for General-Purpose Code Generation. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL)

work page 2017