Measuring Coding Challenge Competence With APPS
Pith reviewed 2026-05-11 17:05 UTC · model grok-4.3
The pith
The APPS benchmark shows machine learning models are beginning to learn coding by passing roughly 20 percent of test cases on introductory problems.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
APPS contains 10,000 problems ranging from simple one-line solutions to substantial algorithmic challenges. By evaluating generated code on test cases, the benchmark finds that recent models such as GPT-Neo pass approximately 20% of the test cases on introductory problems. After models are fine-tuned on GitHub and the APPS training set, the prevalence of syntax errors decreases exponentially as models improve.
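As a rough illustration of how syntax-error prevalence could be measured, the sketch below counts generated programs that fail to compile. The helper names (`has_syntax_error`, `syntax_error_rate`) and the sample programs are hypothetical and are not taken from the paper; the paper's own tooling may differ.

```python
def has_syntax_error(source: str) -> bool:
    """Return True if `source` cannot be compiled as Python (it is never executed)."""
    try:
        compile(source, "<generated>", "exec")
    except (SyntaxError, ValueError):
        # ValueError covers sources containing null bytes on some Python versions.
        return True
    return False


def syntax_error_rate(samples: list[str]) -> float:
    """Fraction of samples that fail to compile."""
    if not samples:
        raise ValueError("need at least one sample")
    return sum(has_syntax_error(s) for s in samples) / len(samples)


# Hypothetical model outputs, for illustration only.
samples = [
    "print(sum(map(int, input().split())))",
    "for i in range(3) print(i)",      # missing colon
    "def f(x):\n    return x + 1",
    "print('unterminated",             # unterminated string literal
]
rate = syntax_error_rate(samples)
print(f"syntax error rate: {rate:.2f}")
```

Tracking this rate across model sizes or checkpoints is one way the reported exponential trend could be checked.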
What carries the argument
The APPS benchmark evaluates code generation models by executing their Python outputs against hidden test cases that check conformance to the natural language problem specifications.
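A minimal sketch of this kind of test-case evaluation follows, assuming problems that read from standard input and write to standard output. The function names and the toy problem are hypothetical, and the in-process `exec` is only for brevity: a real harness would isolate untrusted code in a separate process with time and memory limits.

```python
import contextlib
import io
import sys


def run_candidate(source: str, stdin_text: str):
    """Run `source` with `stdin_text` as stdin; return its stdout, or None on error."""
    stdout = io.StringIO()
    old_stdin = sys.stdin
    sys.stdin = io.StringIO(stdin_text)
    try:
        with contextlib.redirect_stdout(stdout):
            exec(compile(source, "<candidate>", "exec"), {"__name__": "__main__"})
    except (Exception, SystemExit):
        # Syntax errors, runtime errors, and early exits all count as failures here.
        return None
    finally:
        sys.stdin = old_stdin
    return stdout.getvalue()


def fraction_passed(source: str, test_cases) -> float:
    """Fraction of (input, expected output) pairs the candidate reproduces."""
    passed = 0
    for stdin_text, expected in test_cases:
        out = run_candidate(source, stdin_text)
        # Compare token-wise so trailing whitespace does not matter.
        if out is not None and out.split() == expected.split():
            passed += 1
    return passed / len(test_cases)


# Toy problem (hypothetical): print the sum of two integers.
tests = [("1 2\n", "3\n"), ("10 -4\n", "6\n"), ("0 0\n", "0\n")]
candidate = "a, b = map(int, input().split())\nprint(a + b)\n"
wrong_candidate = "print(int(input().split()[0]))\n"

print(fraction_passed(candidate, tests))
print(fraction_passed(wrong_candidate, tests))
```

Averaging `fraction_passed` over problems gives a test-case pass rate of the kind quoted above; requiring it to equal 1.0 on a problem is a stricter, all-or-nothing variant.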
Load-bearing premise
Success on the provided test cases for each problem means the generated code satisfies the original natural language specification.
What would settle it
A model that passes all test cases on a problem yet produces code that fails to match the natural language intent on some untested input, or sustained inability of models to exceed low single-digit percentages even after larger-scale training.
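The first scenario can be made concrete with a toy case. The sketch below uses a hypothetical leap-year task, a hypothetical generated solution, and a hypothetical test suite (none of them from APPS) to show code that passes every supplied test while violating the specification on an untested input.

```python
def is_leap_generated(year: int) -> bool:
    # Hypothetical model output: ignores the century rules in the specification.
    return year % 4 == 0


def is_leap_reference(year: int) -> bool:
    # Intended behaviour: Gregorian leap-year rule.
    return year % 4 == 0 and (year % 100 != 0 or year % 400 == 0)


# Hypothetical supplied test suite: it never probes a non-leap century year.
provided_tests = [(2020, True), (2023, False), (2024, True), (2000, True)]

passes_all = all(is_leap_generated(y) == expected for y, expected in provided_tests)
disagrees_on_1900 = is_leap_generated(1900) != is_leap_reference(1900)

print("passes all provided tests:", passes_all)
print("wrong on untested input 1900:", disagrees_on_1900)
```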
Original abstract
While programming is one of the most broadly applicable skills in modern society, modern machine learning models still cannot code solutions to basic problems. Despite its importance, there has been surprisingly little work on evaluating code generation, and it can be difficult to accurately assess code generation performance rigorously. To meet this challenge, we introduce APPS, a benchmark for code generation. Unlike prior work in more restricted settings, our benchmark measures the ability of models to take an arbitrary natural language specification and generate satisfactory Python code. Similar to how companies assess candidate software developers, we then evaluate models by checking their generated code on test cases. Our benchmark includes 10,000 problems, which range from having simple one-line solutions to being substantial algorithmic challenges. We fine-tune large language models on both GitHub and our training set, and we find that the prevalence of syntax errors is decreasing exponentially as models improve. Recent models such as GPT-Neo can pass approximately 20% of the test cases of introductory problems, so we find that machine learning models are now beginning to learn how to code. As the social significance of automatic code generation increases over the coming years, our benchmark can provide an important measure for tracking advancements.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces the APPS benchmark consisting of 10,000 natural language programming problems paired with test cases to evaluate code generation from arbitrary specifications. Models are fine-tuned on GitHub and APPS training data; the authors report an exponential reduction in syntax errors with improving models and that GPT-Neo passes approximately 20% of test cases on introductory problems, concluding that machine learning models are beginning to learn how to code.
Significance. The creation of a large-scale benchmark with problems spanning simple one-line solutions to substantial algorithmic challenges is a valuable contribution for tracking progress in code generation. The empirical observation of exponential syntax-error reduction provides a concrete, falsifiable trend. If the test-case protocol is shown to be robust, the 20% pass-rate baseline on introductory problems offers a useful reference point for future work in automatic code synthesis.
major comments (2)
- [Abstract and Evaluation] Abstract and Evaluation section: The headline claim that models are 'now beginning to learn how to code' rests on GPT-Neo passing ~20% of test cases for introductory problems. The manuscript supplies no information on data splits, test-case coverage of the natural-language specifications, or statistical controls (e.g., variance across random seeds or runs), leaving the robustness of the reported figure unclear.
- [Benchmark description] Benchmark description: The central inference that passing the supplied test cases indicates the generated code satisfies the original specification is load-bearing, yet the paper reports no audit of test-suite completeness, no adversarial augmentation of test cases, and no measurement of cases where code passes the given tests but fails on plausible unseen inputs consistent with the specification.
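The statistical control requested in the first comment could be as simple as reporting the spread of the pass rate over repeated runs. The sketch below assumes five runs with different seeds; the per-seed values are invented for illustration and are not results from the paper.

```python
import statistics

# Hypothetical test-case pass rates from five seeds (illustrative values only).
pass_rates = [0.19, 0.21, 0.20, 0.22, 0.18]

mean = statistics.mean(pass_rates)
stdev = statistics.stdev(pass_rates)       # sample standard deviation
sem = stdev / len(pass_rates) ** 0.5       # standard error of the mean

print(f"mean={mean:.3f} stdev={stdev:.4f} sem={sem:.4f}")
```

Reporting the mean together with the standard deviation or standard error lets readers judge whether differences between models exceed run-to-run noise.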
minor comments (2)
- [Results] Tables summarizing pass rates broken down by problem difficulty (introductory/interview/competition) would improve readability and allow readers to assess trends more precisely.
- [Experiments] A brief discussion of potential data leakage between the GitHub pre-training corpus and the APPS test set would strengthen the experimental protocol.
Simulated Author's Rebuttal
We thank the referee for their constructive comments and for recognizing the value of the APPS benchmark. We address each major comment point by point below, indicating where revisions will be made to improve clarity and acknowledge limitations.
Point-by-point responses
-
Referee: [Abstract and Evaluation] Abstract and Evaluation section: The headline claim that models are 'now beginning to learn how to code' rests on GPT-Neo passing ~20% of test cases for introductory problems. The manuscript supplies no information on data splits, test-case coverage of the natural-language specifications, or statistical controls (e.g., variance across random seeds or runs), leaving the robustness of the reported figure unclear.
Authors: We agree that additional details would improve the robustness of the reported results. The data splits are described in Section 3, which specifies the 5,000/5,000 train/test division of the APPS problems. Test cases originate from the source competitive programming platforms and are intended to cover the natural language specifications, though we will add an explicit statement to this effect. For statistical controls, our primary results are from single runs; we will include a brief analysis of variance across random seeds in the revised Evaluation section. We will also update the abstract to include a short qualifier referencing these details. These changes will be incorporated in the next version. revision: yes
-
Referee: [Benchmark description] Benchmark description: The central inference that passing the supplied test cases indicates the generated code satisfies the original specification is load-bearing, yet the paper reports no audit of test-suite completeness, no adversarial augmentation of test cases, and no measurement of cases where code passes the given tests but fails on plausible unseen inputs consistent with the specification.
Authors: We acknowledge that test-case evaluation has inherent limitations and does not constitute a formal proof of correctness for all inputs. The paper does not include an audit of test-suite completeness or adversarial augmentation, as the focus is on establishing the benchmark and initial baselines rather than exhaustive verification. We will add a dedicated paragraph in the Discussion section noting this limitation, clarifying that passing the provided tests is the standard proxy used in code generation research (analogous to human assessment), and suggesting adversarial testing as an avenue for future work. No new empirical measurements will be performed for this revision. revision: partial
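One inexpensive form of the adversarial testing mentioned in this exchange is randomized differential testing against a trusted reference solution. The sketch below is an assumption-laden illustration: the task (largest gap in a list), both functions, and the input generator are hypothetical, and it is not a procedure described in the paper.

```python
import random


def reference_max_gap(xs):
    """Trusted solution: difference between the largest and smallest element."""
    return max(xs) - min(xs)


def candidate_max_gap(xs):
    # Hypothetical generated code: silently assumes the input is sorted.
    return xs[-1] - xs[0]


def find_counterexample(candidate, reference, trials=1000, seed=0):
    """Return a random input on which the two functions disagree, or None."""
    rng = random.Random(seed)
    for _ in range(trials):
        xs = [rng.randint(-50, 50) for _ in range(rng.randint(1, 6))]
        if candidate(xs) != reference(xs):
            return xs
    return None


counterexample = find_counterexample(candidate_max_gap, reference_max_gap)
print("counterexample:", counterexample)
```

A search that returns `None` does not prove correctness; it only means no disagreement was found within the sampled inputs.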
Circularity Check
No circularity in derivation chain
Full rationale
The paper introduces the APPS benchmark with 10,000 natural-language problems and associated test cases defined independently of any model. It then evaluates models by generating Python code from the problem statements and measuring pass rates on the fixed test suites, reporting empirical results such as GPT-Neo passing approximately 20% of test cases on introductory problems. This is a direct measurement against external test cases rather than any derivation, fitted parameter, or self-referential equation; the claim that models are beginning to learn to code is an interpretation of these observed pass rates. No load-bearing self-citations, ansatzes, or uniqueness theorems are invoked, and the evaluation protocol does not reduce to its own inputs by construction.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Test cases supplied with each problem are sufficient to determine whether generated code satisfies the natural language specification.
Lean theorems connected to this paper
-
Foundation.LawOfExistence defect_zero_iff_one (tag: unclear)
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: "our benchmark measures the ability of models to take an arbitrary natural language specification and generate satisfactory Python code... we evaluate models by checking their generated code on test cases"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 59 Pith papers
-
AgentBench: Evaluating LLMs as Agents
AgentBench is a new multi-environment benchmark showing commercial LLMs outperform open-source models up to 70B parameters in agent tasks mainly due to better long-term reasoning and instruction following.
-
WebGameBench: Requirement-to-Application Evaluation for Coding Agents via Browser-Native Games
WebGameBench is a new benchmark that evaluates coding agents on building browser-native games from frozen specifications, with runtime browser evaluation showing best agents reach 76.9% usable rate but only 20.2% exce...
-
WebGameBench: Requirement-to-Application Evaluation for Coding Agents via Browser-Native Games
WebGameBench is a benchmark that evaluates coding agents by having them generate browser-native games from specifications, then running those games in a real browser to assign EXCELLENT, USABLE, or UNUSABLE labels, wi...
-
SaaSBench: Exploring the Boundaries of Coding Agents in Long-Horizon Enterprise SaaS Engineering
SaaSBench introduces a heterogeneous benchmark for enterprise SaaS engineering and shows that state-of-the-art coding agents fail over 95% of the time before reaching deep business logic due to setup and integration problems.
-
Text-to-CAD Evaluation with CADTests
Introduces CADTestBench as a test-based benchmark for Text-to-CAD and shows that using CADTests to guide generation produces simple baselines outperforming prior methods.
-
GameGen-Verifier: Parallel Keypoint-Based Verification for LLM-Generated Games via Runtime State Injection
GameGen-Verifier decomposes game specifications into keypoints, injects runtime states for targeted checks, and achieves 92.2% accuracy on 100 games while running up to 16.6x faster than agent-based baselines.
-
ProgramBench: Can Language Models Rebuild Programs From Scratch?
ProgramBench introduces 200 tasks where models must reconstruct full programs like FFmpeg or SQLite from docs alone; none of 9 evaluated LMs fully solve any task and the best passes 95% tests on only 3% of tasks while...
-
ARIADNE: Agentic Reward-Informed Adaptive Decision Exploration via Blackboard-Driven MCTS for Competitive Program Generation
ARIADNE combines blackboard architecture with MCTS to coordinate strategy, code, test, evaluation, and repair stages, yielding higher Pass@1 scores than prior LLM baselines on APPS, CodeContests, and related benchmarks.
-
Assessing the Impact of Requirement Ambiguity on LLM-based Function-Level Code Generation
Orchid benchmark shows requirement ambiguity degrades LLM code generation performance across all models, with advanced models hit hardest, and LLMs rarely detect or resolve the ambiguity themselves.
-
PlayCoder: Making LLM-Generated GUI Code Playable
PlayCoder raises the rate of LLM-generated GUI apps that can be played end-to-end without logic errors from near zero to 20.3% Play@3 by adding repository-aware generation, agent-driven testing, and iterative repair.
-
Self-Consistency from Only Two Samples: CoT-PoT Ensembling for Efficient LLM Reasoning
CoT-PoT ensembling achieves self-consistency accuracy in LLMs with only two samples for 78.6% of tasks, reducing computation by 9.3x compared to standard methods.
-
Precise Debugging Benchmark: Is Your Model Debugging or Regenerating?
The Precise Debugging Benchmark reveals that frontier LLMs achieve over 76% unit-test pass rates but below 45% edit precision when debugging, often regenerating rather than making minimal fixes.
-
Precise Debugging Benchmark: Is Your Model Debugging or Regenerating?
Frontier LLMs pass unit tests over 76% of the time on debugging tasks but achieve edit precision below 45%, indicating regeneration rather than precise debugging.
-
Structural Anchors and Reasoning Fragility: Understanding CoT Robustness in LLM4Code
CoT prompting in LLM4Code shows mixed robustness that depends on model family, task structure, and perturbations destabilizing structural anchors, leading to trajectory deformations like lengthening, branching, and si...
-
AdverMCTS: Combating Pseudo-Correctness in Code Generation via Adversarial Monte Carlo Tree Search
AdverMCTS frames code generation as a minimax game where an attacker evolves tests to expose flaws in solver-generated code, yielding more robust outputs than static-test baselines.
-
Coding with Eyes: Visual Feedback Unlocks Reliable GUI Code Generating and Debugging
VF-Coder raises GUI code success rate from 21.68% to 28.29% and visual score from 0.4284 to 0.5584 on a new 984-task benchmark by adding direct visual perception and interaction.
-
SWE-EVO: Benchmarking Coding Agents in Long-Horizon Software Evolution Scenarios
SWE-EVO shows GPT-5.4 with OpenHands reaching only 25% success on complex multi-file evolution tasks versus 72.8% on SWE-Bench Verified, and introduces Fix Rate as a partial-progress metric.
-
CodeRL+: Improving Code Generation via Reinforcement with Execution Semantics Alignment
CodeRL+ integrates variable-level execution trajectory inference into RLVR training to align textual code representations with execution semantics, delivering 4.6% relative pass@1 gains and generalization to code-reas...
-
Do AI Models Dream of Faster Code? An Empirical Study on LLM-Proposed Performance Improvements in Real-World Software
LLMs propose volatile performance improvements on real-world Java tasks that lag human developers on average, showing algorithmic benchmarks overestimate capabilities.
-
Orak: A Foundational Benchmark for Training and Evaluating LLM Agents on Diverse Video Games
Orak is a foundational benchmark providing training data, interfaces, and evaluation tools for LLM agents across diverse video game genres.
-
CodeFlowBench: A Multi-turn, Iterative Benchmark for Complex Code Generation
CodeFlowBench is a new benchmark with 5000+ problems and GitHub-sourced repos that evaluates LLMs on multi-turn code reuse using dependency-tree structural metrics, revealing performance drops as complexity rises.
-
OpenClassGen: A Large-Scale Corpus of Real-World Python Classes for LLM Research
OpenClassGen supplies 324,843 real-world Python classes with self-contained skeletons and static metrics to support LLM class generation research and evaluation.
-
MLE-bench: Evaluating Machine Learning Agents on Machine Learning Engineering
MLE-bench evaluates frontier language models as ML engineering agents on 75 Kaggle competitions, with the top setup (o1-preview + AIDE) reaching bronze medal level in 16.9% of tasks.
-
Holistic Evaluation of Language Models
HELM establishes a multi-metric evaluation covering 30 language models on 42 scenarios (16 core) to raise average scenario coverage from 17.9% to 96% under uniform conditions while releasing all prompts, completions, ...
-
Design and Report Benchmarks for Knowledge Work
Proposes a three-step benchmark design method (define work activity, specify tested setting, score work product) derived from work studies and O*NET, demonstrated via three case analyses.
-
Ensemble Monitoring for AI Control: Diverse Signals Outweigh More Compute
Diverse ensembles of prompted and fine-tuned GPT-4.1-Mini monitors achieve 2.4x better detection of flawed code solutions than homogeneous ensembles on adversarial inputs.
-
Uncertainty Quantification for LLM-based Code Generation
RisCoSet applies multiple hypothesis testing to construct risk-controlling partial-program prediction sets for LLM code generation, achieving up to 24.5% less code removal than prior methods at equivalent risk levels.
-
DARE: Difficulty-Adaptive Reinforcement Learning with Co-Evolved Difficulty Estimation
DARE co-evolves difficulty estimation and policy in RL for LLMs to improve training efficiency, final performance, and inference speed by using tailored strategies for different difficulty levels.
-
Using Semantic Distance to Estimate Uncertainty in LLM-Based Code Generation
Semantic distance on program execution behaviors improves uncertainty estimation for LLM code generation and outperforms prior sample-based methods across benchmarks and models.
-
VeriContest: A Competitive-Programming Benchmark for Verifiable Code Generation
VeriContest supplies 946 problems with specs, code, proofs, and tests to benchmark verifiable code generation in Rust/Verus, showing models reach 92% on code but only 5% end-to-end on full verifiable synthesis.
-
Beyond Pairs: Your Language Model is Secretly Optimizing a Preference Graph
GraphDPO generalizes pairwise DPO to a graph-structured Plackett-Luce objective over DAGs induced by rollout rankings, enforcing transitivity with linear complexity and recovering DPO as a special case.
-
PERSA: Reinforcement Learning for Professor-Style Personalized Feedback with LLMs
PERSA combines RLHF with selective parameter-efficient updates to top transformer layers, raising style alignment scores from 35% to 96% on code feedback benchmarks while holding correctness near 100%.
-
Improving LLM Code Generation via Requirement-Aware Curriculum Reinforcement Learning
REC RL improves LLM code generation by automatically assessing and optimizing requirement difficulty with adaptive curriculum sampling, yielding 1.23-5.62% Pass@1 gains over baselines.
-
RealBench: A Repo-Level Code Generation Benchmark Aligned with Real-World Software Development Practices
RealBench is a new repo-level code generation benchmark that adds UML diagrams to natural language specs, showing LLMs struggle more at full repositories, create modules with errors, and perform best with whole-repo g...
-
Generalization in LLM Problem Solving: The Case of the Shortest Path
LLMs show strong spatial generalization to unseen maps in shortest-path tasks but fail length scaling due to recursive instability, with data coverage setting hard limits.
-
Babbling Suppression: Making LLMs Greener One Token at a Time
Babbling Suppression stops LLM code generation upon test passage to reduce token output and energy consumption by up to 65% across Python and Java benchmarks.
-
GDPO: Group reward-Decoupled Normalization Policy Optimization for Multi-reward RL Optimization
GDPO decouples per-reward normalization in multi-reward RL to avoid advantage collapse and improve convergence over GRPO on tool-calling, math, and coding tasks.
-
Mixture-of-Experts Can Surpass Dense LLMs Under Strictly Equal Resource
MoE models with activation rates in an optimal region outperform dense LLMs of identical total parameter count, training compute, and data budget, with the optimal region consistent across scales.
-
A Study of LLMs' Preferences for Libraries and Programming Languages
Empirical study of eight LLMs finds overuse of popular libraries like NumPy in up to 45% of unnecessary cases and strong default preference for Python even when suboptimal.
-
Process Reinforcement through Implicit Rewards
PRIME enables online process reward model updates in LLM RL using implicit rewards from rollouts and outcome labels, yielding 15.1% average gains on reasoning benchmarks and surpassing a stronger instruct model with 1...
-
Better & Faster Large Language Models via Multi-token Prediction
Multi-token prediction training yields higher sample efficiency, better benchmark scores on code generation, and up to 3x faster inference than standard next-token prediction for LLMs.
-
Smaug: Fixing Failure Modes of Preference Optimisation with DPO-Positive
DPOP is a new loss function that prevents DPO from lowering preferred response likelihoods and outperforms standard DPO on diverse datasets, MT-Bench, and enables Smaug-72B to exceed 80% on the Open LLM Leaderboard.
-
Program Synthesis with Large Language Models
Large language models synthesize Python code from descriptions with log-linear scaling in performance, reaching 59.6% on MBPP via few-shot prompting and 83.8% on MathQA-Python after fine-tuning, while human feedback h...
-
Can LLMs Produce Better Object-Oriented Designs than Human-Involved Development?
Comparative case study on a postgraduate Java assignment finds PureAI and PostAI projects simpler with lower code smell density than PreAI but show oversimplification and weaker responsibility separation.
-
Prompt Optimization for LLM Code Generation via Reinforcement Learning
A PPO agent with hybrid actions and test-driven rewards optimizes prompts for code LLMs, raising strict Pass@1 scores on MBPP+, HumanEval+, and APPS over prior methods.
-
A-ProS: Towards Reliable Autonomous Programming Through Multi-Model Feedback
A-ProS uses a hybrid multi-model feedback framework with stateful refinement to improve success rates on competitive programming problems, achieving over 2x gains compared to baseline agent loops.
-
Interactive Evaluation Requires a Design Science
Interactive evaluation of AI must be reframed as a distinct paradigm that maps interaction trajectories to judgments on process, recoverability, coordination, robustness, and system performance, supported by a two-axi...
-
Not All RAGs Are Created Equal: A Component-Wise Empirical Study for Software Engineering Tasks
Retriever-side choices, particularly the retrieval algorithm, exert more influence on RAG performance than generator selection across code generation, summarization, and repair tasks.
-
HyperLens: Quantifying Cognitive Effort in LLMs with Fine-grained Confidence Trajectory
HyperLens reveals that deeper transformer layers magnify small confidence changes into fine-grained trajectories, allowing quantification of cognitive effort where complex tasks demand more and standard SFT can reduce it.
-
Bridging the Gap between User Intent and LLM: A Requirement Alignment Approach for Code Generation
REA-Coder improves LLM code generation by iteratively aligning requirements with model understanding and verifying outputs against the aligned spec.
-
Towards Understanding, Analyzing, and Optimizing Agentic AI Execution: A CPU-Centric Perspective
The paper analyzes CPU bottlenecks in agentic AI serving, selects representative workloads, and demonstrates that CPU-aware scheduling optimizations COMB and MAS can reduce P50 latency by up to 1.7x and total latency ...
-
Mitigating Lost in Multi-turn Conversation via Curriculum RL with Verifiable Accuracy and Abstention Rewards
RLAAR applies competence-gated curriculum RL with mixed accuracy and abstention rewards to reduce Lost-in-Conversation degradation, raising benchmark accuracy from 62.6% to 75.1% and calibrated abstention from 33.5% to 73.4%.
-
Humanity's Last Exam
Humanity's Last Exam is a new 2,500-question benchmark at the frontier of human knowledge where state-of-the-art LLMs show low accuracy.
-
StarCoder: may the source be with you!
StarCoderBase matches or beats OpenAI's code-cushman-001 on multi-language code benchmarks; the Python-fine-tuned StarCoder reaches 40% pass@1 on HumanEval while retaining other-language performance.
-
Exploring Pass-Rate Reward in Reinforcement Learning for Code Generation
Pass-rate rewards in critic-free RL for code generation fail to outperform binary rewards because partial-pass solutions induce conflicting gradient directions that do not consistently favor full correctness.
-
How Many Tries Does It Take? Iterative Self-Repair in LLM Code Generation Across Model Scales and Benchmarks
Iterative self-repair improves LLM code pass rates by 4.9-17.1 pp on HumanEval and 16-30 pp on MBPP across seven models, with gains concentrated early and syntax errors easier to fix than logical ones.
-
FLeX: Fourier-based Low-rank EXpansion for multilingual transfer
LoRA fine-tuning of Code Llama with Fourier regularization raises Java pass@1 from 34.2% to 42.1% while using a small high-quality dataset.
-
LLM-Based Multi-Agent Systems for Code Generation: A Multi-Vocal Literature Review
A review of 114 studies classifies motivations into nine categories, analyzes common models and benchmarks, synthesizes challenges into six categories with 26 subcategories and solutions, and identifies six future res...
-
Towards Large Reasoning Models: A Survey of Reinforced Reasoning with Large Language Models
The paper surveys reinforced reasoning techniques for LLMs, covering automated data construction, learning-to-reason methods, and test-time scaling as steps toward Large Reasoning Models.
Reference graph
Works this paper leans on
-
[1]
Mining source code repositories at massive scale using language modeling
Miltiadis Allamanis and Charles Sutton. Mining source code repositories at massive scale using language modeling. In 2013 10th Working Conference on Mining Software Repositories (MSR) , pages 207–216. IEEE,
-
[2]
Sygus-comp 2018: Results and analysis
Rajeev Alur, Dana Fisman, Saswat Padhi, Rishabh Singh, and Abhishek Udupa. Sygus-comp 2018: Results and analysis. SYNT,
-
[3]
Language Models are Few-Shot Learners
URL https://doi.org/10.5281/zenodo.5297715. If you use this software, please cite it using these metadata. T. Brown, B. Mann, Nick Ryder, Melanie Subbiah, J. Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, G. Krüger, T. Henighan, R. Child, Aditya Ramesh, D. Ziegler, Jeffrey...
-
[4]
Evaluating Large Language Models Trained on Code
Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde, Jared Kaplan, Harri Edwards, Yura Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374,
-
[5]
The Pile: An 800GB Dataset of Diverse Text for Language Modeling
Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshima, et al. The pile: An 800gb dataset of diverse text for language modeling. arXiv preprint arXiv:2101.00027,
-
[6]
Timnit Gebru, Jamie Morgenstern, Briana Vecchione, Jennifer Wortman Vaughan, Hanna Wallach, Hal Daumé III, and Kate Crawford. Datasheets for datasets. arXiv preprint arXiv:1803.09010,
-
[7]
Measuring Mathematical Problem Solving With the MATH Dataset
Dan Hendrycks, Collin Burns, Steven Basart, Andrew Critch, Jerry Li, Dawn Song, and Jacob Steinhardt. Aligning AI with shared human values. Proceedings of the International Conference on Learning Representations (ICLR), 2021a. Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask...
-
[8]
Mapping language to code in programmatic context
Srinivasan Iyer, Ioannis Konstas, Alvin Cheung, and Luke Zettlemoyer. Mapping language to code in programmatic context. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, October-November
-
[9]
Unsupervised translation of programming languages
Marie-Anne Lachaux, Baptiste Roziere, Lowik Chanussot, and Guillaume Lample. Unsupervised translation of programming languages. arXiv preprint arXiv:2006.03511,
-
[10]
W. Ling, P. Blunsom, Edward Grefenstette, K. Hermann, Tomás Kociský, Fumin Wang, and A. Senior. Latent predictor networks for code generation. ArXiv, abs/1603.06744,
-
[11]
Generative Language Modeling for Automated Theorem Proving
Stanislas Polu and Ilya Sutskever. Generative language modeling for automated theorem proving. ArXiv, abs/2009.03393,
-
[12]
Veselin Raychev, Pavol Bielik, and Martin T. Vechev. Probabilistic model for code with decision trees. Proceedings of the 2016 ACM SIGPLAN International Conference on Object-Oriented Programming, Systems, Languages, and Applications,
-
[13]
CodeBLEU: a Method for Automatic Evaluation of Code Synthesis
Shuo Ren, Daya Guo, Shuai Lu, L. Zhou, Shujie Liu, Duyu Tang, M. Zhou, A. Blanco, and S. Ma. Codebleu: a method for automatic evaluation of code synthesis. ArXiv, abs/2009.10297,
-
[14]
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, L. Kaiser, and Illia Polosukhin. Attention is all you need. ArXiv, abs/1706.03762,
-
[15]
Alex Wang, Yada Pruksachatkun, Nikita Nangia, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. Superglue: A stickier benchmark for general-purpose language understanding systems. In NeurIPS, 2019a. Alex Wang, Yada Pruksachatkun, Nikita Nangia, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. Superglue...
-
[16]
Tao Yu, Rui Zhang, Kai Yang, Michihiro Yasunaga, Dongxu Wang, Zifan Li, James Ma, Irene Li, Qingning Yao, Shanelle Roman, et al. Spider: A large-scale human-labeled dataset for complex and cross-domain semantic parsing and text-to-sql task. arXiv preprint arXiv:1809.08887,
-
[17]
Table 4: Further comparisons of APPS with previous datasets.
| | Hearthstone | Django | NAPS | APPS |
| --- | --- | --- | --- | --- |
| Programming Language | Python | Python | UAST | Python |
| Test Cases | | | | |
| Number of Programs | 665 | 18,805 | 17,477 | 232,421 |
| Lines per Program (Avg.) | 7.7 | 1 | 21.7 | 18.0 |
| Number of Exercises | 665 | 18,805 | 2,231 | 10,000 |
| Text Input | Card Text | Comment | Pseudocode | Problem Descriptions |
Top-5 Test Case...
-
[18]
fail to pass even a single predefined test case
main paper. We continue the comparisons below. Ling et al. (2016) introduce datasets based on Hearthstone and Magic the Gathering card games for code generation. Oda et al. (2015) provide a language-to-code dataset using simple code comments. Zavershynskyi et al. (2018) introduce the NAPS dataset for converting pseudocode to code, obtained by crowdsourcin...