Measuring Coding Challenge Competence With APPS

Akul Arora; Collin Burns; Dan Hendrycks; Dawn Song; Ethan Guo; Horace He; Jacob Steinhardt; Mantas Mazeika; Samir Puranik; Saurav Kadavath

arxiv: 2105.09938 · v3 · submitted 2021-05-20 · 💻 cs.SE · cs.CL · cs.LG

Dan Hendrycks, Steven Basart, Saurav Kadavath, Mantas Mazeika, Akul Arora, Ethan Guo, Collin Burns, Samir Puranik, Horace He, Dawn Song, Jacob Steinhardt

Pith reviewed 2026-05-11 17:05 UTC · model grok-4.3

classification 💻 cs.SE · cs.CL · cs.LG

keywords code generation · benchmark · machine learning · programming problems · Python · test cases · natural language specification

The pith

The APPS benchmark shows machine learning models are beginning to learn coding by passing roughly 20 percent of test cases on introductory problems.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces APPS, a benchmark of 10,000 coding problems that tests whether models can translate natural language problem descriptions into correct Python code. Models are scored by how well their generated code passes hidden test cases, similar to how companies evaluate developers. The authors fine-tune large language models and observe that syntax errors become rarer as models improve. They report that models like GPT-Neo pass roughly 20 percent of the test cases on introductory problems, which is a weaker statement than solving 20 percent of those problems outright. This suggests that machine learning is making initial progress on the broad skill of programming.

Core claim

APPS contains 10,000 problems ranging from simple one-line solutions to substantial algorithmic challenges. Evaluating generated code on test cases, the authors find that recent models pass approximately 20% of the test cases on introductory problems. After fine-tuning on GitHub and the APPS training set, the prevalence of syntax errors decreases exponentially as models improve.
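One way to measure the syntax-error trend described above is to try parsing each generated sample. The following is a minimal sketch, not the authors' tooling: the helper names `has_syntax_error` and `syntax_error_rate` and the toy samples are hypothetical, and it assumes that "syntax error" means the sample fails to parse as Python.

```python
import ast


def has_syntax_error(source: str) -> bool:
    """Return True if the source does not parse as Python."""
    try:
        ast.parse(source)
    except SyntaxError:  # IndentationError is a subclass of SyntaxError
        return True
    return False


def syntax_error_rate(samples):
    """Fraction of samples that fail to parse."""
    return sum(has_syntax_error(s) for s in samples) / len(samples)


# Toy samples for illustration only; two parse and two do not.
samples = [
    "print(1)",
    "def f(:\n    pass",
    "for i in range(3) print(i)",
    "x = [1, 2, 3]",
]
rate = syntax_error_rate(samples)
print(f"syntax error rate: {rate:.0%}")
```

Tracking this rate across model sizes or checkpoints would give the kind of curve the claim refers to; parsing says nothing about whether the code is correct.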

What carries the argument

The APPS benchmark, which evaluates code generation models by executing their Python outputs against hidden test cases that check natural language problem specifications.
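The mechanism can be sketched as a small harness that runs a candidate program on each test input and compares its output with the expected output. This is an illustration under stated assumptions, not the benchmark's actual evaluation code: it assumes problems read from standard input and write to standard output, it compares outputs token by token after whitespace splitting, and the names `run_candidate` and `fraction_passed` are hypothetical.

```python
import subprocess
import sys


def run_candidate(source: str, stdin_text: str, timeout_s: float = 5.0):
    """Run candidate source in a fresh interpreter.

    Returns captured stdout, or None on timeout or non-zero exit.
    """
    try:
        proc = subprocess.run(
            [sys.executable, "-c", source],
            input=stdin_text,
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return None
    if proc.returncode != 0:
        return None
    return proc.stdout


def fraction_passed(source: str, test_cases) -> float:
    """Fraction of (stdin, expected stdout) pairs the candidate gets right."""
    passed = 0
    for stdin_text, expected in test_cases:
        out = run_candidate(source, stdin_text)
        # Assumption: whitespace differences are ignored when comparing.
        if out is not None and out.split() == expected.split():
            passed += 1
    return passed / len(test_cases)


# Toy problem: read two integers and print their sum.
candidate = "a, b = map(int, input().split())\nprint(a + b)\n"
tests = [("1 2\n", "3\n"), ("10 -4\n", "6\n"), ("0 0\n", "0\n")]
score = fraction_passed(candidate, tests)
print(score)
```

Running each candidate in a separate process with a timeout keeps crashes and infinite loops from taking down the harness. A real evaluation would also need sandboxing, which this sketch omits.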

Load-bearing premise

Success on the provided test cases for each problem means the generated code satisfies the original natural language specification.

What would settle it

A model that passes all test cases on a problem yet produces code that fails to match the natural language intent on some untested input, or sustained inability of models to exceed low single-digit percentages even after larger-scale training.
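The first scenario can be made concrete with a toy example: a candidate that passes every provided test but disagrees with a reference solution on an input the tests never exercise. The sketch below is hypothetical throughout (the problem, the functions, and the tests are invented for illustration) and uses a small exhaustive search in place of a real adversarial test generator.

```python
from itertools import product


def reference_max(xs):
    """Trusted solution: largest element of a non-empty list."""
    return max(xs)


def candidate_max(xs):
    """Buggy candidate: starts from 0, so all-negative lists are wrong."""
    best = 0
    for x in xs:
        if x > best:
            best = x
    return best


# Sparse provided tests that happen to use only non-negative values.
provided_tests = [([1, 2, 3], 3), ([5], 5), ([0, 7, 2], 7)]
passes_provided = all(candidate_max(xs) == want for xs, want in provided_tests)


def find_counterexample(candidate, reference, values=range(-2, 3), max_len=3):
    """Search small inputs for one where candidate and reference disagree."""
    for n in range(1, max_len + 1):
        for xs in product(values, repeat=n):
            if candidate(list(xs)) != reference(list(xs)):
                return list(xs)
    return None


counterexample = find_counterexample(candidate_max, reference_max)
print(passes_provided, counterexample)
```

Here the candidate passes all three provided tests yet fails on a single negative number. A check of this kind requires a trusted reference solution and an input generator that respects the problem's constraints.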

Original abstract

While programming is one of the most broadly applicable skills in modern society, modern machine learning models still cannot code solutions to basic problems. Despite its importance, there has been surprisingly little work on evaluating code generation, and it can be difficult to accurately assess code generation performance rigorously. To meet this challenge, we introduce APPS, a benchmark for code generation. Unlike prior work in more restricted settings, our benchmark measures the ability of models to take an arbitrary natural language specification and generate satisfactory Python code. Similar to how companies assess candidate software developers, we then evaluate models by checking their generated code on test cases. Our benchmark includes 10,000 problems, which range from having simple one-line solutions to being substantial algorithmic challenges. We fine-tune large language models on both GitHub and our training set, and we find that the prevalence of syntax errors is decreasing exponentially as models improve. Recent models such as GPT-Neo can pass approximately 20% of the test cases of introductory problems, so we find that machine learning models are now beginning to learn how to code. As the social significance of automatic code generation increases over the coming years, our benchmark can provide an important measure for tracking advancements.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

APPS is a useful new benchmark for natural-language code generation at scale, but the 20% pass-rate claim overreaches without stronger evidence that the test suites actually verify the specs.

The paper introduces APPS, a collection of 10,000 problems where models must turn arbitrary natural-language descriptions into Python code that passes provided test cases. Problems range from one-liners to full algorithmic tasks. This setup is new in its combination of scale, open-ended specs, and automatic test-based grading. Earlier code-generation evaluations were narrower, often using fixed templates or small hand-checked sets.

The authors also show that fine-tuning on GitHub plus their training data cuts syntax errors exponentially and that GPT-Neo clears roughly 20% of the test cases on the easiest problems. Those are concrete, reproducible numbers that give the field a shared yardstick. The benchmark and evaluation protocol are defined independently of any model, which keeps the circularity burden low. The work is honest about its scope and supplies baselines that others can build on.

The soft spot is the leap from “20% of test cases pass on intro problems” to “models are now beginning to learn how to code.” The abstract gives no audit of test-suite completeness, no count of how often passing code still fails on plausible unseen inputs that obey the spec, and no check for correlation between test cases and common training patterns. If the suites are sparse or biased toward easy cases, the headline number can be reached without general competence. That concern is real and load-bearing for the interpretation, even if the raw benchmark numbers hold up.

The paper is aimed at researchers who need a practical, automatically scored measure of code generation progress. Anyone tracking when these systems might become useful for real software tasks will find the dataset and protocol worth looking at. It is coherent on its own terms and deserves a serious referee who can press on the test-coverage details and ask for more ablations. I would send it to review rather than desk-reject.

Referee Report

2 major / 2 minor

Summary. The paper introduces the APPS benchmark consisting of 10,000 natural language programming problems paired with test cases to evaluate code generation from arbitrary specifications. Models are fine-tuned on GitHub and APPS training data; the authors report an exponential reduction in syntax errors with improving models and that GPT-Neo passes approximately 20% of test cases on introductory problems, concluding that machine learning models are beginning to learn how to code.

Significance. The creation of a large-scale benchmark with problems spanning simple one-line solutions to substantial algorithmic challenges is a valuable contribution for tracking progress in code generation. The empirical observation of exponential syntax-error reduction provides a concrete, falsifiable trend. If the test-case protocol is shown to be robust, the 20% pass-rate baseline on introductory problems offers a useful reference point for future work in automatic code synthesis.

major comments (2)

[Abstract and Evaluation] The headline claim that models are 'now beginning to learn how to code' rests on GPT-Neo passing ~20% of test cases for introductory problems. The manuscript supplies no information on data splits, test-case coverage of the natural-language specifications, or statistical controls (e.g., variance across random seeds or runs), leaving the robustness of the reported figure unclear.
[Benchmark description] The central inference that passing the supplied test cases indicates the generated code satisfies the original specification is load-bearing, yet the paper reports no audit of test-suite completeness, no adversarial augmentation of test cases, and no measurement of cases where code passes the given tests but fails on plausible unseen inputs consistent with the specification.

minor comments (2)

[Results] Tables summarizing pass rates broken down by problem difficulty (introductory/interview/competition) would improve readability and allow readers to assess trends more precisely.
[Experiments] A brief discussion of potential data leakage between the GitHub pre-training corpus and the APPS test set would strengthen the experimental protocol.
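A per-difficulty breakdown of the kind requested in the first minor comment could be computed as follows. This is a sketch on made-up placeholder data, not results from the paper. It reports two quantities per difficulty level: the average fraction of test cases passed per problem, and the fraction of problems on which every test passes. The variable names are illustrative.

```python
from collections import defaultdict

# Hypothetical per-problem results: (difficulty, tests_passed, tests_total).
# These numbers are placeholders for illustration, not measured values.
results = [
    ("introductory", 2, 4),
    ("introductory", 3, 3),
    ("interview", 0, 5),
    ("interview", 1, 4),
    ("competition", 0, 6),
]


def summarize(results):
    """Map each difficulty to (test_case_average, strict_accuracy)."""
    by_level = defaultdict(list)
    for level, passed, total in results:
        by_level[level].append((passed, total))
    summary = {}
    for level, rows in by_level.items():
        # Mean over problems of the fraction of tests passed.
        test_case_average = sum(p / t for p, t in rows) / len(rows)
        # Fraction of problems where all tests pass.
        strict_accuracy = sum(p == t for p, t in rows) / len(rows)
        summary[level] = (test_case_average, strict_accuracy)
    return summary


summary = summarize(results)
print(f"{'difficulty':<14}{'avg':>8}{'strict':>8}")
for level, (avg, strict) in summary.items():
    print(f"{level:<14}{avg:>8.1%}{strict:>8.1%}")
```

Reporting both columns side by side makes clear how far "fraction of test cases passed" can sit above "fraction of problems fully solved".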

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments and for recognizing the value of the APPS benchmark. We address each major comment point by point below, indicating where revisions will be made to improve clarity and acknowledge limitations.

Referee: [Abstract and Evaluation] The headline claim that models are 'now beginning to learn how to code' rests on GPT-Neo passing ~20% of test cases for introductory problems. The manuscript supplies no information on data splits, test-case coverage of the natural-language specifications, or statistical controls (e.g., variance across random seeds or runs), leaving the robustness of the reported figure unclear.

Authors: We agree that additional details would improve the robustness of the reported results. The data splits are described in Section 3, which specifies the 5,000/5,000 train/test division of the APPS problems. Test cases originate from the source competitive programming platforms and are intended to cover the natural language specifications, though we will add an explicit statement to this effect. For statistical controls, our primary results are from single runs; we will include a brief analysis of variance across random seeds in the revised Evaluation section. We will also update the abstract to include a short qualifier referencing these details. These changes will be incorporated in the next version.

revision: yes

Referee: [Benchmark description] The central inference that passing the supplied test cases indicates the generated code satisfies the original specification is load-bearing, yet the paper reports no audit of test-suite completeness, no adversarial augmentation of test cases, and no measurement of cases where code passes the given tests but fails on plausible unseen inputs consistent with the specification.

Authors: We acknowledge that test-case evaluation has inherent limitations and does not constitute a formal proof of correctness for all inputs. The paper does not include an audit of test-suite completeness or adversarial augmentation, as the focus is on establishing the benchmark and initial baselines rather than exhaustive verification. We will add a dedicated paragraph in the Discussion section noting this limitation, clarifying that passing the provided tests is the standard proxy used in code generation research (analogous to human assessment), and suggesting adversarial testing as an avenue for future work. No new empirical measurements will be performed for this revision.

revision: partial
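The variance analysis across random seeds discussed in the first response could be as simple as reporting the mean, sample standard deviation, and standard error of the per-seed pass rate. The sketch below uses deliberately artificial placeholder values, not measurements, and the name `summarize_seeds` is hypothetical.

```python
import math
import statistics


def summarize_seeds(pass_rates):
    """Return (mean, sample standard deviation, standard error)."""
    mean = statistics.mean(pass_rates)
    stdev = statistics.stdev(pass_rates)  # sample (n - 1) standard deviation
    stderr = stdev / math.sqrt(len(pass_rates))
    return mean, stdev, stderr


# Placeholder per-seed pass rates, chosen only to make the arithmetic obvious.
pass_rates = [0.10, 0.20, 0.30]
mean, stdev, stderr = summarize_seeds(pass_rates)
print(f"mean={mean:.3f} stdev={stdev:.3f} stderr={stderr:.3f}")
```

With only a handful of seeds the standard error is itself a rough estimate, so the spread is best reported alongside the number of runs.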

Circularity Check

0 steps flagged

No circularity in derivation chain

The paper introduces the APPS benchmark with 10,000 natural-language problems and associated test cases defined independently of any model. It then evaluates models by generating Python code from the problem statements and measuring pass rates on the fixed test suites, reporting empirical results such as GPT-Neo passing approximately 20% of test cases on introductory problems. This is a direct measurement against external test cases rather than any derivation, fitted parameter, or self-referential equation; the claim that models are beginning to learn to code is an interpretation of these observed pass rates. No load-bearing self-citations, ansatzes, or uniqueness theorems are invoked, and the evaluation protocol does not reduce to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the assumption that test-case passing is a sufficient proxy for code correctness; no free parameters or new physical entities are introduced.

axioms (1)

domain assumption Test cases supplied with each problem are sufficient to determine whether generated code satisfies the natural language specification.
The entire evaluation pipeline depends on this premise.

pith-pipeline@v0.9.0 · 5542 in / 1173 out tokens · 54330 ms · 2026-05-11T17:05:19.686344+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Foundation.LawOfExistence defect_zero_iff_one unclear

unclear
Relation between the paper passage and the cited Recognition theorem.

our benchmark measures the ability of models to take an arbitrary natural language specification and generate satisfactory Python code... we evaluate models by checking their generated code on test cases

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 59 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

AgentBench: Evaluating LLMs as Agents
cs.AI 2023-08 unverdicted novelty 8.0

AgentBench is a new multi-environment benchmark showing commercial LLMs outperform open-source models up to 70B parameters in agent tasks mainly due to better long-term reasoning and instruction following.
WebGameBench: Requirement-to-Application Evaluation for Coding Agents via Browser-Native Games
cs.AI 2026-05 unverdicted novelty 7.0

WebGameBench is a new benchmark that evaluates coding agents on building browser-native games from frozen specifications, with runtime browser evaluation showing best agents reach 76.9% usable rate but only 20.2% exce...
WebGameBench: Requirement-to-Application Evaluation for Coding Agents via Browser-Native Games
cs.AI 2026-05 unverdicted novelty 7.0

WebGameBench is a benchmark that evaluates coding agents by having them generate browser-native games from specifications, then running those games in a real browser to assign EXCELLENT, USABLE, or UNUSABLE labels, wi...
SaaSBench: Exploring the Boundaries of Coding Agents in Long-Horizon Enterprise SaaS Engineering
cs.SE 2026-05 unverdicted novelty 7.0

SaaSBench introduces a heterogeneous benchmark for enterprise SaaS engineering and shows that state-of-the-art coding agents fail over 95% of the time before reaching deep business logic due to setup and integration problems.
Text-to-CAD Evaluation with CADTests
cs.CV 2026-05 unverdicted novelty 7.0

Introduces CADTestBench as a test-based benchmark for Text-to-CAD and shows that using CADTests to guide generation produces simple baselines outperforming prior methods.
GameGen-Verifier: Parallel Keypoint-Based Verification for LLM-Generated Games via Runtime State Injection
cs.LG 2026-05 unverdicted novelty 7.0

GameGen-Verifier decomposes game specifications into keypoints, injects runtime states for targeted checks, and achieves 92.2% accuracy on 100 games while running up to 16.6x faster than agent-based baselines.
ProgramBench: Can Language Models Rebuild Programs From Scratch?
cs.SE 2026-05 unverdicted novelty 7.0

ProgramBench introduces 200 tasks where models must reconstruct full programs like FFmpeg or SQLite from docs alone; none of 9 evaluated LMs fully solve any task and the best passes 95% tests on only 3% of tasks while...
ARIADNE: Agentic Reward-Informed Adaptive Decision Exploration via Blackboard-Driven MCTS for Competitive Program Generation
cs.SE 2026-05 unverdicted novelty 7.0

ARIADNE combines blackboard architecture with MCTS to coordinate strategy, code, test, evaluation, and repair stages, yielding higher Pass@1 scores than prior LLM baselines on APPS, CodeContests, and related benchmarks.
Assessing the Impact of Requirement Ambiguity on LLM-based Function-Level Code Generation
cs.SE 2026-04 conditional novelty 7.0

Orchid benchmark shows requirement ambiguity degrades LLM code generation performance across all models, with advanced models hit hardest, and LLMs rarely detect or resolve the ambiguity themselves.
PlayCoder: Making LLM-Generated GUI Code Playable
cs.SE 2026-04 conditional novelty 7.0

PlayCoder raises the rate of LLM-generated GUI apps that can be played end-to-end without logic errors from near zero to 20.3% Play@3 by adding repository-aware generation, agent-driven testing, and iterative repair.
Self-Consistency from Only Two Samples: CoT-PoT Ensembling for Efficient LLM Reasoning
cs.CL 2026-04 unverdicted novelty 7.0

CoT-PoT ensembling achieves self-consistency accuracy in LLMs with only two samples for 78.6% of tasks, reducing computation by 9.3x compared to standard methods.
Precise Debugging Benchmark: Is Your Model Debugging or Regenerating?
cs.SE 2026-04 unverdicted novelty 7.0

The Precise Debugging Benchmark reveals that frontier LLMs achieve over 76% unit-test pass rates but below 45% edit precision when debugging, often regenerating rather than making minimal fixes.
Precise Debugging Benchmark: Is Your Model Debugging or Regenerating?
cs.SE 2026-04 unverdicted novelty 7.0

Frontier LLMs pass unit tests over 76% of the time on debugging tasks but achieve edit precision below 45%, indicating regeneration rather than precise debugging.
Structural Anchors and Reasoning Fragility: Understanding CoT Robustness in LLM4Code
cs.SE 2026-04 unverdicted novelty 7.0

CoT prompting in LLM4Code shows mixed robustness that depends on model family, task structure, and perturbations destabilizing structural anchors, leading to trajectory deformations like lengthening, branching, and si...
AdverMCTS: Combating Pseudo-Correctness in Code Generation via Adversarial Monte Carlo Tree Search
cs.SE 2026-04 unverdicted novelty 7.0

AdverMCTS frames code generation as a minimax game where an attacker evolves tests to expose flaws in solver-generated code, yielding more robust outputs than static-test baselines.
Coding with Eyes: Visual Feedback Unlocks Reliable GUI Code Generating and Debugging
cs.SE 2026-03 unverdicted novelty 7.0

VF-Coder raises GUI code success rate from 21.68% to 28.29% and visual score from 0.4284 to 0.5584 on a new 984-task benchmark by adding direct visual perception and interaction.
SWE-EVO: Benchmarking Coding Agents in Long-Horizon Software Evolution Scenarios
cs.SE 2025-12 unverdicted novelty 7.0

SWE-EVO shows GPT-5.4 with OpenHands reaching only 25% success on complex multi-file evolution tasks versus 72.8% on SWE-Bench Verified, and introduces Fix Rate as a partial-progress metric.
CodeRL+: Improving Code Generation via Reinforcement with Execution Semantics Alignment
cs.SE 2025-10 conditional novelty 7.0

CodeRL+ integrates variable-level execution trajectory inference into RLVR training to align textual code representations with execution semantics, delivering 4.6% relative pass@1 gains and generalization to code-reas...
Do AI Models Dream of Faster Code? An Empirical Study on LLM-Proposed Performance Improvements in Real-World Software
cs.SE 2025-10 unverdicted novelty 7.0

LLMs propose volatile performance improvements on real-world Java tasks that lag human developers on average, showing algorithmic benchmarks overestimate capabilities.
Orak: A Foundational Benchmark for Training and Evaluating LLM Agents on Diverse Video Games
cs.AI 2025-06 unverdicted novelty 7.0

Orak is a foundational benchmark providing training data, interfaces, and evaluation tools for LLM agents across diverse video game genres.
CodeFlowBench: A Multi-turn, Iterative Benchmark for Complex Code Generation
cs.SE 2025-04 unverdicted novelty 7.0

CodeFlowBench is a new benchmark with 5000+ problems and GitHub-sourced repos that evaluates LLMs on multi-turn code reuse using dependency-tree structural metrics, revealing performance drops as complexity rises.
OpenClassGen: A Large-Scale Corpus of Real-World Python Classes for LLM Research
cs.SE 2025-04 accept novelty 7.0

OpenClassGen supplies 324,843 real-world Python classes with self-contained skeletons and static metrics to support LLM class generation research and evaluation.
MLE-bench: Evaluating Machine Learning Agents on Machine Learning Engineering
cs.CL 2024-10 unverdicted novelty 7.0

MLE-bench evaluates frontier language models as ML engineering agents on 75 Kaggle competitions, with the top setup (o1-preview + AIDE) reaching bronze medal level in 16.9% of tasks.
Holistic Evaluation of Language Models
cs.CL 2022-11 accept novelty 7.0

HELM establishes a multi-metric evaluation covering 30 language models on 42 scenarios (16 core) to raise average scenario coverage from 17.9% to 96% under uniform conditions while releasing all prompts, completions, ...
Design and Report Benchmarks for Knowledge Work
cs.AI 2026-05 unverdicted novelty 6.0

Proposes a three-step benchmark design method (define work activity, specify tested setting, score work product) derived from work studies and O*NET, demonstrated via three case analyses.
Ensemble Monitoring for AI Control: Diverse Signals Outweigh More Compute
cs.AI 2026-05 unverdicted novelty 6.0

Diverse ensembles of prompted and fine-tuned GPT-4.1-Mini monitors achieve 2.4x better detection of flawed code solutions than homogeneous ensembles on adversarial inputs.
Uncertainty Quantification for LLM-based Code Generation
cs.SE 2026-05 unverdicted novelty 6.0

RisCoSet applies multiple hypothesis testing to construct risk-controlling partial-program prediction sets for LLM code generation, achieving up to 24.5% less code removal than prior methods at equivalent risk levels.
DARE: Difficulty-Adaptive Reinforcement Learning with Co-Evolved Difficulty Estimation
cs.LG 2026-05 unverdicted novelty 6.0

DARE co-evolves difficulty estimation and policy in RL for LLMs to improve training efficiency, final performance, and inference speed by using tailored strategies for different difficulty levels.
Using Semantic Distance to Estimate Uncertainty in LLM-Based Code Generation
cs.SE 2026-05 unverdicted novelty 6.0

Semantic distance on program execution behaviors improves uncertainty estimation for LLM code generation and outperforms prior sample-based methods across benchmarks and models.
VeriContest: A Competitive-Programming Benchmark for Verifiable Code Generation
cs.SE 2026-05 unverdicted novelty 6.0

VeriContest supplies 946 problems with specs, code, proofs, and tests to benchmark verifiable code generation in Rust/Verus, showing models reach 92% on code but only 5% end-to-end on full verifiable synthesis.
Beyond Pairs: Your Language Model is Secretly Optimizing a Preference Graph
cs.LG 2026-05 unverdicted novelty 6.0

GraphDPO generalizes pairwise DPO to a graph-structured Plackett-Luce objective over DAGs induced by rollout rankings, enforcing transitivity with linear complexity and recovering DPO as a special case.
PERSA: Reinforcement Learning for Professor-Style Personalized Feedback with LLMs
cs.AI 2026-05 unverdicted novelty 6.0

PERSA combines RLHF with selective parameter-efficient updates to top transformer layers, raising style alignment scores from 35% to 96% on code feedback benchmarks while holding correctness near 100%.
Improving LLM Code Generation via Requirement-Aware Curriculum Reinforcement Learning
cs.SE 2026-05 unverdicted novelty 6.0

REC RL improves LLM code generation by automatically assessing and optimizing requirement difficulty with adaptive curriculum sampling, yielding 1.23-5.62% Pass@1 gains over baselines.
RealBench: A Repo-Level Code Generation Benchmark Aligned with Real-World Software Development Practices
cs.SE 2026-04 unverdicted novelty 6.0

RealBench is a new repo-level code generation benchmark that adds UML diagrams to natural language specs, showing LLMs struggle more at full repositories, create modules with errors, and perform best with whole-repo g...
Generalization in LLM Problem Solving: The Case of the Shortest Path
cs.AI 2026-04 unverdicted novelty 6.0

LLMs show strong spatial generalization to unseen maps in shortest-path tasks but fail length scaling due to recursive instability, with data coverage setting hard limits.
Babbling Suppression: Making LLMs Greener One Token at a Time
cs.SE 2026-04 unverdicted novelty 6.0

Babbling Suppression stops LLM code generation upon test passage to reduce token output and energy consumption by up to 65% across Python and Java benchmarks.
GDPO: Group reward-Decoupled Normalization Policy Optimization for Multi-reward RL Optimization
cs.CL 2026-01 unverdicted novelty 6.0

GDPO decouples per-reward normalization in multi-reward RL to avoid advantage collapse and improve convergence over GRPO on tool-calling, math, and coding tasks.
Mixture-of-Experts Can Surpass Dense LLMs Under Strictly Equal Resource
cs.CL 2025-06 conditional novelty 6.0

MoE models with activation rates in an optimal region outperform dense LLMs of identical total parameter count, training compute, and data budget, with the optimal region consistent across scales.
A Study of LLMs' Preferences for Libraries and Programming Languages
cs.SE 2025-03 unverdicted novelty 6.0

Empirical study of eight LLMs finds overuse of popular libraries like NumPy in up to 45% of unnecessary cases and strong default preference for Python even when suboptimal.
Process Reinforcement through Implicit Rewards
cs.LG 2025-02 conditional novelty 6.0

PRIME enables online process reward model updates in LLM RL using implicit rewards from rollouts and outcome labels, yielding 15.1% average gains on reasoning benchmarks and surpassing a stronger instruct model with 1...
Better & Faster Large Language Models via Multi-token Prediction
cs.CL 2024-04 conditional novelty 6.0

Multi-token prediction training yields higher sample efficiency, better benchmark scores on code generation, and up to 3x faster inference than standard next-token prediction for LLMs.
Smaug: Fixing Failure Modes of Preference Optimisation with DPO-Positive
cs.CL 2024-02 conditional novelty 6.0

DPOP is a new loss function that prevents DPO from lowering preferred response likelihoods and outperforms standard DPO on diverse datasets, MT-Bench, and enables Smaug-72B to exceed 80% on the Open LLM Leaderboard.
Program Synthesis with Large Language Models
cs.PL 2021-08 unverdicted novelty 6.0

Large language models synthesize Python code from descriptions with log-linear scaling in performance, reaching 59.6% on MBPP via few-shot prompting and 83.8% on MathQA-Python after fine-tuning, while human feedback h...
Can LLMs Produce Better Object-Oriented Designs than Human-Involved Development?
cs.SE 2026-05 unverdicted novelty 5.0

Comparative case study on a postgraduate Java assignment finds PureAI and PostAI projects simpler with lower code smell density than PreAI but show oversimplification and weaker responsibility separation.
Prompt Optimization for LLM Code Generation via Reinforcement Learning
cs.SE 2026-05 unverdicted novelty 5.0

A PPO agent with hybrid actions and test-driven rewards optimizes prompts for code LLMs, raising strict Pass@1 scores on MBPP+, HumanEval+, and APPS over prior methods.
A-ProS: Towards Reliable Autonomous Programming Through Multi-Model Feedback
cs.SE 2026-05 unverdicted novelty 5.0

A-ProS uses a hybrid multi-model feedback framework with stateful refinement to improve success rates on competitive programming problems, achieving over 2x gains compared to baseline agent loops.
Interactive Evaluation Requires a Design Science
cs.AI 2026-05 unverdicted novelty 5.0

Interactive evaluation of AI must be reframed as a distinct paradigm that maps interaction trajectories to judgments on process, recoverability, coordination, robustness, and system performance, supported by a two-axi...
Not All RAGs Are Created Equal: A Component-Wise Empirical Study for Software Engineering Tasks
cs.SE 2026-05 unverdicted novelty 5.0

Retriever-side choices, particularly the retrieval algorithm, exert more influence on RAG performance than generator selection across code generation, summarization, and repair tasks.
HyperLens: Quantifying Cognitive Effort in LLMs with Fine-grained Confidence Trajectory
cs.AI 2026-05 unverdicted novelty 5.0

HyperLens reveals that deeper transformer layers magnify small confidence changes into fine-grained trajectories, allowing quantification of cognitive effort where complex tasks demand more and standard SFT can reduce it.
Bridging the Gap between User Intent and LLM: A Requirement Alignment Approach for Code Generation
cs.SE 2026-04 unverdicted novelty 5.0

REA-Coder improves LLM code generation by iteratively aligning requirements with model understanding and verifying outputs against the aligned spec.
Towards Understanding, Analyzing, and Optimizing Agentic AI Execution: A CPU-Centric Perspective
cs.AI 2025-11 conditional novelty 5.0

The paper analyzes CPU bottlenecks in agentic AI serving, selects representative workloads, and demonstrates that CPU-aware scheduling optimizations COMB and MAS can reduce P50 latency by up to 1.7x and total latency ...
Mitigating Lost in Multi-turn Conversation via Curriculum RL with Verifiable Accuracy and Abstention Rewards
cs.CL 2025-10 unverdicted novelty 5.0

RLAAR applies competence-gated curriculum RL with mixed accuracy and abstention rewards to reduce Lost-in-Conversation degradation, raising benchmark accuracy from 62.6% to 75.1% and calibrated abstention from 33.5% to 73.4%.
Humanity's Last Exam
cs.LG 2025-01 unverdicted novelty 5.0

Humanity's Last Exam is a new 2,500-question benchmark at the frontier of human knowledge where state-of-the-art LLMs show low accuracy.
StarCoder: may the source be with you!
cs.CL 2023-05 accept novelty 5.0

StarCoderBase matches or beats OpenAI's code-cushman-001 on multi-language code benchmarks; the Python-fine-tuned StarCoder reaches 40% pass@1 on HumanEval while retaining other-language performance.
Exploring Pass-Rate Reward in Reinforcement Learning for Code Generation
cs.LG 2026-05 unverdicted novelty 4.0

Pass-rate rewards in critic-free RL for code generation fail to outperform binary rewards because partial-pass solutions induce conflicting gradient directions that do not consistently favor full correctness.
How Many Tries Does It Take? Iterative Self-Repair in LLM Code Generation Across Model Scales and Benchmarks
cs.SE 2026-04 unverdicted novelty 4.0

Iterative self-repair improves LLM code pass rates by 4.9-17.1 pp on HumanEval and 16-30 pp on MBPP across seven models, with gains concentrated early and syntax errors easier to fix than logical ones.
FLeX: Fourier-based Low-rank EXpansion for multilingual transfer
cs.LG 2026-04 unverdicted novelty 4.0

LoRA fine-tuning of Code Llama with Fourier regularization raises Java pass@1 from 34.2% to 42.1% while using a small high-quality dataset.
LLM-Based Multi-Agent Systems for Code Generation: A Multi-Vocal Literature Review
cs.SE 2026-02 unverdicted novelty 3.0

A review of 114 studies classifies motivations into nine categories, analyzes common models and benchmarks, synthesizes challenges into six categories with 26 subcategories and solutions, and identifies six future res...
Towards Large Reasoning Models: A Survey of Reinforced Reasoning with Large Language Models
cs.AI 2025-01 unverdicted novelty 3.0

The paper surveys reinforced reasoning techniques for LLMs, covering automated data construction, learning-to-reason methods, and test-time scaling as steps toward Large Reasoning Models.

Reference graph

Works this paper leans on

18 extracted references · 18 canonical work pages · cited by 57 Pith papers · 7 internal anchors

[1]

Mining source code repositories at massive scale using language modeling

Miltiadis Allamanis and Charles Sutton. Mining source code repositories at massive scale using language modeling. In 2013 10th Working Conference on Mining Software Repositories (MSR), pages 207–216. IEEE,

work page 2013
[2]

Sygus-comp 2018: Results and analysis

Rajeev Alur, Dana Fisman, Saswat Padhi, Rishabh Singh, and Abhishek Udupa. Sygus-comp 2018: Results and analysis. SYNT,

work page 2018
[3]

Language Models are Few-Shot Learners

URL https://doi.org/10.5281/zenodo.5297715. If you use this software, please cite it using these metadata. T. Brown, B. Mann, Nick Ryder, Melanie Subbiah, J. Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, G. Krüger, T. Henighan, R. Child, Aditya Ramesh, D. Ziegler, Jeffrey...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.5281/zenodo.5297715 2005
[4]

Evaluating Large Language Models Trained on Code

Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde, Jared Kaplan, Harri Edwards, Yura Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374,

work page internal anchor Pith review Pith/arXiv arXiv
[5]

The Pile: An 800GB Dataset of Diverse Text for Language Modeling

Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshima, et al. The pile: An 800gb dataset of diverse text for language modeling. arXiv preprint arXiv:2101.00027,

work page internal anchor Pith review Pith/arXiv arXiv
[6]

Datasheets for Datasets

Timnit Gebru, Jamie Morgenstern, Briana Vecchione, Jennifer Wortman Vaughan, Hanna Wallach, Hal Daumé III, and Kate Crawford. Datasheets for datasets. arXiv preprint arXiv:1803.09010,

work page arXiv
[7]

Measuring Mathematical Problem Solving With the MATH Dataset

Dan Hendrycks, Collin Burns, Steven Basart, Andrew Critch, Jerry Li, Dawn Song, and Jacob Steinhardt. Aligning AI with shared human values. Proceedings of the International Conference on Learning Representations (ICLR), 2021a. Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask...

work page internal anchor Pith review Pith/arXiv arXiv
[8]

Mapping language to code in programmatic context

Srinivasan Iyer, Ioannis Konstas, Alvin Cheung, and Luke Zettlemoyer. Mapping language to code in programmatic context. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, October-November

work page 2018
[9]

Unsupervised translation of programming languages

Marie-Anne Lachaux, Baptiste Roziere, Lowik Chanussot, and Guillaume Lample. Unsupervised translation of programming languages. arXiv preprint arXiv:2006.03511,

work page arXiv 2006
[10]

W. Ling, P. Blunsom, Edward Grefenstette, K. Hermann, Tomás Kociský, Fumin Wang, and A. Senior. Latent predictor networks for code generation. ArXiv, abs/1603.06744,

work page Pith review arXiv
[11]

Generative Language Modeling for Automated Theorem Proving

Stanislas Polu and Ilya Sutskever. Generative language modeling for automated theorem proving. ArXiv, abs/2009.03393,

work page internal anchor Pith review arXiv 2009
[12]

Veselin Raychev, Pavol Bielik, and Martin T. Vechev. Probabilistic model for code with decision trees. Proceedings of the 2016 ACM SIGPLAN International Conference on Object-Oriented Programming, Systems, Languages, and Applications,

work page 2016
[13]

CodeBLEU: a Method for Automatic Evaluation of Code Synthesis

Shuo Ren, Daya Guo, Shuai Lu, L. Zhou, Shujie Liu, Duyu Tang, M. Zhou, A. Blanco, and S. Ma. Codebleu: a method for automatic evaluation of code synthesis. ArXiv, abs/2009.10297,

work page internal anchor Pith review arXiv 2009
[14]

Attention Is All You Need

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, L. Kaiser, and Illia Polosukhin. Attention is all you need. ArXiv, abs/1706.03762,

work page internal anchor Pith review Pith/arXiv arXiv
[15]

arXiv preprint arXiv:1911.04942

Alex Wang, Yada Pruksachatkun, Nikita Nangia, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. Superglue: A stickier benchmark for general-purpose language understanding systems. In NeurIPS, 2019a. Alex Wang, Yada Pruksachatkun, Nikita Nangia, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. Superglue...

work page arXiv 1911
[16]

Spider: A Large-Scale Human-Labeled Dataset for Complex and Cross-Domain Semantic Parsing and Text-to-SQL Task

Tao Yu, Rui Zhang, Kai Yang, Michihiro Yasunaga, Dongxu Wang, Zifan Li, James Ma, Irene Li, Qingning Yao, Shanelle Roman, et al. Spider: A large-scale human-labeled dataset for complex and cross-domain semantic parsing and text-to-sql task. arXiv preprint arXiv:1809.08887,

work page Pith review arXiv
[17]

the fair use of a copyrighted work, including such use by ... scholarship, or research, is not an infringement of copyright

                          Hearthstone   Django    NAPS         APPS
Programming Language      Python        Python    UAST         Python
Test Cases
Number of Programs        665           18,805    17,477       232,421
Lines per Program (Avg.)  7.7           1         21.7         18.0
Number of Exercises       665           18,805    2,231        10,000
Text Input                Card Text     Comment   Pseudocode   Problem Descriptions

Table 4: Further comparisons of APPS with previous datasets. Top-5 Test Case...

work page 2018
[18]

fail to pass even a single predefined test case

main paper. We continue the comparisons below. Ling et al. (2016) introduce datasets based on Hearthstone and Magic the Gathering card games for code generation. Oda et al. (2015) provide a language-to-code dataset using simple code comments. Zavershynskyi et al. (2018) introduce the NAPS dataset for converting pseudocode to code, obtained by crowdsourcin...

work page 2016
