pith. machine review for the scientific record.

arxiv: 2403.07974 · v2 · submitted 2024-03-12 · 💻 cs.SE · cs.CL · cs.LG

Recognition: 3 Lean theorem links

LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 17:27 UTC · model grok-4.3

classification 💻 cs.SE · cs.CL · cs.LG
keywords LLM code evaluation · benchmark contamination · live benchmark · code generation · programming contests · self-repair · model overfitting · holistic assessment

The pith

LiveCodeBench gathers recent contest problems to evaluate LLMs on code without training data contamination and across multiple skills.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Existing benchmarks for code LLMs like HumanEval risk contamination from training data, limiting how much they reveal about actual model abilities. The paper introduces LiveCodeBench, which pulls fresh problems from LeetCode, AtCoder, and CodeForces contests after May 2023 and tests not only code generation but also self-repair, execution, and test output prediction. It applies this setup to 18 base and 34 instruction-tuned models, producing comparisons that highlight performance gaps and signs of overfitting on older tests. A reader would care because accurate, up-to-date evaluations matter for knowing whether models can handle real coding work without relying on leaked examples.
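To make the contamination claim concrete, here is a minimal sketch of the temporal filter the setup implies: keep only problems released after a model's training-data cutoff. The record fields, dates, and function names are illustrative assumptions, not the paper's actual schema.

    from datetime import date

    # Hypothetical problem records; fields and dates are placeholders.
    problems = [
        {"platform": "leetcode", "slug": "p1", "released": date(2023, 8, 14)},
        {"platform": "atcoder",  "slug": "p2", "released": date(2022, 11, 2)},
    ]

    def contamination_free(problems, model_cutoff):
        """Keep only problems published after the model's training-data cutoff."""
        return [p for p in problems if p["released"] > model_cutoff]

    # Evaluate each model only on problems newer than its own cutoff.
    recent = contamination_free(problems, model_cutoff=date(2023, 5, 1))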

Core claim

The paper establishes that LiveCodeBench, built by continuously collecting 400 high-quality problems from contests on LeetCode, AtCoder, and CodeForces between May 2023 and May 2024, supplies a contamination-free and holistic evaluation of LLMs for code that covers generation, self-repair, code execution, and test output prediction, yielding empirical results on contamination levels, model comparisons, and overfitting in static benchmarks.

What carries the argument

LiveCodeBench, a dynamic set of contest problems paired with prompts for multiple code tasks that updates over time.
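As a rough illustration of how one contest problem can fan out into the four scenarios, consider the sketch below. The prompt wording and field names are invented for illustration; they are not LiveCodeBench's actual templates.

    # One problem record drives four evaluation scenarios. All fields here
    # (statement, wrong_code, error, test_input, solution) are assumed names.
    def render_prompts(problem):
        return {
            "generation":
                f"Solve this problem:\n{problem['statement']}",
            "self_repair":
                f"This attempt fails with {problem['error']}; fix it:\n"
                f"{problem['statement']}\n{problem['wrong_code']}",
            "execution":
                f"What does this program output on input {problem['test_input']}?\n"
                f"{problem['solution']}",
            "test_output_prediction":
                f"Predict the expected output for input {problem['test_input']}:\n"
                f"{problem['statement']}",
        }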

If this is right

  • Static benchmarks can no longer be trusted to measure true progress once contamination is possible.
  • Models must succeed on repair, execution, and prediction tasks to demonstrate broad code competence.
  • Ongoing collection of new problems keeps the evaluation relevant as models improve.
  • Public release of prompts and completions enables targeted analysis of where models fail.
  • The provided toolkit supports adding new evaluation scenarios without rebuilding the benchmark.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar live-collection methods could be applied to other domains such as mathematics problem solving to reduce leakage.
  • Model developers may begin training or fine-tuning specifically on recent contest data to improve scores.
  • Production teams could adopt this benchmark to select models less likely to rely on memorized solutions.
  • The approach implies that all future code benchmarks should incorporate temporal freshness as a core requirement.

Load-bearing premise

Problems taken from recent contests have not entered the training data of the evaluated models and adequately represent real coding demands.

What would settle it

Discovery that a substantial number of LiveCodeBench problems appear in the pretraining corpora of the tested models, or that model rankings on this benchmark match those on older contaminated ones exactly.
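The second half of that test is easy to run. A minimal sketch of the ranking-agreement check, assuming hypothetical per-model scores (the numbers below are invented, not results from the paper): a Spearman rho of 1.0 would mean the two benchmarks rank models identically.

    from scipy.stats import spearmanr

    # Invented placeholder scores for four unnamed models.
    live_scores   = {"m1": 41.2, "m2": 35.7, "m3": 28.9, "m4": 22.4}
    static_scores = {"m1": 88.4, "m2": 91.0, "m3": 79.3, "m4": 80.1}

    models = sorted(live_scores)
    rho, p = spearmanr([live_scores[m] for m in models],
                       [static_scores[m] for m in models])
    print(f"Spearman rho = {rho:.2f}")  # rho well below 1 means the benchmarks disagree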

read the original abstract

Large Language Models (LLMs) applied to code-related applications have emerged as a prominent field, attracting significant interest from both academia and industry. However, as new and improved LLMs are developed, existing evaluation benchmarks (e.g., HumanEval, MBPP) are no longer sufficient for assessing their capabilities. In this work, we propose LiveCodeBench, a comprehensive and contamination-free evaluation of LLMs for code, which continuously collects new problems over time from contests across three competition platforms, namely LeetCode, AtCoder, and CodeForces. Notably, our benchmark also focuses on a broader range of code related capabilities, such as self-repair, code execution, and test output prediction, beyond just code generation. Currently, LiveCodeBench hosts four hundred high-quality coding problems that were published between May 2023 and May 2024. We have evaluated 18 base LLMs and 34 instruction-tuned LLMs on LiveCodeBench. We present empirical findings on contamination, holistic performance comparisons, potential overfitting in existing benchmarks as well as individual model comparisons. We will release all prompts and model completions for further community analysis, along with a general toolkit for adding new scenarios and models.

Editorial analysis

A structured set of objections, weighed in public.

Referee report, simulated author's rebuttal, a circularity audit, and an axiom and free-parameter ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces LiveCodeBench, a benchmark that continuously collects 400 coding problems published between May 2023 and May 2024 from LeetCode, AtCoder, and CodeForces contests. It evaluates 18 base and 34 instruction-tuned LLMs (52 total) not only on code generation but also on self-repair, code execution, and test output prediction. The work reports empirical findings on contamination, holistic performance, potential overfitting in static benchmarks such as HumanEval and MBPP, and plans to release all prompts, completions, and a toolkit for extension.

Significance. If the contamination-free claim can be substantiated, LiveCodeBench would provide a valuable, evolving resource for assessing generalization in code LLMs beyond saturated static benchmarks. The multi-capability scope and public release of data and completions are strengths that support reproducibility and community use.

major comments (2)
  1. [Abstract and Introduction] The central claim that LiveCodeBench is 'contamination-free' rests on the recency of the 400 problems (May 2023–May 2024) without describing any direct verification procedure, such as n-gram overlap searches against public web archives, GitHub repositories, or known training data proxies where contest problems and solutions are routinely posted shortly after release.
  2. [Evaluation and Results sections] The reported empirical findings on contamination and overfitting (performance gaps versus HumanEval/MBPP) do not specify how problem quality was filtered, how test cases were validated, or whether prompt variations and temperature settings were controlled across the 52 models; without these details the attribution of differences to contamination versus difficulty or prompt sensitivity remains unclear.
minor comments (2)
  1. [Data Collection] The manuscript should clarify the exact criteria used to select and deduplicate the 400 problems across the three platforms.
  2. [Figures and Tables] Figure captions and table headers could more explicitly state the number of problems per capability (generation, repair, execution, prediction) to aid interpretation of the holistic comparisons.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address the two major comments point by point below, providing clarifications and committing to revisions that strengthen the substantiation of our claims without altering the core contributions.

read point-by-point responses
  1. Referee: [Abstract and Introduction] The central claim that LiveCodeBench is 'contamination-free' rests on the recency of the 400 problems (May 2023–May 2024) without describing any direct verification procedure, such as n-gram overlap searches against public web archives, GitHub repositories, or known training data proxies where contest problems and solutions are routinely posted shortly after release.

    Authors: We appreciate the referee highlighting the need for explicit verification procedures to support the contamination-free claim. The manuscript grounds this claim in the recency of the problems (May 2023–May 2024) relative to the documented training data cutoffs of the 52 evaluated models. To address the concern directly, the revised manuscript will include a new subsection under Data Collection that reports n-gram overlap analyses (n=5–10) performed against public web archives, GitHub repositories of contest solutions, and other accessible proxies. We will present the overlap statistics, discuss any detected cases, and explain why recency combined with these checks provides a practical and defensible basis for the claim, while acknowledging the inherent limits of verifying against proprietary training corpora. revision: yes
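For concreteness, a minimal sketch of the kind of n-gram overlap check the response describes, using the low end of the proposed n=5–10 range. Whitespace tokenization and a single corpus string are simplifying assumptions; a real audit would scan web archives and solution repositories at scale.

    def ngrams(text, n=5):
        toks = text.split()
        return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

    def overlap_fraction(problem_text, corpus_text, n=5):
        """Fraction of the problem's n-grams also present in the corpus."""
        p = ngrams(problem_text, n)
        return len(p & ngrams(corpus_text, n)) / len(p) if p else 0.0

    # A high fraction flags a problem as potentially leaked into training data.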

  2. Referee: [Evaluation and Results sections] The reported empirical findings on contamination and overfitting (performance gaps versus HumanEval/MBPP) do not specify how problem quality was filtered, how test cases were validated, or whether prompt variations and temperature settings were controlled across the 52 models; without these details the attribution of differences to contamination versus difficulty or prompt sensitivity remains unclear.

    Authors: We agree that greater methodological transparency is required to support the attribution of results. The original manuscript notes that problems are drawn directly from official contest platforms and are therefore high-quality, but we will expand the Evaluation and Results sections in the revision. Specifically, we will add: (1) explicit quality filtering criteria, including requirements for complete problem statements, at least three test cases per problem, and balanced coverage of difficulty levels across LeetCode, AtCoder, and Codeforces; (2) test-case validation details, describing automated execution against reference solutions plus manual spot-checks for correctness and edge-case coverage; and (3) experimental controls, confirming use of fixed prompt templates, temperature=0 for deterministic code generation, and identical decoding parameters across all models. These additions will make clear that observed gaps versus HumanEval/MBPP are not artifacts of inconsistent prompting or unvalidated tests. revision: yes
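A minimal sketch of the controls promised in point (3): one fixed prompt template, greedy decoding, and identical test execution for every model. The template wording is invented, and generate stands in for whatever model API is used; it is not a real library call.

    import subprocess, sys

    PROMPT_TEMPLATE = "### Question:\n{statement}\n\n### Answer:\n"  # assumed wording

    def run_test(code, test):
        """Run the candidate program on the test's stdin; compare stdout."""
        out = subprocess.run([sys.executable, "-c", code], input=test["input"],
                             capture_output=True, text=True, timeout=10)
        return out.stdout.strip() == test["expected"].strip()

    def evaluate(model, problems, generate):
        passed = 0
        for prob in problems:
            prompt = PROMPT_TEMPLATE.format(statement=prob["statement"])
            code = generate(model, prompt, temperature=0.0)  # deterministic decoding
            passed += all(run_test(code, t) for t in prob["tests"])
        return passed / len(problems)  # pass@1 under identical settings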

Circularity Check

0 steps flagged

No circularity: empirical benchmark built from external contest data

full rationale

The paper constructs LiveCodeBench by collecting 400 problems published May 2023–May 2024 from independent contest platforms (LeetCode, AtCoder, CodeForces) and evaluates 52 LLMs on them, releasing prompts and completions. No derivations, equations, fitted parameters, or predictions appear; the contamination-free claim rests on problem recency and external sources rather than any self-referential definition, self-citation chain, or renaming of prior results. The central claims are therefore independent of the paper's own outputs and do not reduce to its inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the assumption that contest problems published after May 2023 are uncontaminated and representative; no free parameters, invented entities, or additional axioms are introduced beyond standard benchmark construction practices.

axioms (1)
  • domain assumption: Problems from recent LeetCode, AtCoder, and CodeForces contests are free from contamination in current LLM training corpora.
    Stated in the abstract as the basis for the contamination-free claim.

pith-pipeline@v0.9.0 · 5546 in / 1310 out tokens · 41937 ms · 2026-05-10T17:27:24.585537+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Unsteady Metrics and Benchmarking Cultures of AI Model Builders

    cs.AI 2026-05 accept novelty 8.0

    AI model builders mostly highlight unique benchmarks that act as flexible narrative tools for market positioning rather than standardized scientific measurements.

  2. FlowCompile: An Optimizing Compiler for Structured LLM Workflows

    cs.CL 2026-05 unverdicted novelty 8.0

    FlowCompile performs compile-time design space exploration on structured LLM workflows to produce reusable high-quality configuration sets that outperform routing baselines with up to 6.4x speedup.

  3. WildClawBench: A Benchmark for Real-World, Long-Horizon Agent Evaluation

    cs.CL 2026-05 unverdicted novelty 8.0

    A new native-runtime benchmark reveals that current frontier AI agents succeed on at most 62 percent of realistic long-horizon CLI tasks.

  4. Lightning OPD: Efficient Post-Training for Large Reasoning Models with Offline On-Policy Distillation

    cs.LG 2026-04 unverdicted novelty 8.0

    Lightning OPD enforces teacher consistency by precomputing log-probabilities over SFT rollouts, matching standard OPD performance with bounded gradient discrepancy and achieving 4x speedup on math and code reasoning tasks.

  5. LiveBench: A Challenging, Contamination-Limited LLM Benchmark

    cs.CL 2024-06 unverdicted novelty 8.0

    LiveBench is a contamination-limited LLM benchmark with auto-scored challenging tasks from recent sources across math, coding, reasoning and more, where top models score below 70%.

  6. Multi-Objective and Mixed-Reward Reinforcement Learning via Reward-Decorrelated Policy Optimization

    cs.LG 2026-05 unverdicted novelty 7.0

    RDPO applies magnitude-aware quantile normalization and Mahalanobis whitening to decorrelate heterogeneous rewards in multi-objective RL, improving instruction following and writing quality on LongCat-Flash post-train...

  7. AgentLens: Revealing The Lucky Pass Problem in SWE-Agent Evaluation

    cs.SE 2026-05 conditional novelty 7.0

    10.7% of passing SWE-agent trajectories are Lucky Passes with chaotic behaviors, and a quality score based on process references changes model rankings across eight backends.

  8. StepCodeReasoner: Aligning Code Reasoning with Stepwise Execution Traces via Reinforcement Learning

    cs.SE 2026-05 unverdicted novelty 7.0

    StepCodeReasoner aligns code reasoning with verifiable stepwise execution traces via print anchors and bi-level GRPO reinforcement learning, reaching SOTA results on CRUXEval (91.1%) and LiveCodeBench (86.5%) for a 7B model.

  9. From Generic Correlation to Input-Specific Credit in On-Policy Self Distillation

    cs.LG 2026-05 conditional novelty 7.0

    Self-distillation token rewards measure input-response-feedback pointwise mutual information, and CREDIT extracts the input-specific component with contrastive baselines to improve LLM reasoning performance.

  10. Primal Generation, Dual Judgment: Self-Training from Test-Time Scaling

    cs.LG 2026-05 unverdicted novelty 7.0

    DuST uses on-policy RL to train code models on ranking their own sampled solutions by sandbox execution correctness, improving judgment NDCG, pass@1, and Best-of-4 accuracy while showing that SFT on the same data does...

  11. Primal Generation, Dual Judgment: Self-Training from Test-Time Scaling

    cs.LG 2026-05 conditional novelty 7.0

    DuST self-trains LLMs for code generation by ranking their own test-time samples via sandbox execution and applying GRPO, improving judgment by +6.2 NDCG and single-sample pass@1 by +3.1 on LiveCodeBench.

  12. Test-Time Speculation

    cs.CL 2026-05 unverdicted novelty 7.0

    Test-Time Speculation adapts draft models online via target-model verifications to sustain high acceptance lengths during long LLM generations.

  13. ProactBench: Beyond What The User Asked For

    cs.LG 2026-05 unverdicted novelty 7.0

    ProactBench measures LLM conversational proactivity in three phases using 198 multi-agent dialogues and finds recovery behavior hard to predict from existing benchmarks.

  14. Star Elastic: Many-in-One Reasoning LLMs with Efficient Budget Control

    cs.LG 2026-05 unverdicted novelty 7.0

    Star Elastic trains N nested submodels in a single post-training job on a parent reasoning LLM, supporting elastic budget control that matches or exceeds independent baselines while cutting training compute by up to 360x.

  15. TeamBench: Evaluating Agent Coordination under Enforced Role Separation

    cs.AI 2026-05 unverdicted novelty 7.0

    Enforcing role separation in agent teams reveals that prompt-only setups hide coordination failures, with verifiers approving 49% of failing work and teams sometimes harming performance when solo agents already succeed.

  16. Can RL Teach Long-Horizon Reasoning to LLMs? Expressiveness Is Key

    cs.AI 2026-05 unverdicted novelty 7.0

    RL training compute for logical reasoning follows a power law in proof depth whose exponent rises with logic expressiveness, and more expressive training yields larger gains on downstream benchmarks.

  17. Can RL Teach Long-Horizon Reasoning to LLMs? Expressiveness Is Key

    cs.AI 2026-05 unverdicted novelty 7.0

    RL training on more expressive logical tasks follows a steeper power-law scaling with reasoning depth and transfers more efficiently to math and reasoning benchmarks.

  18. POSTCONDBENCH: Benchmarking Correctness and Completeness in Formal Postcondition Inference

    cs.SE 2026-05 unverdicted novelty 7.0

    POSTCONDBENCH is a new multilingual benchmark that evaluates LLM postcondition generation on real code using defect discrimination to assess completeness beyond surface matching.

  19. ARIADNE: Agentic Reward-Informed Adaptive Decision Exploration via Blackboard-Driven MCTS for Competitive Program Generation

    cs.SE 2026-05 unverdicted novelty 7.0

    ARIADNE combines blackboard architecture with MCTS to coordinate strategy, code, test, evaluation, and repair stages, yielding higher Pass@1 scores than prior LLM baselines on APPS, CodeContests, and related benchmarks.

  20. MAD-OPD: Breaking the Ceiling in On-Policy Distillation via Multi-Agent Debate

    cs.CL 2026-05 unverdicted novelty 7.0

    MAD-OPD recasts on-policy distillation teachers as a debating collective to supply better supervision, lifting agentic and code performance over single-teacher OPD across multiple model sizes.

  21. ResRL: Boosting LLM Reasoning via Negative Sample Projection Residual Reinforcement Learning

    cs.LG 2026-05 unverdicted novelty 7.0

    ResRL decouples shared semantics between positive and negative responses in LLM reinforcement learning via SVD-based projection residuals, outperforming baselines including NSR by up to 9.4% on math reasoning benchmarks.

  22. Token Arena: A Continuous Benchmark Unifying Energy and Cognition in AI Inference

    cs.AI 2026-05 unverdicted novelty 7.0

    TokenArena is a continuous benchmark for AI inference endpoints that measures output speed, time to first token, blended price, effective context, quality, and modeled energy to produce composites of joules per correc...

  23. ClassEval-Pro: A Cross-Domain Benchmark for Class-Level Code Generation

    cs.SE 2026-04 unverdicted novelty 7.0

    ClassEval-Pro benchmark shows frontier LLMs achieve at most 45.6% Pass@1 on class-level code tasks, with logic errors (56%) and dependency errors (38%) as dominant failure modes.

  24. When Prompt Under-Specification Improves Code Correctness: An Exploratory Study of Prompt Wording and Structure Effects on LLM-Based Code Generation

    cs.SE 2026-04 unverdicted novelty 7.0

    Structurally rich task descriptions make LLMs robust to prompt under-specification, and under-specification can enhance code correctness by disrupting misleading lexical or structural cues.

  25. Incisor: Ex Ante Cloud Instance Selection for HPC Jobs

    cs.DC 2026-04 unverdicted novelty 7.0

    Incisor uses program analysis and frontier LLMs to select working AWS EC2 instances ex ante for 100% of first-time HPC runs of C/C++/Fortran and Python codes, cutting runtime 54% and costs 44% versus an expert-constra...

  26. OptiVerse: A Comprehensive Benchmark towards Optimization Problem Solving

    cs.CL 2026-04 unverdicted novelty 7.0

    OptiVerse is a new benchmark spanning neglected optimization domains that shows LLMs suffer sharp accuracy drops on hard problems due to modeling and logic errors, with a Dual-View Auditor Agent proposed to improve pe...

  27. Assessing the Impact of Requirement Ambiguity on LLM-based Function-Level Code Generation

    cs.SE 2026-04 conditional novelty 7.0

    Orchid benchmark shows requirement ambiguity degrades LLM code generation performance across all models, with advanced models hit hardest, and LLMs rarely detect or resolve the ambiguity themselves.

  28. Super Apriel: One Checkpoint, Many Speeds

    cs.LG 2026-04 unverdicted novelty 7.0

    A single 15B supernet checkpoint supports runtime switching between attention mixer placements for multiple decode speed presets while retaining 77-96% quality relative to the teacher model.

  29. Self-Consistency from Only Two Samples: CoT-PoT Ensembling for Efficient LLM Reasoning

    cs.CL 2026-04 unverdicted novelty 7.0

    CoT-PoT ensembling achieves self-consistency accuracy in LLMs with only two samples for 78.6% of tasks, reducing computation by 9.3x compared to standard methods.

  30. Rethinking the Comparison Unit in Sequence-Level Reinforcement Learning: An Equal-Length Paired Training Framework from Loss Correction to Sample Construction

    cs.LG 2026-04 unverdicted novelty 7.0

    EqLen is a sample-construction framework that builds equal-length paired segments via dual-track generation and masking for stable group-relative RL in sequences, reframing the length problem as a comparison-unit issu...

  31. CodeSpecBench: Benchmarking LLMs for Executable Behavioral Specification Generation

    cs.SE 2026-04 accept novelty 7.0

    CodeSpecBench shows LLMs achieve at most 20.2% pass rate on repository-level executable behavioral specification generation, revealing that strong code generation does not imply deep semantic understanding.

  32. Controllable and Verifiable Tool-Use Data Synthesis for Agentic Reinforcement Learning

    cs.AI 2026-04 unverdicted novelty 7.0

    COVERT generates verifiable synthetic tool-use environments for RL by validated trajectory synthesis and oracle-preserving augmentations, improving tool-use accuracy on BFCL v3 and ACEBench while remaining complementa...

  33. HiL-Bench (Human-in-Loop Benchmark): Do Agents Know When to Ask for Help?

    cs.AI 2026-04 unverdicted novelty 7.0

    HiL-Bench shows frontier AI agents fail to ask for help on incomplete tasks, recovering only a fraction of full-information performance, but RL training on Ask-F1 reward improves judgment and transfers across domains.

  34. Generate, Filter, Control, Replay: A Comprehensive Survey of Rollout Strategies for LLM Reinforcement Learning

    cs.LG 2026-04 unverdicted novelty 7.0

    This survey introduces the Generate-Filter-Control-Replay (GFCR) taxonomy to structure rollout pipelines for RL-based post-training of reasoning LLMs.

  35. ALTO: Adaptive LoRA Tuning and Orchestration for Heterogeneous LoRA Training Workloads

    cs.LG 2026-04 unverdicted novelty 7.0

    ALTO accelerates LoRA tuning up to 13.8x by monitoring loss trajectories for early stopping, using fused grouped GEMM with rank-local adapter parallelism, and combining intra- and inter-task scheduling for heterogeneo...

  36. Evaluating the Formal Reasoning Capabilities of Large Language Models through Chomsky Hierarchy

    cs.CL 2026-04 unverdicted novelty 7.0

    LLMs display clear performance stratification on formal language tasks aligned with Chomsky hierarchy complexity levels, limited by severe efficiency barriers rather than absolute capability.

  37. Think Anywhere in Code Generation

    cs.SE 2026-03 unverdicted novelty 7.0

    Think-Anywhere lets LLMs invoke on-demand reasoning at any token during code generation via cold-start imitation followed by outcome-based RL, reaching state-of-the-art results on LeetCode, LiveCodeBench, HumanEval, and MBPP.

  38. BACE: LLM-based Code Generation through Bayesian Anchored Co-Evolution of Code and Test Populations

    cs.NE 2026-03 unverdicted novelty 7.0

    BACE reformulates LLM code synthesis as Bayesian co-evolution of code and test populations anchored on minimal public examples, achieving superior performance on LiveCodeBench v6.

  39. LongTail Driving Scenarios with Reasoning Traces: The KITScenes LongTail Dataset

    cs.CV 2026-03 unverdicted novelty 7.0

    KITScenes LongTail supplies multimodal driving data and multilingual expert reasoning traces to benchmark models on rare scenarios beyond basic safety metrics.

  40. EvoESAP: Non-Uniform Expert Pruning for Sparse MoE

    cs.LG 2026-03 conditional novelty 7.0

    EvoESAP uses evolutionary search guided by a speculative-decoding-inspired ESAP metric to discover non-uniform layer-wise sparsity allocations for MoE expert pruning, improving generation accuracy up to 19.6% at 50% sparsity.

  41. Vibe Code Bench: Evaluating AI Models on End-to-End Web Application Development

    cs.SE 2026-03 unverdicted novelty 7.0

    Vibe Code Bench evaluates AI models on building complete web applications from specs, with the best of 16 models achieving 61.8% accuracy on the test split using autonomous browser evaluation.

  42. DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model

    cs.CL 2024-05 unverdicted novelty 7.0

    DeepSeek-V2 delivers top-tier open-source LLM performance using only 21B active parameters by compressing the KV cache 93.3% and cutting training costs 42.5% via MLA and DeepSeekMoE.

  43. Prefix Teach, Suffix Fade: Local Teachability Collapse in Strong-to-Weak On-Policy Distillation

    cs.CL 2026-05 unverdicted novelty 6.0

    Local teachability collapse in trajectory suffixes makes uniform dense supervision suboptimal in strong-to-weak OPD; truncating at BIC-style change points on teacher margin improves performance.

  44. Orthrus: Memory-Efficient Parallel Token Generation via Dual-View Diffusion

    cs.LG 2026-05 unverdicted novelty 6.0

    Orthrus unifies autoregressive and diffusion views on a shared KV cache to deliver lossless parallel token generation with up to 7.8x speedup and O(1) memory overhead.

  45. Scalable Token-Level Hallucination Detection in Large Language Models

    cs.CL 2026-05 unverdicted novelty 6.0

    TokenHD uses a scalable data synthesis engine and importance-weighted training to create token-level hallucination detectors that work on free-form text and scale from 0.6B to 8B parameters, outperforming larger reaso...

  46. SkillSafetyBench: Evaluating Agent Safety under Skill-Facing Attack Surfaces

    cs.CR 2026-05 unverdicted novelty 6.0

    SkillSafetyBench shows that localized non-user attacks via skills and artifacts can consistently induce unsafe agent behavior across domains and model backends, independent of user intent.

  47. Drop the Act: Probe-Filtered RL for Faithful Chain-of-Thought Reasoning

    cs.LG 2026-05 conditional novelty 6.0

    ProFIL trains an activation probe on a frozen base model to zero advantages on theatrical post-commitment rollouts in GRPO, cutting theater 11-100%, raising faithful fractions, and shortening chains 4-19% without accu...

  48. gwBenchmarks: Stress-Testing LLM Agents on High-Precision Gravitational Wave Astronomy

    gr-qc 2026-05 unverdicted novelty 6.0

    LLM coding agents cannot reach the 10^{-4} relative accuracy required for gravitational wave modeling tasks and show systematic failures including metric misuse and result fabrication.

  49. Internalizing Curriculum Judgment for LLM Reinforcement Fine-Tuning

    cs.LG 2026-05 unverdicted novelty 6.0

    METIS internalizes curriculum judgment in LLM reinforcement fine-tuning by predicting within-prompt reward variance via in-context learning and jointly optimizing with a self-judgment reward, yielding superior perform...

  50. Edit-Based Refinement for Parallel Masked Diffusion Language Models

    cs.CL 2026-05 unverdicted novelty 6.0

    ME-DLM augments parallel masked diffusion models with edit-distance-supervised refinements to raise quality on coding and math benchmarks while using far fewer diffusion steps.

  51. DARE: Difficulty-Adaptive Reinforcement Learning with Co-Evolved Difficulty Estimation

    cs.LG 2026-05 unverdicted novelty 6.0

    DARE co-evolves difficulty estimation and policy in RL for LLMs to improve training efficiency, final performance, and inference speed by using tailored strategies for different difficulty levels.

  52. VeriContest: A Competitive-Programming Benchmark for Verifiable Code Generation

    cs.SE 2026-05 unverdicted novelty 6.0

    VeriContest supplies 946 problems with specs, code, proofs, and tests to benchmark verifiable code generation in Rust/Verus, showing models reach 92% on code but only 5% end-to-end on full verifiable synthesis.

  53. Coding Agents Don't Know When to Act

    cs.SE 2026-05 unverdicted novelty 6.0

    Coding agents exhibit action bias by proposing undesirable changes on already-fixed issues 35-65% of the time, and explicit reproduction instructions only partially mitigate this while creating new abstention errors.

  54. HTPO: Towards Exploration-Exploitation Balanced Policy Optimization via Hierarchical Token-level Objective Control

    cs.LG 2026-05 unverdicted novelty 6.0

    HTPO introduces hierarchical token-level objective control in RLVR to balance exploration and exploitation by grouping tokens according to difficulty, correctness, and entropy, yielding up to 8.6% gains on AIME benchm...

  55. Confidence-Aware Alignment Makes Reasoning LLMs More Reliable

    cs.AI 2026-05 unverdicted novelty 6.0

    CASPO trains LLMs via iterative direct preference optimization so that token-level confidence tracks step-wise correctness, then applies Confidence-aware Thought pruning at inference to improve both reliability and sp...

  56. Learning Agent Routing From Early Experience

    cs.CL 2026-05 unverdicted novelty 6.0

    BoundaryRouter routes queries to LLM or agent using early experience memory from a seed set, cutting inference time 60.6% versus always using agents and raising performance 28.6% versus always using direct LLM inference.

  57. Teaching Thinking Models to Reason with Tools: A Full-Pipeline Recipe for Tool-Integrated Reasoning

    cs.CL 2026-05 unverdicted novelty 6.0

    A training recipe for tool-integrated reasoning models achieves state-of-the-art open-source results on math benchmarks such as 96.7% and 99.2% on AIME 2025 at 4B and 30B scales by balancing tool-use trajectories and ...

  58. Uno-Orchestra: Parsimonious Agent Routing via Selective Delegation

    cs.AI 2026-05 unverdicted novelty 6.0

    A learned orchestration policy for LLM agents that jointly optimizes task decomposition and selective routing to (model, primitive) pairs, delivering 77% macro pass@1 at 10x lower cost than strong baselines across 13 ...

  59. Uni-OPD: Unifying On-Policy Distillation with a Dual-Perspective Recipe

    cs.LG 2026-05 unverdicted novelty 6.0

    Uni-OPD unifies on-policy distillation across LLMs and MLLMs with dual-perspective strategies that promote student exploration and enforce order-consistent teacher supervision based on outcome rewards.

  60. Improving LLM Code Generation via Requirement-Aware Curriculum Reinforcement Learning

    cs.SE 2026-05 unverdicted novelty 6.0

    REC RL improves LLM code generation by automatically assessing and optimizing requirement difficulty with adaptive curriculum sampling, yielding 1.23-5.62% Pass@1 gains over baselines.

Reference graph

Works this paper leans on

257 extracted references · 200 canonical work pages · cited by 115 Pith papers · 47 internal anchors
