pith. machine review for the scientific record.

arxiv: 2401.14196 · v2 · submitted 2024-01-25 · 💻 cs.SE · cs.CL · cs.LG

Recognition: 3 theorem links · Lean Theorem

DeepSeek-Coder: When the Large Language Model Meets Programming -- The Rise of Code Intelligence

Daya Guo, Dejian Yang, Fuli Luo, Guanting Chen, Kai Dong, Qihao Zhu, Wenfeng Liang, Wentao Zhang, Xiao Bi, Yingfei Xiong, Y.K. Li, Y. Wu, Zhenda Xie

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 17:19 UTC · model grok-4.3

classification 💻 cs.SE · cs.CL · cs.LG
keywords DeepSeek-Coder · code intelligence · large language models · open-source models · code generation · programming benchmarks · fill-in-the-blank training · project-level code corpus

The pith

Open-source code models trained on 2 trillion tokens surpass Codex and GPT-3.5 on programming benchmarks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces the DeepSeek-Coder series of open-source models sized from 1.3B to 33B parameters. These models are trained from scratch on a high-quality project-level code corpus using a fill-in-the-blank task with a 16K context window to improve generation and infilling. Evaluations across multiple benchmarks show they set new records for open-source code models while exceeding closed-source systems such as Codex and GPT-3.5. A sympathetic reader would care because the work removes the access barrier posed by closed-source models, making high-performance code intelligence available for research and commercial use under a permissive license.

Core claim

We present the DeepSeek-Coder series of open-source code models with sizes from 1.3B to 33B, trained from scratch on 2 trillion tokens. Pre-trained on a high-quality project-level code corpus and employing a fill-in-the-blank task with a 16K window, the models achieve state-of-the-art performance among open-source code models and surpass existing closed-source models like Codex and GPT-3.5 across multiple benchmarks. The models are released under a permissive license that allows both research and unrestricted commercial use.

What carries the argument

The DeepSeek-Coder series, built by pre-training on a high-quality project-level code corpus and applying a 16K-context fill-in-the-blank objective that strengthens code generation and infilling.
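
As a concrete illustration of that objective, here is a minimal sketch of how one fill-in-the-middle training example can be assembled in the prefix-suffix-middle layout described by Bavarian et al. (reference [2] below). The sentinel strings, the character-level split, and the single-span hole are illustrative assumptions; the model's actual special tokens, split granularity, and sampling rate for this objective are not specified in this summary.

    import random

    # Placeholder sentinels: the model's real special tokens would come from its tokenizer.
    FIM_PREFIX, FIM_SUFFIX, FIM_MIDDLE = "<fim_prefix>", "<fim_suffix>", "<fim_middle>"

    def make_fim_example(code: str, rng: random.Random) -> str:
        """Build one prefix-suffix-middle (PSM) training string from a source file.

        The model is trained with ordinary next-token prediction on this string,
        so at inference time it can infill the missing middle from surrounding code.
        """
        i, j = sorted(rng.sample(range(len(code) + 1), 2))
        prefix, middle, suffix = code[:i], code[i:j], code[j:]
        return f"{FIM_PREFIX}{prefix}{FIM_SUFFIX}{suffix}{FIM_MIDDLE}{middle}"

    example = make_fim_example("def add(a, b):\n    return a + b\n", random.Random(0))

Long documents would then be packed and truncated to the 16K-token window after this transformation; that step is omitted here.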

If this is right

  • Researchers can freely study, modify, and build upon models that match or exceed leading closed-source code systems.
  • Commercial developers can integrate the models into products without licensing fees or usage restrictions.
  • The demonstrated gains from project-level data and long-context infilling training provide a concrete recipe others can replicate or extend.
  • Continued open development of these models can accelerate progress in automated programming assistance.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Training on complete projects instead of isolated functions may be necessary for models to manage the dependencies found in actual software systems (a minimal ordering sketch follows this list).
  • The permissive license could encourage community-driven refinements that mirror the evolution of open-source software ecosystems.
  • Further increases in context length beyond 16K tokens could yield additional improvements when models process entire codebases.
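
To make the first point operational: project-level training samples can be formed by ordering a repository's files so that dependencies precede the files that use them, then concatenating the contents. The sketch below assumes a pre-computed file-to-imports map; the paper's own dependency parsing and ordering rules are not detailed in this summary.

    from graphlib import TopologicalSorter

    def order_repo_files(deps: dict[str, set[str]]) -> list[str]:
        """Order files so each appears after the in-repo files it imports.

        deps maps a file path to the set of repository files it depends on.
        """
        return list(TopologicalSorter(deps).static_order())

    # Example: c.py imports b.py, which imports a.py -> a.py, b.py, c.py
    order = order_repo_files({"a.py": set(), "b.py": {"a.py"}, "c.py": {"b.py"}})
    # A project-level sample would then concatenate file contents in this order,
    # e.g. "\n".join(f"# {path}\n{open(path).read()}" for path in order)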

Load-bearing premise

The chosen benchmarks and evaluation protocol give a fair and generalizable measure of code intelligence that extends beyond the specific test sets used.

What would settle it

An independent evaluation on a fresh collection of real-world coding tasks drawn from projects completed after the training data cutoff, showing DeepSeek-Coder underperforming GPT-3.5, would falsify the superiority claim.

read the original abstract

The rapid development of large language models has revolutionized code intelligence in software development. However, the predominance of closed-source models has restricted extensive research and development. To address this, we introduce the DeepSeek-Coder series, a range of open-source code models with sizes from 1.3B to 33B, trained from scratch on 2 trillion tokens. These models are pre-trained on a high-quality project-level code corpus and employ a fill-in-the-blank task with a 16K window to enhance code generation and infilling. Our extensive evaluations demonstrate that DeepSeek-Coder not only achieves state-of-the-art performance among open-source code models across multiple benchmarks but also surpasses existing closed-source models like Codex and GPT-3.5. Furthermore, DeepSeek-Coder models are under a permissive license that allows for both research and unrestricted commercial use.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper introduces the DeepSeek-Coder series of open-source code LLMs (1.3B to 33B parameters) trained from scratch on 2 trillion tokens of high-quality project-level code data, using a fill-in-the-blank task with a 16K context window. It claims these models achieve state-of-the-art performance among open-source code models across multiple benchmarks and surpass closed-source models such as Codex and GPT-3.5, while being released under a permissive license.

Significance. If the reported performance gains hold after accounting for evaluation details and data contamination risks, this would represent a meaningful advance in open code intelligence, providing accessible high-performing models that could support broader research and commercial applications in software development.

major comments (2)
  1. [Evaluation section] The central claim that DeepSeek-Coder surpasses closed-source models like Codex and GPT-3.5 (and achieves SOTA among open models) is asserted without any reported details on the specific benchmarks (e.g., HumanEval, MBPP), data splits, pass@k computation, number of runs, error bars, or exact comparison methodology. This absence makes the headline empirical results unverifiable from the manuscript (a minimal pass@k sketch follows these comments).
  2. [Training data section] The 2-trillion-token project-level corpus is described at a high level with no quantitative decontamination statistics, overlap analysis, or membership-inference results relative to standard code benchmarks. Without this, performance improvements cannot be confidently distinguished from potential memorization or leakage, which is load-bearing for the generalization claim.
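
For the first comment, the quantity at issue has a standard definition: the unbiased pass@k estimator introduced alongside HumanEval (Chen et al., reference [3] below). A minimal sketch is given here for orientation; whether DeepSeek-Coder's evaluation uses this exact estimator, and with what sample count n and decoding temperature, is the detail the referee asks to have reported.

    import numpy as np

    def pass_at_k(n: int, c: int, k: int) -> float:
        """Unbiased pass@k: 1 - C(n - c, k) / C(n, k), computed as a stable product.

        n: samples generated per problem; c: samples passing all unit tests.
        """
        if n - c < k:
            return 1.0
        return float(1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

    # 200 samples with 37 passing gives pass@1 = 37/200 = 0.185
    score = pass_at_k(n=200, c=37, k=1)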

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. We have reviewed the major comments carefully and provide point-by-point responses below. We will incorporate revisions to address the concerns about evaluation details and training data analysis.

read point-by-point responses
  1. Referee: [Evaluation section] The central claim that DeepSeek-Coder surpasses closed-source models like Codex and GPT-3.5 (and achieves SOTA among open models) is asserted without any reported details on the specific benchmarks (e.g., HumanEval, MBPP), data splits, pass@k computation, number of runs, error bars, or exact comparison methodology. This absence makes the headline empirical results unverifiable from the manuscript.

    Authors: We appreciate this observation on verifiability. The Evaluation section of the manuscript reports results across benchmarks including HumanEval and MBPP using the pass@k metric and provides comparisons to Codex and GPT-3.5 based on their published numbers. However, we agree that additional explicit details would strengthen the presentation. In the revised manuscript, we will expand the section to clearly specify the data splits (standard test sets), the exact pass@k computation method, the number of runs performed, inclusion of error bars or variance measures, and the precise comparison protocol. These changes will make the empirical results more transparent without altering the reported outcomes. revision: yes

  2. Referee: [Training data section] The 2-trillion-token project-level corpus is described at a high level with no quantitative decontamination statistics, overlap analysis, or membership-inference results relative to standard code benchmarks. Without this, performance improvements cannot be confidently distinguished from potential memorization or leakage, which is load-bearing for the generalization claim.

    Authors: We agree that quantitative decontamination evidence is essential to support claims of generalization. The Training data section describes the project-level corpus construction and quality filtering at a high level. To address this, we will add a dedicated analysis subsection in the revised manuscript that includes overlap statistics with standard benchmarks such as HumanEval and MBPP. We will also discuss the implications for potential leakage and why the scale and diversity of the corpus support generalization. Membership inference was not performed in the original work; we will note this limitation explicitly and suggest it as future work. revision: yes
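
As an illustration of the overlap analysis promised above, a minimal n-gram decontamination check of the kind commonly run between pretraining corpora and benchmarks is sketched below. The whitespace tokenization and the window length n = 10 are illustrative assumptions, not the paper's reported filter.

    def ngrams(tokens: list[str], n: int) -> set[tuple[str, ...]]:
        """All contiguous n-grams of a token sequence."""
        return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

    def is_contaminated(train_doc: str, benchmark_texts: list[str], n: int = 10) -> bool:
        """Flag a training document sharing any n-gram with a benchmark prompt
        or reference solution, so it can be dropped or its overlap reported."""
        doc_grams = ngrams(train_doc.split(), n)
        return any(doc_grams & ngrams(text.split(), n) for text in benchmark_texts)

    # Reportable statistic: fraction of corpus documents flagged against a benchmark
    # rate = sum(is_contaminated(d, benchmark_texts) for d in corpus) / len(corpus)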

Circularity Check

0 steps flagged

No circularity: purely empirical benchmark reporting

full rationale

The paper presents an empirical training run (2T tokens, project-level corpus, 16K fill-in-the-blank objective) followed by direct reporting of pass@k scores on standard benchmarks (HumanEval, MBPP, etc.). No equations, derivations, or parameter-fitting steps are described that would reduce the headline performance claims back to quantities defined on the same evaluation data. No self-citation chains, uniqueness theorems, or ansatzes are invoked to support the central results; the claims rest on external, publicly known benchmark protocols rather than internal redefinitions.

Axiom & Free-Parameter Ledger

3 free parameters · 2 axioms · 0 invented entities

The performance claims rest on the unverified quality of the proprietary project-level code corpus, the assumption that standard code benchmarks measure general code intelligence, and the effectiveness of the fill-in-the-blank objective at 16K context.

free parameters (3)
  • Model parameter counts (1.3B-33B)
    Chosen to span a range of scales for comparison.
  • Training corpus size (2 trillion tokens)
    Large scale selected to achieve high performance.
  • Context window length (16K tokens)
    Set to support project-level code understanding.
axioms (2)
  • domain assumption Pretraining on high-quality project-level code improves code generation and infilling
    Invoked to justify the corpus and fill-in-the-blank task choice.
  • domain assumption Standard code benchmarks are valid proxies for real-world code intelligence
    Required for the SOTA and closed-model comparison claims.

pith-pipeline@v0.9.0 · 5490 in / 1529 out tokens · 56879 ms · 2026-05-10T17:19:37.367799+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. AI-Assisted Peer Review at Scale: The AAAI-26 AI Review Pilot

    cs.AI 2026-04 conditional novelty 9.0

    AI reviews for all 22,977 AAAI-26 papers were preferred by authors and PC members over human reviews on accuracy and suggestions and outperformed baselines at spotting weaknesses.

  2. Hackers or Hallucinators? A Comprehensive Analysis of LLM-Based Automated Penetration Testing

    cs.CR 2026-04 unverdicted novelty 8.0

    The first SoK on LLM-based AutoPT frameworks provides a six-dimension taxonomy of agent designs and a unified empirical benchmark evaluating 15 frameworks via over 10 billion tokens and 1,500 manually reviewed logs.

  3. StepCodeReasoner: Aligning Code Reasoning with Stepwise Execution Traces via Reinforcement Learning

    cs.SE 2026-05 unverdicted novelty 7.0

    StepCodeReasoner aligns code reasoning with verifiable stepwise execution traces via print anchors and bi-level GRPO reinforcement learning, reaching SOTA results on CRUXEval (91.1%) and LiveCodeBench (86.5%) for a 7B model.

  4. MeshFIM: Local Low-Poly Mesh Editing via Fill-in-the-Middle Autoregressive Generation

    cs.GR 2026-05 unverdicted novelty 7.0

    MeshFIM enables local low-poly mesh editing by autoregressively filling target regions conditioned on context, using boundary markers, positional embeddings, and a gated geometry encoder to enforce attachment, topolog...

  5. Beyond Negative Rollouts: Positive-Only Policy Optimization with Implicit Negative Gradients

    cs.CL 2026-05 unverdicted novelty 7.0

    POPO uses bounded importance sampling on positive rollouts and a siamese policy network to achieve implicit negative gradients and stable optimization, matching or exceeding GRPO on math benchmarks such as 36.67% on A...

  6. Delta-Based Neural Architecture Search: LLM Fine-Tuning via Code Diffs

    cs.LG 2026-05 unverdicted novelty 7.0

    Fine-tuned 7B LLMs generating unified diffs for neural architecture refinement achieve 66-75% valid rates and 64-66% mean first-epoch accuracy, outperforming full-generation baselines by large margins while cutting ou...

  7. Beyond Translation Accuracy: Addressing False Failures in LLM-Based Code Translation

    cs.SE 2026-05 unverdicted novelty 7.0

    Many reported failures in LLM-based code translation are false negatives due to evaluation pipeline issues such as improper compilation flags, missing library links, and unconfigured runtime environments rather than i...

  8. PuzzleMark: Implicit Jigsaw Learning for Robust Code Dataset Watermarking in Neural Code Completion Models

    cs.SE 2026-04 unverdicted novelty 7.0

    PuzzleMark provides a robust and imperceptible watermarking method for code datasets using adaptive variable name concatenation and statistical verification, achieving perfect detection rates with minimal performance impact.

  9. RepoDoc: A Knowledge Graph-Based Framework to Automatic Documentation Generation and Incremental Updates

    cs.SE 2026-04 unverdicted novelty 7.0

    RepoDoc uses a repository knowledge graph with module clustering and semantic impact propagation to generate more complete documentation 3x faster with 85% fewer tokens and handle incremental updates 73% faster than p...

  10. When Prompt Under-Specification Improves Code Correctness: An Exploratory Study of Prompt Wording and Structure Effects on LLM-Based Code Generation

    cs.SE 2026-04 unverdicted novelty 7.0

    Structurally rich task descriptions make LLMs robust to prompt under-specification, and under-specification can enhance code correctness by disrupting misleading lexical or structural cues.

  11. Aligned Multi-View Scripts for Universal Chart-to-Code Generation

    cs.CL 2026-04 unverdicted novelty 7.0

    Introduces an aligned multi-language dataset and a language-conditioned low-rank adapter for generating executable plotting code in Python, R, and LaTeX from chart images.

  12. PlayCoder: Making LLM-Generated GUI Code Playable

    cs.SE 2026-04 conditional novelty 7.0

    PlayCoder raises the rate of LLM-generated GUI apps that can be played end-to-end without logic errors from near zero to 20.3% Play@3 by adding repository-aware generation, agent-driven testing, and iterative repair.

  13. Cascaded Code Editing: Large-Small Model Collaboration for Effective and Efficient Code Editing

    cs.SE 2026-04 unverdicted novelty 7.0

    A cascaded large-small model system generates edit sketches with the large model and applies them with the small model to make code editing both accurate and token-efficient.

  14. IG-Search: Step-Level Information Gain Rewards for Search-Augmented Reasoning

    cs.AI 2026-04 unverdicted novelty 7.0

    IG-Search computes step-level information gain rewards from policy probabilities to improve credit assignment in RL training for search-augmented QA, yielding 1.6-point gains over trajectory-level baselines on multi-h...

  15. Evaluating LLMs Code Reasoning Under Real-World Context

    cs.SE 2026-04 unverdicted novelty 7.0

    R2Eval is a new benchmark with 135 real-world code reasoning problems from Python projects that preserves complex data structures for more realistic LLM evaluation.

  16. Structural Anchors and Reasoning Fragility: Understanding CoT Robustness in LLM4Code

    cs.SE 2026-04 unverdicted novelty 7.0

    CoT prompting in LLM4Code shows mixed robustness that depends on model family, task structure, and perturbations destabilizing structural anchors, leading to trajectory deformations like lengthening, branching, and si...

  17. An Iterative Test-and-Repair Framework for Competitive Code Generation

    cs.SE 2026-04 unverdicted novelty 7.0

    FixAudit improves LLM code generation on competitive programming benchmarks by training a shared model for iterative code-aware test generation and repair, achieving 35%+ gains in Pass@1 over baselines on the same 7B model.

  18. Think Anywhere in Code Generation

    cs.SE 2026-03 unverdicted novelty 7.0

    Think-Anywhere lets LLMs invoke on-demand reasoning at any token during code generation via cold-start imitation followed by outcome-based RL, reaching state-of-the-art results on LeetCode, LiveCodeBench, HumanEval, and MBPP.

  19. SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering

    cs.SE 2024-05 unverdicted novelty 7.0

    SWE-agent introduces a custom agent-computer interface that lets LM agents solve software engineering tasks, reaching 12.5% pass@1 on SWE-bench and 87.7% on HumanEvalFix, exceeding prior non-interactive approaches.

  20. Revisiting DAgger in the Era of LLM-Agents

    cs.LG 2026-05 conditional novelty 6.0

    DAgger-style training with turn-level policy interpolation raises 4B and 8B LLM agents to 27.3% and 29.8% on SWE-bench Verified, beating several larger published systems.

  21. Uncertainty Quantification for LLM-based Code Generation

    cs.SE 2026-05 unverdicted novelty 6.0

    RisCoSet applies multiple hypothesis testing to construct risk-controlling partial-program prediction sets for LLM code generation, achieving up to 24.5% less code removal than prior methods at equivalent risk levels.

  22. Securing the Dark Matter: A Semantic-Enhanced Neuro-Symbolic Framework for Supply Chain Analysis of Opaque Industrial Software

    cs.SE 2026-05 unverdicted novelty 6.0

    A neuro-symbolic framework reconstructs semantics from opaque binaries via abstract interpretation, reflexive LLM prompting, typed knowledge graphs, and Graphormer reasoning to outperform baselines in vulnerability de...

  23. SynConfRoute: Syntax-Aware Routing for Efficient Code Completion with Small CodeLLMs

    cs.SE 2026-05 unverdicted novelty 6.0

    SynConfRoute routes code completions using syntax validation and token confidence, improving pass@1 by up to 31% on hard tasks and reducing accelerator usage by 58% versus always using the largest model.

  24. Bridging Generation and Training: A Systematic Review of Quality Issues in LLMs for Code

    cs.SE 2026-05 accept novelty 6.0

    A review of 114 studies creates taxonomies for code and data quality issues, formalizes 18 propagation mechanisms from training data defects to LLM-generated code defects, and synthesizes detection and mitigation techniques.

  25. LLM Ghostbusters: Surgical Hallucination Suppression via Adaptive Unlearning

    cs.CR 2026-05 unverdicted novelty 6.0

    Adaptive Unlearning suppresses package hallucinations in code-generating LLMs by 81% while preserving benchmark performance, using model-generated data and no human labels.

  26. Improving LLM Code Generation via Requirement-Aware Curriculum Reinforcement Learning

    cs.SE 2026-05 unverdicted novelty 6.0

    REC RL improves LLM code generation by automatically assessing and optimizing requirement difficulty with adaptive curriculum sampling, yielding 1.23-5.62% Pass@1 gains over baselines.

  27. Defective Task Descriptions in LLM-Based Code Generation: Detection and Analysis

    cs.SE 2026-04 conditional novelty 6.0

    SpecValidator detects lexical vagueness, under-specification, and syntax-formatting defects in LLM code-generation prompts with F1 0.804, outperforming GPT-5-mini and Claude Sonnet 4, and shows that under-specificatio...

  28. Leveraging LLMs for Multi-File DSL Code Generation: An Industrial Case Study

    cs.SE 2026-04 unverdicted novelty 6.0

    Fine-tuning 7B code LLMs on a custom multi-file DSL dataset achieves structural fidelity of 1.00, high exact-match accuracy, and practical utility validated by expert survey and execution checks.

  29. MEMCoder: Multi-dimensional Evolving Memory for Private-Library-Oriented Code Generation

    cs.SE 2026-04 unverdicted novelty 6.0

    MEMCoder boosts LLM code generation for private libraries by 16.31% pass@1 via a multi-dimensional evolving memory that distills usage guidelines from execution feedback and combines them with static docs.

  30. Do Synthetic Trajectories Reflect Real Reward Hacking? A Systematic Study on Monitoring In-the-Wild Hacking in Code Generation

    cs.LG 2026-04 unverdicted novelty 6.0

    Synthetic reward hacking data does not capture natural hacking behaviors in code generation RL, causing monitors trained on it to generalize poorly compared to those trained on in-the-wild trajectories.

  31. RealBench: A Repo-Level Code Generation Benchmark Aligned with Real-World Software Development Practices

    cs.SE 2026-04 unverdicted novelty 6.0

    RealBench is a new repo-level code generation benchmark that adds UML diagrams to natural language specs, showing LLMs struggle more at full repositories, create modules with errors, and perform best with whole-repo g...

  32. Hybrid Policy Distillation for LLMs

    cs.CL 2026-04 unverdicted novelty 6.0

    Hybrid Policy Distillation unifies existing knowledge distillation methods for LLMs into a reweighted log-likelihood objective and introduces a hybrid forward-reverse KL approach with mixed data sampling to improve st...

  33. PARM: Pipeline-Adapted Reward Model

    cs.AI 2026-04 unverdicted novelty 6.0

    PARM adapts reward models to multi-stage LLM pipelines via pipeline data and direct preference optimization, improving execution rate and solving accuracy on optimization benchmarks and showing transfer to GSM8K.

  34. CodePivot: Bootstrapping Multilingual Transpilation in LLMs via Reinforcement Learning without Parallel Corpora

    cs.SE 2026-04 unverdicted novelty 6.0

    CodePivot uses Python as a pivot language plus an Aggressive-Partial-Functional RL reward to train a 7B model that outperforms much larger LLMs on multilingual code transpilation without parallel corpora.

  35. Co-evolving Agent Architectures and Interpretable Reasoning for Automated Optimization

    cs.AI 2026-04 unverdicted novelty 6.0

    EvoOR-Agent co-evolves agent architectures as AOE-style networks with graph-mediated recombination and knowledge-base-assisted mutation to outperform fixed LLM pipelines on OR benchmarks.

  36. MATRIX: Multi-Layer Code Watermarking via Dual-Channel Constrained Parity-Check Encoding

    cs.CR 2026-04 unverdicted novelty 6.0

    MATRIX embeds multi-layer watermarks in LLM-generated code via dual-channel constrained parity-check encoding, achieving 99.2% detection accuracy with 0-0.14% functionality loss and 7.7-26.67% better attack robustness...

  37. On the Effectiveness of Context Compression for Repository-Level Tasks: An Empirical Investigation

    cs.SE 2026-04 unverdicted novelty 6.0

    Continuous latent-vector compression improves BLEU scores on repository-level code tasks by up to 28.3% at 4x compression while cutting inference latency.

  38. TOPCELL: Topology Optimization of Standard Cell via LLMs

    cs.LG 2026-04 unverdicted novelty 6.0

    TOPCELL reformulates standard cell topology optimization as an LLM generative task with GRPO fine-tuning, outperforming base models and matching exhaustive solvers with 85.91x speedup in 2nm/7nm industrial flows.

  39. CoDe-R: Refining Decompiler Output with LLMs via Rationale Guidance and Adaptive Inference

    cs.SE 2026-04 unverdicted novelty 6.0

    CoDe-R refines LLM decompiler output via rationale-guided semantic injection and dynamic fallback inference, making a 1.3B model the first to exceed 50% average re-executability on HumanEval-Decompile.

  40. Structured Safety Auditing for Balancing Code Correctness and Content Safety in LLM-Generated Code

    cs.SE 2026-04 unverdicted novelty 6.0

    Dual Reasoning with explicit safety audits improves the new SUDS metric by 1.32x to 3.42x over baselines on code generation benchmarks containing injected harmful keywords.

  41. Verify Before You Fix: Agentic Execution Grounding for Trustworthy Cross-Language Code Analysis

    cs.SE 2026-04 unverdicted novelty 6.0

    A framework combining universal AST normalization, hybrid graph-LLM embeddings, and strict execution-grounded validation achieves 89-92% intra-language accuracy and 74-80% cross-language F1 while resolving 70% of vuln...

  42. DuCodeMark: Dual-Purpose Code Dataset Watermarking via Style-Aware Watermark-Poison Design

    cs.CR 2026-04 unverdicted novelty 6.0

    DuCodeMark watermarks code datasets using AST style transformations and repressible poisons for both source-code and decompilation tasks, verified by t-test, with high stealth and a 28.6% performance drop if removed.

  43. Strix: Re-thinking NPU Reliability from a System Perspective

    cs.AR 2026-04 unverdicted novelty 6.0

    Strix delivers sub-microsecond fault localisation, detection, and correction on NPUs with 1.04x slowdown and minimal hardware cost by system-level re-partitioning and targeted safeguards.

  44. When LLMs Lag Behind: Knowledge Conflicts from Evolving APIs in Code Generation

    cs.SE 2026-04 unverdicted novelty 6.0

    LLMs produce executable code only 42.55% of the time under API evolution without full documentation, improving to 66.36% with structured docs and by 11% more with reasoning strategies, yet outdated patterns persist.

  45. REAgent: Requirement-Driven LLM Agents for Software Issue Resolution

    cs.SE 2026-04 unverdicted novelty 6.0

    REAgent improves LLM patch generation for software issues by 17.4% on average through automated construction, quality checking, and iterative refinement of structured issue-oriented requirements.

  46. PAFT: Preservation Aware Fine-Tuning for Minimal-Edit Program Repair

    cs.SE 2026-04 unverdicted novelty 6.0

    PAFT improves LLM-based program repair pass rates by up to 65.6% while cutting average edit distance by up to 32.6% through explicit preservation signals and curriculum training.

  47. Runtime Execution Traces Guided Automated Program Repair with Multi-Agent Debate

    cs.SE 2026-04 unverdicted novelty 6.0

    TraceRepair deploys a probe agent for runtime snapshots and a committee of agents for cross-verification to fix 392 defects on Defects4J, outperforming prior LLM-based automated program repair methods.

  48. A Taxonomy of Programming Languages for Code Generation

    cs.CL 2026-03 accept novelty 6.0

    The researchers provide a systematic 4-tier classification of 646 programming languages, quantifying the extreme data scarcity facing over 70% of the world's programming languages in the age of LLMs.

  49. Process Reinforcement through Implicit Rewards

    cs.LG 2025-02 conditional novelty 6.0

    PRIME enables online process reward model updates in LLM RL using implicit rewards from rollouts and outcome labels, yielding 15.1% average gains on reasoning benchmarks and surpassing a stronger instruct model with 1...

  50. LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code

    cs.SE 2024-03 unverdicted novelty 6.0

    LiveCodeBench collects 400 recent contest problems to create a contamination-free benchmark evaluating LLMs on code generation and related capabilities like self-repair and execution.

  51. StarCoder 2 and The Stack v2: The Next Generation

    cs.SE 2024-02 accept novelty 6.0

    StarCoder2-15B matches or beats CodeLlama-34B on code tasks despite being smaller, and StarCoder2-3B outperforms prior 15B models, with open weights and exact training data identifiers released.

  52. The Readability Spectrum: Patterns, Issues, and Prompt Effects in LLM-Generated Code

    cs.SE 2026-05 unverdicted novelty 5.0

    LLM-generated code matches human-written code in overall readability but exhibits different issue patterns, and prompt engineering has limited impact on improving it.

  53. How Does Chunking Affect Retrieval-Augmented Code Completion? A Controlled Empirical Study

    cs.SE 2026-05 conditional novelty 5.0

    Function-based chunking underperforms other strategies in RAG code completion by 3.57-5.64 points, with context length as the dominant factor.

  54. A Validated Prompt Bank for Malicious Code Generation: Separating Executable Weapons from Security Knowledge in 1,554 Consensus-Labeled Prompts

    cs.CR 2026-05 accept novelty 5.0

    The paper releases a 1,554-prompt consensus-labeled bank separating executable malicious code requests from security knowledge requests, validated by five-model majority labeling with Fleiss' kappa of 0.876.

  55. Beyond Translation Accuracy: Addressing False Failures in LLM-Based Code Translation

    cs.SE 2026-05 unverdicted novelty 5.0

    A large-scale study finds that many LLM code translation failures are false negatives due to improper evaluation configurations rather than incorrect translations.

  56. Learning Generalizable Multimodal Representations for Software Vulnerability Detection

    cs.SE 2026-04 unverdicted novelty 5.0

    MultiVul uses multimodal contrastive learning to align code and comment representations, yielding up to 27% F1 gains on vulnerability detection benchmarks over prompting and code-only baselines.

  57. PLMGH: What Matters in PLM-GNN Hybrids for Code Classification and Vulnerability Detection

    cs.SE 2026-04 unverdicted novelty 5.0

    Controlled experiments show PLM-GNN hybrids improve code tasks over GNN-only baselines, with PLM source having larger impact than GNN backbone.

  58. KISS Sorcar: A Stupidly-Simple General-Purpose and Software Engineering AI Assistant

    cs.SE 2026-04 unverdicted novelty 5.0

    KISS Sorcar introduces a simple layered agent framework and VS Code IDE that reaches 62.2% pass rate on Terminal Bench 2.0 by combining ReAct execution, summarization-based continuation, parallel tools, persistent his...

  59. VerilogCL: A Contrastive Learning Framework for Robust LLM-Based Verilog Generation

    cs.AR 2026-04 unverdicted novelty 5.0

    VerilogCL applies contrastive learning with minimal-error data pairs and a proactive screening module to improve compilation success and functional correctness of 7B LLM-generated Verilog over open-source and commerci...

  60. Understanding Secret Leakage Risks in Code LLMs: A Tokenization Perspective

    cs.CR 2026-04 unverdicted novelty 5.0

    BPE tokenization creates gibberish bias in CLLMs, causing secrets with high character entropy but low token entropy to be preferentially memorized due to training data distribution shifts.

Reference graph

Works this paper leans on

20 extracted references · 20 canonical work pages · cited by 73 Pith papers · 12 internal anchors

  1. [1]

    L. B. Allal, R. Li, D. Kocetkov, C. Mou, C. Akiki, C. M. Ferrandis, N. Muennighoff, M. Mishra, A. Gu, M. Dey, et al. Santacoder: don’t reach for the stars! arXiv preprint arXiv:2301.03988,

  2. [2]

    Efficient training of language models to fill in the middle

    M. Bavarian, H. Jun, N. Tezak, J. Schulman, C. McLeavey, J. Tworek, and M. Chen. Efficient training of language models to fill in the middle. arXiv preprint arXiv:2207.14255,

  3. [3]

    M. Chen, J. Tworek, H. Jun, Q. Yuan, H. P. d. O. Pinto, J. Kaplan, H. Edwards, Y. Burda, N. Joseph, G. Brockman, et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374,

  4. [4]

    S. Chen, S. Wong, L. Chen, and Y. Tian. Extending context window of large language models via positional interpolation. arXiv preprint arXiv:2306.15595,

  5. [5]

    Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge

    P. Clark, I. Cowhey, O. Etzioni, T. Khot, A. Sabharwal, C. Schoenick, and O. Tafjord. Think you have solved question answering? Try ARC, the AI2 Reasoning Challenge. arXiv preprint arXiv:1803.05457,

  6. [6]

    Training Verifiers to Solve Math Word Problems

    K. Cobbe, V. Kosaraju, M. Bavarian, M. Chen, H. Jun, L. Kaiser, M. Plappert, J. Tworek, J. Hilton, R. Nakano, et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168,

  7. [7]

    DeepSeek LLM: Scaling Open-Source Language Models with Longtermism

    DeepSeek-AI. Deepseek llm: Scaling open-source language models with longtermism. arXiv preprint arXiv:2401.02954,

  8. [8]

    InCoder: A Generative Model for Code Infilling and Synthesis

    D. Fried, A. Aghajanyan, J. Lin, S. Wang, E. Wallace, F. Shi, R. Zhong, W.-t. Yih, L. Zettlemoyer, and M. Lewis. Incoder: A generative model for code infilling and synthesis. arXiv preprint arXiv:2204.05999,

  9. [9]

    Z. Gou, Z. Shao, Y. Gong, Y. Yang, M. Huang, N. Duan, W. Chen, et al. Tora: A tool-integrated reasoning agent for mathematical problem solving. arXiv preprint arXiv:2309.17452,

  10. [10]

    Measuring Mathematical Problem Solving With the MATH Dataset

    D. Hendrycks, C. Burns, S. Kadavath, A. Arora, S. Basart, E. Tang, D. Song, and J. Steinhardt. Measuring mathematical problem solving with the math dataset. arXiv preprint arXiv:2103.03874,

  11. [11]

    R. Li, L. B. Allal, Y. Zi, N. Muennighoff, D. Kocetkov, C. Mou, M. Marone, C. Akiki, J. Li, J. Chim, et al. Starcoder: may the source be with you! arXiv preprint arXiv:2305.06161,

  12. [12]

    CodeGen: An Open Large Language Model for Code with Multi-Turn Program Synthesis

    E. Nijkamp, B. Pang, H. Hayashi, L. Tu, H. Wang, Y. Zhou, S. Savarese, and C. Xiong. Codegen: An open large language model for code with multi-turn program synthesis. arXiv preprint arXiv:2203.13474,

  13. [13]

    Are NLP Models really able to Solve Simple Math Word Problems?

    A. Patel, S. Bhattamishra, and N. Goyal. Are nlp models really able to solve simple math word problems? In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 2080–2094,

  14. [14]

    Code Llama: Open Foundation Models for Code

    B. Roziere, J. Gehring, F. Gloeckle, S. Sootla, I. Gat, X. E. Tan, Y. Adi, J. Liu, T. Remez, J. Rapin, et al. Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950,

  15. [15]

    Neural Machine Translation of Rare Words with Subword Units

    R. Sennrich, B. Haddow, and A. Birch. Neural machine translation of rare words with subword units. arXiv preprint arXiv:1508.07909,

  16. [16]

    Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them

    M. Suzgun, N. Scales, N. Schärli, S. Gehrmann, Y. Tay, H. W. Chung, A. Chowdhery, Q. V. Le, E. H. Chi, D. Zhou, and J. Wei. Challenging big-bench tasks and whether chain-of-thought can solve them. arXiv preprint arXiv:2210.09261,

  17. [17]

    Llama 2: Open Foundation and Fine-Tuned Chat Models

    H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y. Babaei, N. Bashlykov, S. Batra, P. Bhargava, S. Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288,

  18. [18]

    Y. Wang, W. Wang, S. Joty, and S. C. Hoi. Codet5: Identifier-aware unified pre-trained encoder-decoder models for code understanding and generation. arXiv preprint arXiv:2109.00859,
