super hub Canonical reference

DeepSeek-Coder: When the Large Language Model Meets Programming -- The Rise of Code Intelligence

Daya Guo, Dejian Yang, Kai Dong, Qihao Zhu, Wentao Zhang, Zhenda Xie · 2024 · cs.SE · arXiv 2401.14196

Canonical reference. 74% of citing Pith papers cite this work as background.

191 Pith papers citing it

Background 74% of classified citations

open full Pith review browse 191 citing papers more from Daya Guo arXiv PDF

abstract

The rapid development of large language models has revolutionized code intelligence in software development. However, the predominance of closed-source models has restricted extensive research and development. To address this, we introduce the DeepSeek-Coder series, a range of open-source code models with sizes from 1.3B to 33B, trained from scratch on 2 trillion tokens. These models are pre-trained on a high-quality project-level code corpus and employ a fill-in-the-blank task with a 16K window to enhance code generation and infilling. Our extensive evaluations demonstrate that DeepSeek-Coder not only achieves state-of-the-art performance among open-source code models across multiple benchmarks but also surpasses existing closed-source models like Codex and GPT-3.5. Furthermore, DeepSeek-Coder models are under a permissive license that allows for both research and unrestricted commercial use.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 30 method 6 dataset 2 baseline 1

citation-polarity summary

background 29 use method 6 use dataset 2 baseline 1 support 1

claims ledger

abstract The rapid development of large language models has revolutionized code intelligence in software development. However, the predominance of closed-source models has restricted extensive research and development. To address this, we introduce the DeepSeek-Coder series, a range of open-source code models with sizes from 1.3B to 33B, trained from scratch on 2 trillion tokens. These models are pre-trained on a high-quality project-level code corpus and employ a fill-in-the-blank task with a 16K window to enhance code generation and infilling. Our extensive evaluations demonstrate that DeepSeek-Coder

authors

Daya Guo Dejian Yang Kai Dong Qihao Zhu Wentao Zhang Zhenda Xie

co-cited works

representative citing papers

AI-Assisted Peer Review at Scale: The AAAI-26 AI Review Pilot

cs.AI · 2026-04-15 · conditional · novelty 9.0

AI reviews for all 22,977 AAAI-26 papers were preferred by authors and PC members over human reviews on accuracy and suggestions and outperformed baselines at spotting weaknesses.

Hackers or Hallucinators? A Comprehensive Analysis of LLM-Based Automated Penetration Testing

cs.CR · 2026-04-07 · unverdicted · novelty 8.0

The first SoK on LLM-based AutoPT frameworks provides a six-dimension taxonomy of agent designs and a unified empirical benchmark evaluating 15 frameworks via over 10 billion tokens and 1,500 manually reviewed logs.

Mitigating Package Hallucinations in Large Language Models via Model Editing

cs.SE · 2026-07-02 · unverdicted · novelty 7.0

BOUND refines LLMs' package-validity boundary via targeted editing to cut package hallucination rates by 79.9% on edit prompts and 65.4% on unseen prompts in recommendation tasks while generalizing to code generation.

Falsification, Not Exposure: An Internally Preregistered Placebo-Controlled Decomposition of Self-Repair Feedback in Frozen Small Code Models

cs.SE · 2026-06-30 · unverdicted · novelty 7.0

Preregistered placebo-controlled decomposition shows external executable counterexamples drive self-repair gains in small code models more than re-exposure or self-critique.

NL2Scratch: An Executable Benchmark and Evaluation for Block-Based Programming

cs.CL · 2026-06-20 · unverdicted · novelty 7.0

NL2Scratch supplies an executable benchmark of 311,648 NL-Scratch pairs and the SAC metric, showing LLMs with high lexical F1 often fail semantic alignment on actions, conditions, and numbers.

Mat-Pref: Verifiable-Reward Training Improves Compositional Reasoning in Inorganic Materials

cs.LG · 2026-06-20 · unverdicted · novelty 7.0

Mat-Pref benchmark shows GRPO after SFT lets Qwen3-8B reach 65-72% on compositional materials reasoning tasks, exceeding zero-shot 235B models on held-out structure families and cross-property transfer.

The Alignment Problem in Constrained Code Generation

cs.SE · 2026-06-19 · unverdicted · novelty 7.0

Incomplete constrainers in constrained decoding push LLMs into low-probability program regions, making unconstrained decoding outperform constrained decoding on functional correctness across seven models and three benchmarks.

CODEBLOCK: Learning to Supervise Code at the Right Granularity

cs.LG · 2026-06-10 · unverdicted · novelty 7.0

CodeBlock partitions code responses into syntactically coherent blocks, scores them with generalized cross-entropy and data-flow signals, and applies sparse supervision to achieve higher pass@1 than full SFT using 1.9% of tokens on six benchmarks.

OpenRTLSet: A Fully Open-Source Dataset for Large Language Model-based Verilog Module Design

cs.CL · 2026-06-09 · unverdicted · novelty 7.0

OpenRTLSet supplies 131k+ Verilog samples with AI-generated descriptions to enable fine-tuning of LLMs for hardware module design.

PrivCode++: Latent-Conditioned Differentially Private Code Generation for Comprehensive Guarantees

cs.CR · 2026-06-08 · unverdicted · novelty 7.0

PrivCode++ introduces the first DP code generation method protecting both prompts and code via latent-conditioned two-stage training, claiming higher utility and stronger privacy than prior baselines.

Beyond Pass Rate: A Multilingual, Execution-Grounded Evaluation of Open Code LLMs

cs.AI · 2026-06-07 · unverdicted · novelty 7.0

Multilingual execution-grounded benchmark finds top open code LLM at 23.64% correctness versus 57.2% human baseline, with compile errors dominating 63% of failures.

SkelDPO: A Skeleton-Guided Direct Preference Optimization Framework for Efficient Code Generation

cs.SE · 2026-06-05 · unverdicted · novelty 7.0

SkelDPO improves code generation efficiency by 2-7% over prior DPO methods via joint preference losses on full code and efficiency-critical skeletons.

Synthetic Hallucinations, Real Gains: Hard Negatives from Frontier Models for FIM Hallucination Mitigation

cs.LG · 2026-06-02 · unverdicted · novelty 7.0

Using frontier models to synthesize plausible-but-wrong FIM completions as hard negatives for SFT improves Delulu exact match by +18.8 and edit similarity by +0.22 on Qwen2.5-Coder-7B while also lifting HumanEval-Infilling and SAFIM.

Trustworthy Software Project Generation : a Case Study with an Interactive Theorem Prover

cs.SE · 2026-05-25 · conditional · novelty 7.0

An LLM agent with Rocq backend automatically builds a verified RISC-V RV32I interpreter (1859 lines Rocq, 2848 lines extracted C++) that passes 265 tests and 12-hour fuzzing, while a Dafny backend fails.

CopT: Contrastive On-Policy Thinking with Continuous Spaces for General and Agentic Reasoning

cs.CL · 2026-05-19 · unverdicted · novelty 7.0

CopT reverses CoT by eliciting a draft answer first then using continuous-embedding contrastive verification and on-policy thinking to reflect and correct, yielding up to 23% higher accuracy and 57% fewer tokens without training.

Constrained Code Generation with Discrete Diffusion

cs.CL · 2026-05-16 · unverdicted · novelty 7.0

Constrained Diffusion for Code (CDC) integrates constraint satisfaction into the reverse denoising process of discrete diffusion models via constraint-aware operators that use optimization and program analysis to steer generation toward feasible programs.

Is Agentic AI Ready for Real-World Hardware Engineering? A Deep Dive with Phoenix-bench

cs.AR · 2026-05-13 · unverdicted · novelty 7.0

Phoenix-bench shows agentic AI systems lose 37-58% resolved rate when moving from SWE-bench Verified to hardware tasks because bugs spread across parallel modules via signal flow, with testbench feedback lifting performance by 42-45% while file-level oracles add only 1.4%.

StepCodeReasoner: Aligning Code Reasoning with Stepwise Execution Traces via Reinforcement Learning

cs.SE · 2026-05-12 · unverdicted · novelty 7.0

StepCodeReasoner aligns code reasoning with verifiable stepwise execution traces via print anchors and bi-level GRPO reinforcement learning, reaching SOTA results on CRUXEval (91.1%) and LiveCodeBench (86.5%) for a 7B model.

MeshFIM: Local Low-Poly Mesh Editing via Fill-in-the-Middle Autoregressive Generation

cs.GR · 2026-05-09 · unverdicted · novelty 7.0

MeshFIM enables local low-poly mesh editing by autoregressively filling target regions conditioned on context, using boundary markers, positional embeddings, and a gated geometry encoder to enforce attachment, topology, and region limits.

Beyond Negative Rollouts: Positive-Only Policy Optimization with Implicit Negative Gradients

cs.CL · 2026-05-07 · unverdicted · novelty 7.0

POPO uses bounded importance sampling on positive rollouts and a siamese policy network to achieve implicit negative gradients and stable optimization, matching or exceeding GRPO on math benchmarks such as 36.67% on AIME 2025.

Delta-Based Neural Architecture Search: LLM Fine-Tuning via Code Diffs

cs.LG · 2026-05-06 · unverdicted · novelty 7.0

Fine-tuned 7B LLMs generating unified diffs for neural architecture refinement achieve 66-75% valid rates and 64-66% mean first-epoch accuracy, outperforming full-generation baselines by large margins while cutting output length by 75-85%.

PuzzleMark: Implicit Jigsaw Learning for Robust Code Dataset Watermarking in Neural Code Completion Models

cs.SE · 2026-04-30 · unverdicted · novelty 7.0

PuzzleMark provides a robust and imperceptible watermarking method for code datasets using adaptive variable name concatenation and statistical verification, achieving perfect detection rates with minimal performance impact.

RepoDoc: A Knowledge Graph-Based Framework to Automatic Documentation Generation and Incremental Updates

cs.SE · 2026-04-29 · unverdicted · novelty 7.0

RepoDoc uses a repository knowledge graph with module clustering and semantic impact propagation to generate more complete documentation 3x faster with 85% fewer tokens and handle incremental updates 73% faster than prior LLM-based tools.

EXPO-SQL: Execution-based Clause-level Policy Optimization for Text-to-SQL

cs.CL · 2026-04-29 · unverdicted · novelty 7.0

EXPO-SQL improves Text-to-SQL by using clause-level rewards derived from execution error messages and incremental clause execution instead of uniform query-level rewards.

citing papers explorer

Showing 16 of 16 citing papers after filters.

AI-Assisted Peer Review at Scale: The AAAI-26 AI Review Pilot cs.AI · 2026-04-15 · conditional · none · ref 6 · internal anchor
AI reviews for all 22,977 AAAI-26 papers were preferred by authors and PC members over human reviews on accuracy and suggestions and outperformed baselines at spotting weaknesses.
Trustworthy Software Project Generation : a Case Study with an Interactive Theorem Prover cs.SE · 2026-05-25 · conditional · none · ref 1 · internal anchor
An LLM agent with Rocq backend automatically builds a verified RISC-V RV32I interpreter (1859 lines Rocq, 2848 lines extracted C++) that passes 265 tests and 12-hour fuzzing, while a Dafny backend fails.
PlayCoder: Making LLM-Generated GUI Code Playable cs.SE · 2026-04-21 · conditional · none · ref 14 · internal anchor
PlayCoder raises the rate of LLM-generated GUI apps that can be played end-to-end without logic errors from near zero to 20.3% Play@3 by adding repository-aware generation, agent-driven testing, and iterative repair.
Efficient Remote KV Cache Reuse with GPU-native Video Codec cs.DC · 2026-02-10 · conditional · none · ref 32 · internal anchor
KVCodec uses GPU-native video codecs and pipelined fetching to compress and transmit KV caches, delivering up to 3.51x faster TTFT than prior methods while preserving accuracy.
"Tab, Tab, Bug": Security Pitfalls of Next Edit Suggestions in AI-Integrated IDEs cs.CR · 2026-02-06 · conditional · none · ref 18 · internal anchor
NES systems in AI IDEs expand attack surfaces via context poisoning from imperceptible actions and global codebase retrieval, with professional developers largely unaware of the risks.
MemRepair: Hierarchical Memory for Agentic Repository-Level Vulnerability Repair cs.SE · 2026-05-17 · conditional · none · ref 15 · internal anchor
MemRepair is a hierarchical memory-augmented agent framework that raises repository-level vulnerability repair rates to 58.0-58.2% on Python/Go/JS benchmarks and 30.58% on C++ by combining history, pattern, and refinement memories with iterative feedback.
Revisiting DAgger in the Era of LLM-Agents cs.LG · 2026-05-13 · conditional · none · ref 15 · internal anchor
DAgger-style training with turn-level policy interpolation raises 4B and 8B LLM agents to 27.3% and 29.8% on SWE-bench Verified, beating several larger published systems.
Defective Task Descriptions in LLM-Based Code Generation: Detection and Analysis cs.SE · 2026-04-27 · conditional · none · ref 10 · internal anchor
SpecValidator detects lexical vagueness, under-specification, and syntax-formatting defects in LLM code-generation prompts with F1 0.804, outperforming GPT-5-mini and Claude Sonnet 4, and shows that under-specification is the most damaging defect type while richer benchmarks are more resilient.
Enhancing LLM-Based Neural Network Generation: Few-Shot Prompting and Efficient Validation for Automated Architecture Design cs.CV · 2025-12-30 · conditional · none · ref 5 · internal anchor
Three-example few-shot prompting optimizes LLM-generated vision architectures while a whitespace-normalized hash provides 100x faster duplicate detection than AST parsing across seven benchmarks.
Fine-Tuning Code Language Models to Detect Cross-Language Bugs cs.SE · 2025-07-29 · conditional · none · ref 23 · internal anchor
Fine-tuning 13 CodeLMs on a constructed CLB dataset with nine interaction types improves detection, with UniXcoder-base reaching F1 0.7407 and small models outperforming large ones.
EXPEREPAIR: Dual-Memory Enhanced LLM-based Repository-Level Program Repair cs.SE · 2025-06-12 · conditional · none · ref 13 · internal anchor
ExpeRepair improves LLM-based repository-level program repair by maintaining episodic memory of concrete fixes and semantic memory of abstract insights, reaching 60.3% and 74.6% pass@1 on SWE-Bench Lite and Verified.
Process Reinforcement through Implicit Rewards cs.LG · 2025-02-03 · conditional · none · ref 140 · internal anchor
PRIME enables online process reward model updates in LLM RL using implicit rewards from rollouts and outcome labels, yielding 15.1% average gains on reasoning benchmarks and surpassing a stronger instruct model with 10% of the data.
How Does Chunking Affect Retrieval-Augmented Code Completion? A Controlled Empirical Study cs.SE · 2026-05-06 · conditional · none · ref 6 · internal anchor
Function-based chunking underperforms other strategies in RAG code completion by 3.57-5.64 points, with context length as the dominant factor.
RealBench: A Repo-Level Code Generation Benchmark Aligned with Real-World Software Development Practices cs.SE · 2026-04-24 · conditional · none · ref 17 · internal anchor
RealBench is a repo-level code generation benchmark pairing UML diagrams with natural language requirements, revealing that LLMs perform significantly worse on realistic repo-level tasks than existing benchmarks suggest.
Enhancing Large Language Model-Based Systems for End-to-End Circuit Analysis Problem Solving cs.CY · 2025-12-10 · conditional · none · ref 9 · internal anchor
Hybrid pipeline using YOLO vision and ngspice verification raises circuit analysis accuracy from Gemini's 79.52% baseline to 97.59%, with similar gains on hand-drawn diagrams.
CHESS: Contextual Harnessing for Efficient SQL Synthesis cs.LG · 2024-05-27 · conditional · none · ref 79 · internal anchor
CHESS deploys four LLM agents to retrieve information, prune schemas, generate refined SQL candidates, and validate via unit tests, reporting up to 71.10% accuracy on BIRD with 83% fewer calls than leading proprietary baselines.

DeepSeek-Coder: When the Large Language Model Meets Programming -- The Rise of Code Intelligence

hub tools

citation-role summary

citation-polarity summary

claims ledger

authors

co-cited works

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer