Recognition: 3 theorem links · Lean Theorem
DeepSeek-Coder: When the Large Language Model Meets Programming -- The Rise of Code Intelligence
Pith reviewed 2026-05-10 17:19 UTC · model grok-4.3
The pith
Open-source code models trained on 2 trillion tokens surpass Codex and GPT-3.5 on programming benchmarks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We present the DeepSeek-Coder series of open-source code models with sizes from 1.3B to 33B, trained from scratch on 2 trillion tokens. Pre-trained on a high-quality project-level code corpus and employing a fill-in-the-blank task with a 16K window, the models achieve state-of-the-art performance among open-source code models and surpass existing closed-source models like Codex and GPT-3.5 across multiple benchmarks. The models are released under a permissive license that allows both research and unrestricted commercial use.
What carries the argument
The DeepSeek-Coder series, built by pre-training on a high-quality project-level code corpus and applying a 16K-context fill-in-the-blank objective that strengthens code generation and infilling.
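The infilling objective is easy to picture as a data transformation: each training document is split into a prefix, a middle, and a suffix, and the model learns to reconstruct the middle given the other two. A minimal sketch, with placeholder sentinel strings standing in for the model's actual special tokens (which live in its tokenizer vocabulary):

```python
import random

# Illustrative sentinels; DeepSeek-Coder's real FIM tokens differ.
FIM_BEGIN, FIM_HOLE, FIM_END = "<fim_begin>", "<fim_hole>", "<fim_end>"

def to_fim_example(document: str, rng: random.Random) -> str:
    """Split a training document at two random positions and emit a
    prefix-suffix-middle (PSM) sample: the model sees prefix and suffix,
    then learns to generate the missing middle via next-token prediction."""
    a, b = sorted(rng.sample(range(len(document) + 1), 2))
    prefix, middle, suffix = document[:a], document[a:b], document[b:]
    return f"{FIM_BEGIN}{prefix}{FIM_HOLE}{suffix}{FIM_END}{middle}"
```

In the PSM layout shown, the ground-truth middle follows the final sentinel, so ordinary left-to-right training yields an infilling capability.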
If this is right
- Researchers can freely study, modify, and build upon models that match or exceed leading closed-source code systems.
- Commercial developers can integrate the models into products without licensing fees or usage restrictions.
- The demonstrated gains from project-level data and long-context infilling training provide a concrete recipe others can replicate or extend.
- Continued open development of these models can accelerate progress in automated programming assistance.
Where Pith is reading between the lines
- Training on complete projects instead of isolated functions may be necessary for models to manage the dependencies found in actual software systems.
- The permissive license could encourage community-driven refinements that mirror the evolution of open-source software ecosystems.
- Further increases in context length beyond 16K tokens could yield additional improvements when models process entire codebases.
Load-bearing premise
The chosen benchmarks and evaluation protocol give a fair and generalizable measure of code intelligence that extends beyond the specific test sets used.
What would settle it
An independent evaluation on a fresh collection of real-world coding tasks drawn from projects completed after the training data cutoff, showing DeepSeek-Coder underperforming GPT-3.5, would falsify the superiority claim.
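Such an evaluation reduces to functional correctness on fresh tasks: run each model completion against the task's hidden tests and count passes. A minimal in-process sketch; a real harness (e.g. the one released with HumanEval) sandboxes execution in a subprocess with timeouts and resource limits rather than calling exec() directly:

```python
def passes_tests(candidate_src: str, test_src: str) -> bool:
    """Return True if the candidate completion satisfies the task's unit
    tests. Sketch only: executes untrusted code in-process, which a real
    evaluation harness must never do."""
    namespace: dict = {}
    try:
        exec(candidate_src, namespace)   # define the candidate function
        exec(test_src, namespace)        # run the task's assertions
        return True
    except Exception:
        return False
```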
read the original abstract
The rapid development of large language models has revolutionized code intelligence in software development. However, the predominance of closed-source models has restricted extensive research and development. To address this, we introduce the DeepSeek-Coder series, a range of open-source code models with sizes from 1.3B to 33B, trained from scratch on 2 trillion tokens. These models are pre-trained on a high-quality project-level code corpus and employ a fill-in-the-blank task with a 16K window to enhance code generation and infilling. Our extensive evaluations demonstrate that DeepSeek-Coder not only achieves state-of-the-art performance among open-source code models across multiple benchmarks but also surpasses existing closed-source models like Codex and GPT-3.5. Furthermore, DeepSeek-Coder models are under a permissive license that allows for both research and unrestricted commercial use.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces the DeepSeek-Coder series of open-source code LLMs (1.3B to 33B parameters) trained from scratch on 2 trillion tokens of high-quality project-level code data, using a fill-in-the-blank task with a 16K context window. It claims these models achieve state-of-the-art performance among open-source code models across multiple benchmarks and surpass closed-source models such as Codex and GPT-3.5, while being released under a permissive license.
Significance. If the reported performance gains hold after accounting for evaluation details and data contamination risks, this would represent a meaningful advance in open code intelligence, providing accessible high-performing models that could support broader research and commercial applications in software development.
major comments (2)
- [Evaluation section] The central claim that DeepSeek-Coder surpasses closed-source models like Codex and GPT-3.5 (and achieves SOTA among open models) is asserted without any reported details on the specific benchmarks (e.g., HumanEval, MBPP), data splits, pass@k computation, number of runs, error bars, or exact comparison methodology. This absence makes the headline empirical results unverifiable from the manuscript.
- [Training data section] The 2-trillion-token project-level corpus is described at a high level with no quantitative decontamination statistics, overlap analysis, or membership-inference results relative to standard code benchmarks. Without this, performance improvements cannot be confidently distinguished from potential memorization or leakage, which is load-bearing for the generalization claim.
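For the pass@k detail the first comment requests, the standard choice is the unbiased estimator from the Codex paper: sample n completions per task, count the c that pass, and estimate the probability that at least one of k draws passes. Whether the manuscript uses exactly this estimator is the point under review; as a reference implementation:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., Codex): given n sampled
    completions of which c pass the tests, estimate the probability that
    at least one of k samples passes, as 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        return 1.0  # fewer than k failing samples: every k-subset has a pass
    return 1.0 - comb(n - c, k) / comb(n, k)
```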
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback on our manuscript. We have reviewed the major comments carefully and provide point-by-point responses below. We will incorporate revisions to address the concerns about evaluation details and training data analysis.
read point-by-point responses
-
Referee: [Evaluation section] The central claim that DeepSeek-Coder surpasses closed-source models like Codex and GPT-3.5 (and achieves SOTA among open models) is asserted without any reported details on the specific benchmarks (e.g., HumanEval, MBPP), data splits, pass@k computation, number of runs, error bars, or exact comparison methodology. This absence makes the headline empirical results unverifiable from the manuscript.
Authors: We appreciate this observation on verifiability. The Evaluation section of the manuscript reports results across benchmarks including HumanEval and MBPP using the pass@k metric and provides comparisons to Codex and GPT-3.5 based on their published numbers. However, we agree that additional explicit details would strengthen the presentation. In the revised manuscript, we will expand the section to clearly specify the data splits (standard test sets), the exact pass@k computation method, the number of runs performed, inclusion of error bars or variance measures, and the precise comparison protocol. These changes will make the empirical results more transparent without altering the reported outcomes.
revision: yes
-
Referee: [Training data section] The 2-trillion-token project-level corpus is described at a high level with no quantitative decontamination statistics, overlap analysis, or membership-inference results relative to standard code benchmarks. Without this, performance improvements cannot be confidently distinguished from potential memorization or leakage, which is load-bearing for the generalization claim.
Authors: We agree that quantitative decontamination evidence is essential to support claims of generalization. The Training data section describes the project-level corpus construction and quality filtering at a high level. To address this, we will add a dedicated analysis subsection in the revised manuscript that includes overlap statistics with standard benchmarks such as HumanEval and MBPP. We will also discuss the implications for potential leakage and why the scale and diversity of the corpus support generalization. Membership inference was not performed in the original work; we will note this limitation explicitly and suggest it as future work.
revision: yes
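The overlap statistics promised in the rebuttal reduce to an n-gram intersection between training documents and benchmark prompts or solutions. A minimal sketch, where the 10-gram window and whitespace tokenization are illustrative defaults rather than the paper's actual pipeline:

```python
def ngram_set(text: str, n: int = 10) -> set:
    """All length-n windows over whitespace tokens of `text`."""
    toks = text.split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def is_contaminated(train_doc: str, benchmark_items: list, n: int = 10) -> bool:
    """Flag a training document that shares any n-gram with a benchmark
    item. The choice of n and tokenizer varies across published
    decontamination pipelines; this picks simple defaults."""
    doc_grams = ngram_set(train_doc, n)
    return any(doc_grams & ngram_set(item, n) for item in benchmark_items)
```

Documents flagged this way would be dropped or reported as overlap counts per benchmark.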
Circularity Check
No circularity: purely empirical benchmark reporting
full rationale
The paper presents an empirical training run (2T tokens, project-level corpus, 16K fill-in-the-blank objective) followed by direct reporting of pass@k scores on standard benchmarks (HumanEval, MBPP, etc.). No equations, derivations, or parameter-fitting steps are described that would reduce the headline performance claims back to quantities defined on the same evaluation data. No self-citation chains, uniqueness theorems, or ansatzes are invoked to support the central results; the claims rest on external, publicly known benchmark protocols rather than internal redefinitions.
Axiom & Free-Parameter Ledger
free parameters (3)
- Model parameter counts (1.3B-33B)
- Training corpus size (2 trillion tokens)
- Context window length (16K tokens)
axioms (2)
- domain assumption: Pretraining on high-quality project-level code improves code generation and infilling
- domain assumption: Standard code benchmarks are valid proxies for real-world code intelligence
Lean theorems connected to this paper
-
IndisputableMonolith/Foundation/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear · "We introduce the DeepSeek-Coder series... trained from scratch on 2 trillion tokens... high-quality project-level code corpus and employ a fill-in-the-blank task with a 16K window"
-
IndisputableMonolith/Foundation/Atomicity.lean · exists_sequential_schedule · echoes · "Algorithm 1 Topological Sort for Dependency Analysis"
-
IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear · "We have carried out comprehensive experiments using a variety of public code-related benchmarks... DeepSeek-Coder-Instruct 33B surpasses OpenAI GPT-3.5 Turbo"
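The "Algorithm 1 Topological Sort for Dependency Analysis" quoted from the paper orders a repository's files so that dependencies precede dependents before the files are concatenated into one long training sample. A minimal Kahn's-algorithm sketch; the file names and import map below are hypothetical, and the paper's tie-breaking is unspecified:

```python
from collections import deque

def dependency_order(imports: dict) -> list:
    """Kahn's algorithm over an import graph. `imports` maps each file to
    the set of files it depends on (every dependency is assumed to also be
    a key). The returned order places dependencies before dependents, so a
    concatenated project-level sample reads definitions before uses."""
    dependents = {f: [] for f in imports}
    indeg = {f: len(deps) for f, deps in imports.items()}
    for f, deps in imports.items():
        for d in deps:
            dependents[d].append(f)
    queue = deque(f for f, n in indeg.items() if n == 0)
    order = []
    while queue:
        d = queue.popleft()
        order.append(d)
        for f in dependents[d]:
            indeg[f] -= 1
            if indeg[f] == 0:
                queue.append(f)
    if len(order) != len(imports):
        raise ValueError("cyclic imports: no valid ordering")
    return order
```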
Forward citations
Cited by 60 Pith papers
-
AI-Assisted Peer Review at Scale: The AAAI-26 AI Review Pilot
AI reviews for all 22,977 AAAI-26 papers were preferred by authors and PC members over human reviews on accuracy and suggestions and outperformed baselines at spotting weaknesses.
-
Hackers or Hallucinators? A Comprehensive Analysis of LLM-Based Automated Penetration Testing
The first SoK on LLM-based AutoPT frameworks provides a six-dimension taxonomy of agent designs and a unified empirical benchmark evaluating 15 frameworks via over 10 billion tokens and 1,500 manually reviewed logs.
-
StepCodeReasoner: Aligning Code Reasoning with Stepwise Execution Traces via Reinforcement Learning
StepCodeReasoner aligns code reasoning with verifiable stepwise execution traces via print anchors and bi-level GRPO reinforcement learning, reaching SOTA results on CRUXEval (91.1%) and LiveCodeBench (86.5%) for a 7B model.
-
MeshFIM: Local Low-Poly Mesh Editing via Fill-in-the-Middle Autoregressive Generation
MeshFIM enables local low-poly mesh editing by autoregressively filling target regions conditioned on context, using boundary markers, positional embeddings, and a gated geometry encoder to enforce attachment, topolog...
-
Beyond Negative Rollouts: Positive-Only Policy Optimization with Implicit Negative Gradients
POPO uses bounded importance sampling on positive rollouts and a siamese policy network to achieve implicit negative gradients and stable optimization, matching or exceeding GRPO on math benchmarks such as 36.67% on A...
-
Delta-Based Neural Architecture Search: LLM Fine-Tuning via Code Diffs
Fine-tuned 7B LLMs generating unified diffs for neural architecture refinement achieve 66-75% valid rates and 64-66% mean first-epoch accuracy, outperforming full-generation baselines by large margins while cutting ou...
-
Beyond Translation Accuracy: Addressing False Failures in LLM-Based Code Translation
Many reported failures in LLM-based code translation are false negatives due to evaluation pipeline issues such as improper compilation flags, missing library links, and unconfigured runtime environments rather than i...
-
PuzzleMark: Implicit Jigsaw Learning for Robust Code Dataset Watermarking in Neural Code Completion Models
PuzzleMark provides a robust and imperceptible watermarking method for code datasets using adaptive variable name concatenation and statistical verification, achieving perfect detection rates with minimal performance impact.
-
RepoDoc: A Knowledge Graph-Based Framework to Automatic Documentation Generation and Incremental Updates
RepoDoc uses a repository knowledge graph with module clustering and semantic impact propagation to generate more complete documentation 3x faster with 85% fewer tokens and handle incremental updates 73% faster than p...
-
When Prompt Under-Specification Improves Code Correctness: An Exploratory Study of Prompt Wording and Structure Effects on LLM-Based Code Generation
Structurally rich task descriptions make LLMs robust to prompt under-specification, and under-specification can enhance code correctness by disrupting misleading lexical or structural cues.
-
Aligned Multi-View Scripts for Universal Chart-to-Code Generation
Introduces an aligned multi-language dataset and a language-conditioned low-rank adapter for generating executable plotting code in Python, R, and LaTeX from chart images.
-
PlayCoder: Making LLM-Generated GUI Code Playable
PlayCoder raises the rate of LLM-generated GUI apps that can be played end-to-end without logic errors from near zero to 20.3% Play@3 by adding repository-aware generation, agent-driven testing, and iterative repair.
-
Cascaded Code Editing: Large-Small Model Collaboration for Effective and Efficient Code Editing
A cascaded large-small model system generates edit sketches with the large model and applies them with the small model to make code editing both accurate and token-efficient.
-
IG-Search: Step-Level Information Gain Rewards for Search-Augmented Reasoning
IG-Search computes step-level information gain rewards from policy probabilities to improve credit assignment in RL training for search-augmented QA, yielding 1.6-point gains over trajectory-level baselines on multi-h...
-
Evaluating LLMs Code Reasoning Under Real-World Context
R2Eval is a new benchmark with 135 real-world code reasoning problems from Python projects that preserves complex data structures for more realistic LLM evaluation.
-
Structural Anchors and Reasoning Fragility:Understanding CoT Robustness in LLM4Code
CoT prompting in LLM4Code shows mixed robustness that depends on model family, task structure, and perturbations destabilizing structural anchors, leading to trajectory deformations like lengthening, branching, and si...
-
An Iterative Test-and-Repair Framework for Competitive Code Generation
FixAudit improves LLM code generation on competitive programming benchmarks by training a shared model for iterative code-aware test generation and repair, achieving 35%+ gains in Pass@1 over baselines on the same 7B model.
-
Think Anywhere in Code Generation
Think-Anywhere lets LLMs invoke on-demand reasoning at any token during code generation via cold-start imitation followed by outcome-based RL, reaching state-of-the-art results on LeetCode, LiveCodeBench, HumanEval, and MBPP.
-
SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering
SWE-agent introduces a custom agent-computer interface that lets LM agents solve software engineering tasks, reaching 12.5% pass@1 on SWE-bench and 87.7% on HumanEvalFix, exceeding prior non-interactive approaches.
-
Revisiting DAgger in the Era of LLM-Agents
DAgger-style training with turn-level policy interpolation raises 4B and 8B LLM agents to 27.3% and 29.8% on SWE-bench Verified, beating several larger published systems.
-
Uncertainty Quantification for LLM-based Code Generation
RisCoSet applies multiple hypothesis testing to construct risk-controlling partial-program prediction sets for LLM code generation, achieving up to 24.5% less code removal than prior methods at equivalent risk levels.
-
Securing the Dark Matter: A Semantic-Enhanced Neuro-Symbolic Framework for Supply Chain Analysis of Opaque Industrial Software
A neuro-symbolic framework reconstructs semantics from opaque binaries via abstract interpretation, reflexive LLM prompting, typed knowledge graphs, and Graphormer reasoning to outperform baselines in vulnerability de...
-
SynConfRoute: Syntax-Aware Routing for Efficient Code Completion with Small CodeLLMs
SynConfRoute routes code completions using syntax validation and token confidence, improving pass@1 by up to 31% on hard tasks and reducing accelerator usage by 58% versus always using the largest model.
-
Bridging Generation and Training: A Systematic Review of Quality Issues in LLMs for Code
A review of 114 studies creates taxonomies for code and data quality issues, formalizes 18 propagation mechanisms from training data defects to LLM-generated code defects, and synthesizes detection and mitigation techniques.
-
LLM Ghostbusters: Surgical Hallucination Suppression via Adaptive Unlearning
Adaptive Unlearning suppresses package hallucinations in code-generating LLMs by 81% while preserving benchmark performance, using model-generated data and no human labels.
-
Improving LLM Code Generation via Requirement-Aware Curriculum Reinforcement Learning
REC RL improves LLM code generation by automatically assessing and optimizing requirement difficulty with adaptive curriculum sampling, yielding 1.23-5.62% Pass@1 gains over baselines.
-
Defective Task Descriptions in LLM-Based Code Generation: Detection and Analysis
SpecValidator detects lexical vagueness, under-specification, and syntax-formatting defects in LLM code-generation prompts with F1 0.804, outperforming GPT-5-mini and Claude Sonnet 4, and shows that under-specificatio...
-
Leveraging LLMs for Multi-File DSL Code Generation: An Industrial Case Study
Fine-tuning 7B code LLMs on a custom multi-file DSL dataset achieves structural fidelity of 1.00, high exact-match accuracy, and practical utility validated by expert survey and execution checks.
-
MEMCoder: Multi-dimensional Evolving Memory for Private-Library-Oriented Code Generation
MEMCoder boosts LLM code generation for private libraries by 16.31% pass@1 via a multi-dimensional evolving memory that distills usage guidelines from execution feedback and combines them with static docs.
-
Do Synthetic Trajectories Reflect Real Reward Hacking? A Systematic Study on Monitoring In-the-Wild Hacking in Code Generation
Synthetic reward hacking data does not capture natural hacking behaviors in code generation RL, causing monitors trained on it to generalize poorly compared to those trained on in-the-wild trajectories.
-
RealBench: A Repo-Level Code Generation Benchmark Aligned with Real-World Software Development Practices
RealBench is a new repo-level code generation benchmark that adds UML diagrams to natural language specs, showing LLMs struggle more at full repositories, create modules with errors, and perform best with whole-repo g...
-
Hybrid Policy Distillation for LLMs
Hybrid Policy Distillation unifies existing knowledge distillation methods for LLMs into a reweighted log-likelihood objective and introduces a hybrid forward-reverse KL approach with mixed data sampling to improve st...
-
PARM: Pipeline-Adapted Reward Model
PARM adapts reward models to multi-stage LLM pipelines via pipeline data and direct preference optimization, improving execution rate and solving accuracy on optimization benchmarks and showing transfer to GSM8K.
-
CodePivot: Bootstrapping Multilingual Transpilation in LLMs via Reinforcement Learning without Parallel Corpora
CodePivot uses Python as a pivot language plus an Aggressive-Partial-Functional RL reward to train a 7B model that outperforms much larger LLMs on multilingual code transpilation without parallel corpora.
-
Co-evolving Agent Architectures and Interpretable Reasoning for Automated Optimization
EvoOR-Agent co-evolves agent architectures as AOE-style networks with graph-mediated recombination and knowledge-base-assisted mutation to outperform fixed LLM pipelines on OR benchmarks.
-
MATRIX: Multi-Layer Code Watermarking via Dual-Channel Constrained Parity-Check Encoding
MATRIX embeds multi-layer watermarks in LLM-generated code via dual-channel constrained parity-check encoding, achieving 99.2% detection accuracy with 0-0.14% functionality loss and 7.7-26.67% better attack robustness...
-
On the Effectiveness of Context Compression for Repository-Level Tasks: An Empirical Investigation
Continuous latent-vector compression improves BLEU scores on repository-level code tasks by up to 28.3% at 4x compression while cutting inference latency.
-
TOPCELL: Topology Optimization of Standard Cell via LLMs
TOPCELL reformulates standard cell topology optimization as an LLM generative task with GRPO fine-tuning, outperforming base models and matching exhaustive solvers with 85.91x speedup in 2nm/7nm industrial flows.
-
CoDe-R: Refining Decompiler Output with LLMs via Rationale Guidance and Adaptive Inference
CoDe-R refines LLM decompiler output via rationale-guided semantic injection and dynamic fallback inference, making a 1.3B model the first to exceed 50% average re-executability on HumanEval-Decompile.
-
Structured Safety Auditing for Balancing Code Correctness and Content Safety in LLM-Generated Code
Dual Reasoning with explicit safety audits improves the new SUDS metric by 1.32x to 3.42x over baselines on code generation benchmarks containing injected harmful keywords.
-
Verify Before You Fix: Agentic Execution Grounding for Trustworthy Cross-Language Code Analysis
A framework combining universal AST normalization, hybrid graph-LLM embeddings, and strict execution-grounded validation achieves 89-92% intra-language accuracy and 74-80% cross-language F1 while resolving 70% of vuln...
-
DuCodeMark: Dual-Purpose Code Dataset Watermarking via Style-Aware Watermark-Poison Design
DuCodeMark watermarks code datasets using AST style transformations and repressible poisons for both source-code and decompilation tasks, verified by t-test, with high stealth and a 28.6% performance drop if removed.
-
Strix: Re-thinking NPU Reliability from a System Perspective
Strix delivers sub-microsecond fault localisation, detection, and correction on NPUs with 1.04x slowdown and minimal hardware cost by system-level re-partitioning and targeted safeguards.
-
When LLMs Lag Behind: Knowledge Conflicts from Evolving APIs in Code Generation
LLMs produce executable code only 42.55% of the time under API evolution without full documentation, improving to 66.36% with structured docs and by 11% more with reasoning strategies, yet outdated patterns persist.
-
REAgent: Requirement-Driven LLM Agents for Software Issue Resolution
REAgent improves LLM patch generation for software issues by 17.4% on average through automated construction, quality checking, and iterative refinement of structured issue-oriented requirements.
-
PAFT: Preservation Aware Fine-Tuning for Minimal-Edit Program Repair
PAFT improves LLM-based program repair pass rates by up to 65.6% while cutting average edit distance by up to 32.6% through explicit preservation signals and curriculum training.
-
Runtime Execution Traces Guided Automated Program Repair with Multi-Agent Debate
TraceRepair deploys a probe agent for runtime snapshots and a committee of agents for cross-verification to fix 392 defects on Defects4J, outperforming prior LLM-based automated program repair methods.
-
A Taxonomy of Programming Languages for Code Generation
The researchers provide a systematic 4-tier classification of 646 programming languages, quantifying the extreme data scarcity facing over 70% of the world's programming languages in the age of LLMs.
-
Process Reinforcement through Implicit Rewards
PRIME enables online process reward model updates in LLM RL using implicit rewards from rollouts and outcome labels, yielding 15.1% average gains on reasoning benchmarks and surpassing a stronger instruct model with 1...
-
LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code
LiveCodeBench collects 400 recent contest problems to create a contamination-free benchmark evaluating LLMs on code generation and related capabilities like self-repair and execution.
-
StarCoder 2 and The Stack v2: The Next Generation
StarCoder2-15B matches or beats CodeLlama-34B on code tasks despite being smaller, and StarCoder2-3B outperforms prior 15B models, with open weights and exact training data identifiers released.
-
The Readability Spectrum: Patterns, Issues, and Prompt Effects in LLM-Generated Code
LLM-generated code matches human-written code in overall readability but exhibits different issue patterns, and prompt engineering has limited impact on improving it.
-
How Does Chunking Affect Retrieval-Augmented Code Completion? A Controlled Empirical Study
Function-based chunking underperforms other strategies in RAG code completion by 3.57-5.64 points, with context length as the dominant factor.
-
A Validated Prompt Bank for Malicious Code Generation: Separating Executable Weapons from Security Knowledge in 1,554 Consensus-Labeled Prompts
The paper releases a 1,554-prompt consensus-labeled bank separating executable malicious code requests from security knowledge requests, validated by five-model majority labeling with Fleiss' kappa of 0.876.
-
Beyond Translation Accuracy: Addressing False Failures in LLM-Based Code Translation
A large-scale study finds that many LLM code translation failures are false negatives due to improper evaluation configurations rather than incorrect translations.
-
Learning Generalizable Multimodal Representations for Software Vulnerability Detection
MultiVul uses multimodal contrastive learning to align code and comment representations, yielding up to 27% F1 gains on vulnerability detection benchmarks over prompting and code-only baselines.
-
PLMGH: What Matters in PLM-GNN Hybrids for Code Classification and Vulnerability Detection
Controlled experiments show PLM-GNN hybrids improve code tasks over GNN-only baselines, with PLM source having larger impact than GNN backbone.
-
KISS Sorcar: A Stupidly-Simple General-Purpose and Software Engineering AI Assistant
KISS Sorcar introduces a simple layered agent framework and VS Code IDE that reaches 62.2% pass rate on Terminal Bench 2.0 by combining ReAct execution, summarization-based continuation, parallel tools, persistent his...
-
VerilogCL: A Contrastive Learning Framework for Robust LLM-Based Verilog Generation
VerilogCL applies contrastive learning with minimal-error data pairs and a proactive screening module to improve compilation success and functional correctness of 7B LLM-generated Verilog over open-source and commerci...
-
Understanding Secret Leakage Risks in Code LLMs: A Tokenization Perspective
BPE tokenization creates gibberish bias in CLLMs, causing secrets with high character entropy but low token entropy to be preferentially memorized due to training data distribution shifts.
Reference graph
Works this paper leans on
- [1]
-
[2]
Efficient training of language models to fill in the middle
M. Bavarian, H. Jun, N. Tezak, J. Schulman, C. McLeavey, J. Tworek, and M. Chen. Efficient training of language models to fill in the middle. arXiv preprint arXiv:2207.14255, 2022.
-
[3]
Evaluating Large Language Models Trained on Code
M. Chen, J. Tworek, H. Jun, Q. Yuan, H. P. d. O. Pinto, J. Kaplan, H. Edwards, Y. Burda, N. Joseph, G. Brockman, et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021.
-
[4]
Extending Context Window of Large Language Models via Positional Interpolation
S. Chen, S. Wong, L. Chen, and Y. Tian. Extending context window of large language models via positional interpolation. arXiv preprint arXiv:2306.15595, 2023.
-
[5]
Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge
P. Clark, I. Cowhey, O. Etzioni, T. Khot, A. Sabharwal, C. Schoenick, and O. Tafjord. Think you have solved question answering? Try ARC, the AI2 Reasoning Challenge. arXiv preprint arXiv:1803.05457, 2018.
-
[6]
Training Verifiers to Solve Math Word Problems
K. Cobbe, V. Kosaraju, M. Bavarian, M. Chen, H. Jun, L. Kaiser, M. Plappert, J. Tworek, J. Hilton, R. Nakano, et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021.
-
[7]
DeepSeek LLM: Scaling Open-Source Language Models with Longtermism
DeepSeek-AI. DeepSeek LLM: Scaling open-source language models with longtermism. arXiv preprint arXiv:2401.02954, 2024.
-
[8]
InCoder: A Generative Model for Code Infilling and Synthesis
D. Fried, A. Aghajanyan, J. Lin, S. Wang, E. Wallace, F. Shi, R. Zhong, W.-t. Yih, L. Zettlemoyer, and M. Lewis. InCoder: A generative model for code infilling and synthesis. arXiv preprint arXiv:2204.05999, 2022.
- [9]
-
[10]
Measuring Mathematical Problem Solving With the MATH Dataset
D. Hendrycks, C. Burns, S. Kadavath, A. Arora, S. Basart, E. Tang, D. Song, and J. Steinhardt. Measuring mathematical problem solving with the MATH dataset. arXiv preprint arXiv:2103.03874, 2021.
-
[11]
StarCoder: May the Source Be with You!
R. Li, L. B. Allal, Y. Zi, N. Muennighoff, D. Kocetkov, C. Mou, M. Marone, C. Akiki, J. Li, J. Chim, et al. StarCoder: may the source be with you! arXiv preprint arXiv:2305.06161, 2023.
-
[12]
CodeGen: An Open Large Language Model for Code with Multi-Turn Program Synthesis
E. Nijkamp, B. Pang, H. Hayashi, L. Tu, H. Wang, Y. Zhou, S. Savarese, and C. Xiong. CodeGen: An open large language model for code with multi-turn program synthesis. arXiv preprint arXiv:2203.13474, 2022.
- [13]
-
[14]
Code Llama: Open Foundation Models for Code
B. Roziere, J. Gehring, F. Gloeckle, S. Sootla, I. Gat, X. E. Tan, Y. Adi, J. Liu, T. Remez, J. Rapin, et al. Code Llama: Open foundation models for code. arXiv preprint arXiv:2308.12950, 2023.
-
[15]
Neural Machine Translation of Rare Words with Subword Units
R. Sennrich, B. Haddow, and A. Birch. Neural machine translation of rare words with subword units. arXiv preprint arXiv:1508.07909, 2015.
-
[16]
Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them
M. Suzgun, N. Scales, N. Schärli, S. Gehrmann, Y. Tay, H. W. Chung, A. Chowdhery, Q. V. Le, E. H. Chi, D. Zhou, and J. Wei. Challenging BIG-Bench tasks and whether chain-of-thought can solve them. arXiv preprint arXiv:2210.09261, 2022.
-
[17]
Llama 2: Open Foundation and Fine-Tuned Chat Models
H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y. Babaei, N. Bashlykov, S. Batra, P. Bhargava, S. Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.
- [18]