DeepSeek LLM: Scaling Open-Source Language Models with Longtermism

DeepSeek-AI: Xiao Bi, Deli Chen, Guanting Chen, Shanhuang Chen, Damai Dai, Chengqi Deng · 2024 · cs.CL · arXiv 2401.02954

39 Pith papers cite this work. Polarity classification is still indexing.

abstract

The rapid development of open-source large language models (LLMs) has been truly remarkable. However, the scaling law described in previous literature presents varying conclusions, which casts a dark cloud over scaling LLMs. We delve into the study of scaling laws and present our distinctive findings that facilitate scaling of large scale models in two commonly used open-source configurations, 7B and 67B. Guided by the scaling laws, we introduce DeepSeek LLM, a project dedicated to advancing open-source language models with a long-term perspective. To support the pre-training phase, we have developed a dataset that currently consists of 2 trillion tokens and is continuously expanding. We further conduct supervised fine-tuning (SFT) and Direct Preference Optimization (DPO) on DeepSeek LLM Base models, resulting in the creation of DeepSeek Chat models. Our evaluation results demonstrate that DeepSeek LLM 67B surpasses LLaMA-2 70B on various benchmarks, particularly in the domains of code, mathematics, and reasoning. Furthermore, open-ended evaluations reveal that DeepSeek LLM 67B Chat exhibits superior performance compared to GPT-3.5.

citation-role summary

baseline 1

citation-polarity summary

baseline 1

representative citing papers

Large Language Diffusion Models

cs.CL · 2025-02-14 · unverdicted · novelty 8.0

LLaDA is a scalable diffusion-based language model that matches autoregressive LLMs like LLaMA3 8B on tasks and surpasses GPT-4o on reversal poem completion.

IndicMedDialog: A Parallel Multi-Turn Medical Dialogue Dataset for Accessible Healthcare in Indic Languages

cs.CL · 2026-05-13 · unverdicted · novelty 7.0

A parallel multi-turn medical dialogue dataset spanning English and nine Indic languages is created from synthetic consultations to enable personalized AI healthcare interactions.

Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation

cs.CV · 2024-06-10 · conditional · novelty 7.0

Scaled vanilla autoregressive models based on Llama achieve 2.18 FID on ImageNet 256x256 image generation, beating popular diffusion models without visual inductive biases.

DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model

cs.CL · 2024-05-07 · unverdicted · novelty 7.0

DeepSeek-V2 delivers top-tier open-source LLM performance using only 21B active parameters by compressing the KV cache 93.3% and cutting training costs 42.5% via MLA and DeepSeekMoE.

SAGE: Scalable Automated Robustness Augmentation for LLM Knowledge Evaluation

cs.CL · 2026-05-12 · unverdicted · novelty 6.0

SAGE trains a rubric-based verifier and an RL-optimized generator on seed human data to scalably augment LLM knowledge benchmarks, matching human-annotated quality on HellaSwag at lower cost and generalizing to MMLU.

Causal Bias Detection in Generative Artifical Intelligence

cs.AI · 2026-05-12 · unverdicted · novelty 6.0

A causal framework unifies fairness analysis across generative AI and standard ML by deriving decompositions that separate biases along causal pathways and differences between real-world and model mechanisms.

Navigating the Sea of LLM Evaluation: Investigating Bias in Toxicity Benchmarks

cs.AI · 2026-05-11 · unverdicted · novelty 6.0

Toxicity benchmarks for LLMs produce inconsistent results when task type, input domain, or model changes, revealing intrinsic evaluation biases.

Training continuously-coupled reconfigurable photonic chips with quantum machine learning

quant-ph · 2026-05-11 · unverdicted · novelty 6.0

A black-box machine learning technique trains continuously-coupled photonic waveguide arrays to implement target unitaries using limited single- and two-photon measurements without requiring detailed internal models.

Predicting Large Model Test Losses with a Noisy Quadratic System

cs.LG · 2026-05-09 · unverdicted · novelty 6.0

A noisy quadratic system predicts large model test losses from N, B, K and outperforms Chinchilla's model for extrapolation up to 1000x compute.

DSPE: An Energy-Efficient Edge Processor for DeepSeek Inference with MerkleTree-based Incremental Pruning, Multi-Stage Boothing Lookup and Dynamic Adaptive Posit Processing

cs.AR · 2026-05-09 · unverdicted · novelty 6.0

DSPE is an edge processor that achieves 109.4 TFLOPS/W for DeepSeek inference using Merkle tree-based incremental pruning, multi-stage boothing lookup, and dynamic adaptive posit processing.

RELO: Reinforcement Learning to Localize for Visual Object Tracking

cs.CV · 2026-05-08 · unverdicted · novelty 6.0

RELO replaces handcrafted spatial priors with a reinforcement learning policy for target localization in visual tracking and reports 57.5% AUC on LaSOText without template updates.

Why Does Agentic Safety Fail to Generalize Across Tasks?

cs.LG · 2026-05-07 · conditional · novelty 6.0

Agentic safety fails to generalize across tasks because the task-to-safe-controller mapping has a higher Lipschitz constant than the task-to-controller mapping alone, as proven in linear-quadratic control and demonstrated in quadcopter and LLM experiments.

InfoLaw: Information Scaling Laws for Large Language Models with Quality-Weighted Mixture Data and Repetition

cs.CL · 2026-05-04 · unverdicted · novelty 6.0

InfoLaw models pretraining as information accumulation where quality sets information density and repetition causes scale-dependent diminishing returns, predicting loss with low error on unseen mixtures and larger scales up to 7B models and 425B tokens.

Rethinking LLM Ensembling from the Perspective of Mixture Models

cs.LG · 2026-05-01 · unverdicted · novelty 6.0

ME reinterprets LLM ensembling as a mixture model by sampling a single model stochastically at each token step, matching the ensemble distribution while invoking only one model per step for substantial speed gains.

ReaGeo: Reasoning-Enhanced End-to-End Geocoding with LLMs

cs.AI · 2026-04-23 · unverdicted · novelty 6.0

ReaGeo is an end-to-end LLM framework for geocoding that uses geohash text generation, Chain-of-Thought spatial reasoning, and distance-based RL to accurately predict points and regions from explicit and vague queries.

Universally Empowering Zeroth-Order Optimization via Adaptive Layer-wise Sampling

cs.LG · 2026-04-20 · unverdicted · novelty 6.0

AdaLeZO uses a non-stationary multi-armed bandit to adaptively allocate perturbation budget across layers in zeroth-order optimization and applies inverse probability weighting to reduce variance while preserving unbiased gradients, delivering 1.7x-3.0x wall-clock speedup on LLaMA and OPT models.

Towards Faster Language Model Inference Using Mixture-of-Experts Flow Matching

cs.AI · 2026-04-16 · unverdicted · novelty 6.0

Mixture-of-experts flow matching enables non-autoregressive language models to achieve autoregressive-level quality in three sampling steps, delivering up to 1000x faster inference than diffusion models.

Dataset-Level Metrics Attenuate Non-Determinism: A Fine-Grained Non-Determinism Evaluation in Diffusion Language Models

cs.LG · 2026-04-15 · unverdicted · novelty 6.0

Dataset-level metrics in diffusion language models mask substantial sample-level non-determinism that varies with model and system factors, which a new Factor Variance Attribution metric can decompose.

AFGNN: API Misuse Detection using Graph Neural Networks and Clustering

cs.SE · 2026-04-09 · unverdicted · novelty 6.0

AFGNN detects API misuses in Java code more effectively than prior methods by representing usage as graphs and clustering learned embeddings from self-supervised training.

Muon is Scalable for LLM Training

cs.LG · 2025-02-24 · unverdicted · novelty 6.0

Muon optimizer with weight decay and update scaling achieves ~2x efficiency over AdamW for large LLMs, shown via the Moonlight 3B/16B MoE model trained on 5.7T tokens.

MiniCPM: Unveiling the Potential of Small Language Models with Scalable Training Strategies

cs.CL · 2024-04-09 · conditional · novelty 6.0

MiniCPM 1.2B and 2.4B models reach parity with 7B-13B LLMs via model wind-tunnel scaling and a WSD scheduler that yields a higher optimal data-to-model ratio than Chinchilla scaling.

Are We on the Right Way for Evaluating Large Vision-Language Models?

cs.CV · 2024-03-29 · conditional · novelty 6.0

Current LVLM benchmarks overestimate capabilities because many questions can be answered without images due to design flaws or data leakage; MMStar is a human-curated set of 1,500 vision-indispensable samples across 6 capabilities and 18 axes with new metrics for leakage and true multi-modal gain.

LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code

cs.SE · 2024-03-12 · unverdicted · novelty 6.0

LiveCodeBench collects 400 recent contest problems to create a contamination-free benchmark evaluating LLMs on code generation and related capabilities like self-repair and execution.

Handling and Interpreting Missing Modalities in Patient Clinical Trajectories via Autoregressive Sequence Modeling

cs.LG · 2026-04-20 · unverdicted · novelty 5.0

Autoregressive transformer modeling with missingness-aware contrastive pre-training outperforms baselines on MIMIC-IV and eICU benchmarks and mitigates divergent behavior from removed modalities in clinical trajectories.

citing papers explorer

Showing 39 of 39 citing papers.

Large Language Diffusion Models cs.CL · 2025-02-14 · unverdicted · none · ref 33 · internal anchor
LLaDA is a scalable diffusion-based language model that matches autoregressive LLMs like LLaMA3 8B on tasks and surpasses GPT-4o on reversal poem completion.
IndicMedDialog: A Parallel Multi-Turn Medical Dialogue Dataset for Accessible Healthcare in Indic Languages cs.CL · 2026-05-13 · unverdicted · none · ref 33 · internal anchor
A parallel multi-turn medical dialogue dataset spanning English and nine Indic languages is created from synthetic consultations to enable personalized AI healthcare interactions.
Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation cs.CV · 2024-06-10 · conditional · none · ref 3 · internal anchor
Scaled vanilla autoregressive models based on Llama achieve 2.18 FID on ImageNet 256x256 image generation, beating popular diffusion models without visual inductive biases.
DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cs.CL · 2024-05-07 · unverdicted · none · ref 13 · internal anchor
DeepSeek-V2 delivers top-tier open-source LLM performance using only 21B active parameters by compressing the KV cache 93.3% and cutting training costs 42.5% via MLA and DeepSeekMoE.
SAGE: Scalable Automated Robustness Augmentation for LLM Knowledge Evaluation cs.CL · 2026-05-12 · unverdicted · none · ref 29 · internal anchor
SAGE trains a rubric-based verifier and an RL-optimized generator on seed human data to scalably augment LLM knowledge benchmarks, matching human-annotated quality on HellaSwag at lower cost and generalizing to MMLU.
Causal Bias Detection in Generative Artifical Intelligence cs.AI · 2026-05-12 · unverdicted · none · ref 7 · internal anchor
A causal framework unifies fairness analysis across generative AI and standard ML by deriving decompositions that separate biases along causal pathways and differences between real-world and model mechanisms.
Navigating the Sea of LLM Evaluation: Investigating Bias in Toxicity Benchmarks cs.AI · 2026-05-11 · unverdicted · none · ref 5 · internal anchor
Toxicity benchmarks for LLMs produce inconsistent results when task type, input domain, or model changes, revealing intrinsic evaluation biases.
Training continuously-coupled reconfigurable photonic chips with quantum machine learning quant-ph · 2026-05-11 · unverdicted · none · ref 36 · internal anchor
A black-box machine learning technique trains continuously-coupled photonic waveguide arrays to implement target unitaries using limited single- and two-photon measurements without requiring detailed internal models.
Predicting Large Model Test Losses with a Noisy Quadratic System cs.LG · 2026-05-09 · unverdicted · none · ref 9 · internal anchor
A noisy quadratic system predicts large model test losses from N, B, K and outperforms Chinchilla's model for extrapolation up to 1000x compute.
DSPE: An Energy-Efficient Edge Processor for DeepSeek Inference with MerkleTree-based Incremental Pruning, Multi-Stage Boothing Lookup and Dynamic Adaptive Posit Processing cs.AR · 2026-05-09 · unverdicted · none · ref 1 · internal anchor
DSPE is an edge processor that achieves 109.4 TFLOPS/W for DeepSeek inference using Merkle tree-based incremental pruning, multi-stage boothing lookup, and dynamic adaptive posit processing.
RELO: Reinforcement Learning to Localize for Visual Object Tracking cs.CV · 2026-05-08 · unverdicted · none · ref 67 · internal anchor
RELO replaces handcrafted spatial priors with a reinforcement learning policy for target localization in visual tracking and reports 57.5% AUC on LaSOText without template updates.
Why Does Agentic Safety Fail to Generalize Across Tasks? cs.LG · 2026-05-07 · conditional · none · ref 16 · internal anchor
Agentic safety fails to generalize across tasks because the task-to-safe-controller mapping has a higher Lipschitz constant than the task-to-controller mapping alone, as proven in linear-quadratic control and demonstrated in quadcopter and LLM experiments.
InfoLaw: Information Scaling Laws for Large Language Models with Quality-Weighted Mixture Data and Repetition cs.CL · 2026-05-04 · unverdicted · none · ref 40 · internal anchor
InfoLaw models pretraining as information accumulation where quality sets information density and repetition causes scale-dependent diminishing returns, predicting loss with low error on unseen mixtures and larger scales up to 7B models and 425B tokens.
Rethinking LLM Ensembling from the Perspective of Mixture Models cs.LG · 2026-05-01 · unverdicted · none · ref 6 · internal anchor
ME reinterprets LLM ensembling as a mixture model by sampling a single model stochastically at each token step, matching the ensemble distribution while invoking only one model per step for substantial speed gains.
ReaGeo: Reasoning-Enhanced End-to-End Geocoding with LLMs cs.AI · 2026-04-23 · unverdicted · none · ref 40 · internal anchor
ReaGeo is an end-to-end LLM framework for geocoding that uses geohash text generation, Chain-of-Thought spatial reasoning, and distance-based RL to accurately predict points and regions from explicit and vague queries.
Universally Empowering Zeroth-Order Optimization via Adaptive Layer-wise Sampling cs.LG · 2026-04-20 · unverdicted · none · ref 48 · internal anchor
AdaLeZO uses a non-stationary multi-armed bandit to adaptively allocate perturbation budget across layers in zeroth-order optimization and applies inverse probability weighting to reduce variance while preserving unbiased gradients, delivering 1.7x-3.0x wall-clock speedup on LLaMA and OPT models.
Towards Faster Language Model Inference Using Mixture-of-Experts Flow Matching cs.AI · 2026-04-16 · unverdicted · none · ref 3 · internal anchor
Mixture-of-experts flow matching enables non-autoregressive language models to achieve autoregressive-level quality in three sampling steps, delivering up to 1000x faster inference than diffusion models.
Dataset-Level Metrics Attenuate Non-Determinism: A Fine-Grained Non-Determinism Evaluation in Diffusion Language Models cs.LG · 2026-04-15 · unverdicted · none · ref 3 · internal anchor
Dataset-level metrics in diffusion language models mask substantial sample-level non-determinism that varies with model and system factors, which a new Factor Variance Attribution metric can decompose.
AFGNN: API Misuse Detection using Graph Neural Networks and Clustering cs.SE · 2026-04-09 · unverdicted · none · ref 15 · internal anchor
AFGNN detects API misuses in Java code more effectively than prior methods by representing usage as graphs and clustering learned embeddings from self-supervised training.
Muon is Scalable for LLM Training cs.LG · 2025-02-24 · unverdicted · none · ref 93 · internal anchor
Muon optimizer with weight decay and update scaling achieves ~2x efficiency over AdamW for large LLMs, shown via the Moonlight 3B/16B MoE model trained on 5.7T tokens.
MiniCPM: Unveiling the Potential of Small Language Models with Scalable Training Strategies cs.CL · 2024-04-09 · conditional · none · ref 6 · internal anchor
MiniCPM 1.2B and 2.4B models reach parity with 7B-13B LLMs via model wind-tunnel scaling and a WSD scheduler that yields a higher optimal data-to-model ratio than Chinchilla scaling.
Are We on the Right Way for Evaluating Large Vision-Language Models? cs.CV · 2024-03-29 · conditional · none · ref 3 · internal anchor
Current LVLM benchmarks overestimate capabilities because many questions can be answered without images due to design flaws or data leakage; MMStar is a human-curated set of 1,500 vision-indispensable samples across 6 capabilities and 18 axes with new metrics for leakage and true multi-modal gain.
LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code cs.SE · 2024-03-12 · unverdicted · none · ref 245 · internal anchor
LiveCodeBench collects 400 recent contest problems to create a contamination-free benchmark evaluating LLMs on code generation and related capabilities like self-repair and execution.
Handling and Interpreting Missing Modalities in Patient Clinical Trajectories via Autoregressive Sequence Modeling cs.LG · 2026-04-20 · unverdicted · none · ref 39 · internal anchor
Autoregressive transformer modeling with missingness-aware contrastive pre-training outperforms baselines on MIMIC-IV and eICU benchmarks and mitigates divergent behavior from removed modalities in clinical trajectories.
Why Do Vision Language Models Struggle To Recognize Human Emotions? cs.CV · 2026-04-16 · unverdicted · none · ref 8 · internal anchor
VLMs fail at dynamic facial expression recognition because web-scale pretraining exacerbates long-tailed class bias and sparse frame sampling misses micro-expressions; a multi-stage context enrichment strategy using language summaries of skipped frames is proposed to mitigate this.
Reconstruction of a 3D wireframe from a single line drawing via generative depth estimation cs.CV · 2026-04-15 · unverdicted · none · ref 1 · internal anchor
A latent diffusion model conditioned on line drawings estimates dense depth to reconstruct 3D wireframes, reporting 5.3% average depth error after training on over one million pairs.
The Cognitive Circuit Breaker: A Systems Engineering Framework for Intrinsic AI Reliability cs.SE · 2026-04-15 · unverdicted · none · ref 6 · internal anchor
The Cognitive Circuit Breaker detects LLM hallucinations by computing the Cognitive Dissonance Delta between semantic confidence and latent certainty from hidden states, adding negligible overhead.
RefineRAG: Word-Level Poisoning Attacks via Retriever-Guided Text Refinement cs.CR · 2026-04-08 · unverdicted · none · ref 6 · internal anchor
RefineRAG achieves 90% attack success on NQ by generating toxic seeds then optimizing them via retriever-in-the-loop word refinement, outperforming prior methods on effectiveness and naturalness.
Towards Reasoning Era: A Survey of Long Chain-of-Thought for Reasoning Large Language Models cs.AI · 2025-03-12 · unverdicted · none · ref 54 · internal anchor
The paper unifies perspectives on Long CoT in reasoning LLMs by introducing a taxonomy, detailing characteristics of deep reasoning and reflection, and discussing emergence phenomena and future directions.
DeepSeek-Coder: When the Large Language Model Meets Programming -- The Rise of Code Intelligence cs.SE · 2024-01-25 · unverdicted · none · ref 7 · internal anchor
DeepSeek-Coder open-source models trained on 2T code tokens with fill-in-the-blank pretraining achieve SOTA results among open models and surpass closed-source Codex and GPT-3.5 on code benchmarks.
DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models cs.CL · 2024-01-11 · unverdicted · none · ref 130 · internal anchor
DeepSeekMoE 2B matches GShard 2.9B performance and approaches a dense 2B model; the 16B version matches LLaMA2-7B at 40% compute by using fine-grained expert segmentation plus shared experts.
Reinforcement Learning for Compositional Generalization with Outcome-Level Optimization cs.LG · 2026-05-06 · unverdicted · none · ref 31 · internal anchor
Outcome-level RL with binary or composite rewards improves compositional generalization over supervised fine-tuning by avoiding overfitting to frequent training patterns.
Agentic Application in Power Grid Static Analysis: Automatic Code Generation and Error Correction eess.SY · 2026-04-11 · unverdicted · none · ref 9 · internal anchor
An LLM agent with static pre-check, dynamic feedback, and semantic validation generates MATPOWER code from natural language for power grid analysis at 82.38% fidelity.
Identifying Topological Invariants of Non-Hermitian Systems via Domain-Adaptive Multimodal Model for Mathematics cond-mat.other · 2026-04-08 · unverdicted · none · ref 56 · internal anchor
A multimodal model with Qwen Math backbone identifies topological invariants of non-Hermitian systems from eigenvalues and eigenvectors in momentum space.
How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites cs.CV · 2024-04-25 · unverdicted · none · ref 8 · internal anchor
InternVL 1.5 narrows the performance gap to proprietary multimodal models via a stronger transferable vision encoder, dynamic high-resolution tiling, and curated English-Chinese training data.
DeepSeek-VL: Towards Real-World Vision-Language Understanding cs.AI · 2024-03-08 · unverdicted · none · ref 8 · internal anchor
DeepSeek-VL develops open-source 1.3B and 7B vision-language models that achieve competitive or state-of-the-art results on real-world visual-language benchmarks through diverse data curation, a hybrid vision encoder, and pretraining that preserves language capabilities.
TinyLlama: An Open-Source Small Language Model cs.CL · 2024-01-04 · accept · none · ref 5 · internal anchor
TinyLlama is a 1.1B-parameter open-source language model pretrained on 1 trillion tokens that outperforms other open-source models of similar size on downstream tasks.
Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling cs.AI · 2025-01-29 · conditional · none · ref 3 · internal anchor
Scaling data, model size, and training optimization on the Janus architecture yields better multimodal understanding and more stable, instruction-following text-to-image generation.
A Survey on Large Language Models for Code Generation cs.CL · 2024-06-01 · unverdicted · none · ref 26 · internal anchor
A systematic literature review that organizes recent work on LLMs for code generation into a taxonomy covering data curation, model advances, evaluations, ethics, environmental impact, and applications, with benchmark comparisons.
