DeepSeek LLM: Scaling Open-Source Language Models with Longtermism
Pith reviewed 2026-05-11 06:03 UTC · model grok-4.3
The pith
DeepSeek LLM 67B surpasses LLaMA-2 70B on code, mathematics and reasoning benchmarks, with its chat version exceeding GPT-3.5 in open-ended evaluations.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Guided by our distinctive findings on scaling laws, we train DeepSeek LLM base models in 7B and 67B configurations on a dataset that currently consists of 2 trillion tokens and is continuously expanding. We further apply supervised fine-tuning and direct preference optimization to produce DeepSeek Chat models. Evaluation shows that DeepSeek LLM 67B surpasses LLaMA-2 70B across various benchmarks with particular strength in code, mathematics and reasoning, while open-ended evaluations indicate that DeepSeek LLM 67B Chat exhibits superior performance compared to GPT-3.5.
What carries the argument
Distinctive findings on scaling laws that guide effective training in 7B and 67B sizes, implemented through a continuously expanding 2 trillion token dataset plus supervised fine-tuning and direct preference optimization.
If this is right
- DeepSeek LLM 67B records higher scores than LLaMA-2 70B on standard benchmarks, especially those involving code, mathematics and reasoning.
- The 67B chat model achieves better results than GPT-3.5 when evaluated on open-ended tasks.
- The same scaling approach with ongoing data growth can be applied to produce further improvements in open-source models at these sizes.
- Long-term expansion of the training dataset supports continued progress without requiring changes to the core training configuration.
Where Pith is reading between the lines
- If the identified scaling patterns persist, further growth of the token dataset beyond the current 2 trillion could yield additional performance lifts in the same model sizes.
- Open projects following this data-first, long-horizon route may gradually close capability gaps with closed models on reasoning-heavy tasks.
- Re-running the comparisons on entirely new benchmark suites would test whether the observed advantages generalize beyond the reported set.
- The emphasis on sustained data collection could encourage similar multi-year efforts in other open-source language-model initiatives.
Load-bearing premise
The selected benchmarks and open-ended evaluations measure genuine model capability without undisclosed overlap in training data or advantages in methodology.
What would settle it
Independent re-testing on a fresh set of benchmarks withheld from the original evaluation that shows DeepSeek LLM 67B no longer outperforming LLaMA-2 70B or its chat version no longer exceeding GPT-3.5.
Original abstract
The rapid development of open-source large language models (LLMs) has been truly remarkable. However, the scaling law described in previous literature presents varying conclusions, which casts a dark cloud over scaling LLMs. We delve into the study of scaling laws and present our distinctive findings that facilitate scaling of large scale models in two commonly used open-source configurations, 7B and 67B. Guided by the scaling laws, we introduce DeepSeek LLM, a project dedicated to advancing open-source language models with a long-term perspective. To support the pre-training phase, we have developed a dataset that currently consists of 2 trillion tokens and is continuously expanding. We further conduct supervised fine-tuning (SFT) and Direct Preference Optimization (DPO) on DeepSeek LLM Base models, resulting in the creation of DeepSeek Chat models. Our evaluation results demonstrate that DeepSeek LLM 67B surpasses LLaMA-2 70B on various benchmarks, particularly in the domains of code, mathematics, and reasoning. Furthermore, open-ended evaluations reveal that DeepSeek LLM 67B Chat exhibits superior performance compared to GPT-3.5.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces DeepSeek LLM, an open-source project focused on long-term scaling of LLMs. It reports empirical studies of scaling laws for 7B and 67B models, describes pre-training a base model on a 2-trillion-token dataset that continues to grow, and applies supervised fine-tuning (SFT) followed by Direct Preference Optimization (DPO) to create chat variants. The central empirical claims are that DeepSeek LLM 67B outperforms LLaMA-2 70B on code, mathematics, and reasoning benchmarks, and that the 67B Chat model shows superior performance to GPT-3.5 in open-ended evaluations.
Significance. If the benchmark results hold under scrutiny, the work is significant for advancing reproducible open-source LLMs by releasing competitive 67B-scale models trained with explicit long-term data scaling. The inclusion of scaling-law experiments, dataset construction details, and decontamination protocols in the methods section provides a useful reference for the community and supports the reported performance deltas.
major comments (2)
- [Evaluation] Open-ended evaluation section: the claim that DeepSeek LLM 67B Chat exhibits superior performance to GPT-3.5 rests on unspecified details of the evaluation protocol (prompting strategy, judge model or human raters, and any agreement metrics). Without these, the result cannot be independently verified and is load-bearing for the chat-model contribution.
- [§5] Benchmark results (tables in §5): while decontamination steps are described, the paper does not report the fraction of test-set overlap removed or provide before/after scores; this leaves open the possibility that domain-specific gains (code/math) partly reflect data leakage rather than model capability.
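The agreement metrics requested in the first major comment can take several forms. One common choice is Cohen's kappa between two raters, for example a judge model and a human annotator. The sketch below is illustrative only: the function name, the label set ("A", "B", "tie"), and the verdicts are hypothetical and do not come from the paper.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    # Observed agreement: fraction of items where both raters agree.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each rater's marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (p_o - p_e) / (1.0 - p_e)

# Hypothetical verdicts: which system each rater preferred per prompt.
judge_model = ["A", "A", "B", "tie", "A", "B", "B", "A"]
human_rater = ["A", "B", "B", "tie", "A", "B", "A", "A"]
kappa = cohens_kappa(judge_model, human_rater)
print(f"kappa = {kappa:.3f}")
```

Kappa corrects raw agreement for the agreement expected by chance, so it is more informative than a plain match rate when one verdict dominates.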
minor comments (3)
- [Scaling Laws] Figure captions for scaling curves should explicitly list the fitted exponents and any confidence intervals; current plots are difficult to reproduce from the text alone.
- [Title] The title uses 'longtermism' without defining it; a one-sentence gloss in the abstract would improve accessibility for readers outside the immediate subfield.
- [Evaluation] Several benchmark tables lack standard deviations or number of runs; adding these would strengthen the statistical interpretation of the reported deltas over LLaMA-2.
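To make the request for fitted exponents concrete, a power law of the form y = a * x^b can be fitted by ordinary least squares in log-log space. This is a minimal sketch under stated assumptions: the functional form, the function name, and the synthetic points are illustrative, not the paper's fitting procedure or data.

```python
import math

def fit_power_law(xs, ys):
    """Least-squares fit of y = a * x**b in log-log space; returns (a, b)."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need at least two (x, y) pairs")
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    mx, my = sum(lx) / len(lx), sum(ly) / len(ly)
    sxx = sum((u - mx) ** 2 for u in lx)
    sxy = sum((u - mx) * (v - my) for u, v in zip(lx, ly))
    b = sxy / sxx              # slope in log-log space is the exponent
    a = math.exp(my - b * mx)  # intercept gives the prefactor
    return a, b

# Synthetic points generated from y = 3 * x**-0.5 (not data from the paper).
compute = [1e17, 1e18, 1e19, 1e20]
loss = [3.0 * c ** -0.5 for c in compute]
a, b = fit_power_law(compute, loss)
print(f"prefactor = {a:.3f}, exponent = {b:.3f}")
```

On real measurements the points would not lie exactly on a line, so a confidence interval for the exponent would need an extra step such as bootstrapping over the fitted points.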
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and positive overall assessment. We address each major comment below, indicating where revisions will be made to enhance verifiability and transparency.
-
Referee: [Evaluation] Open-ended evaluation section: the claim that DeepSeek LLM 67B Chat exhibits superior performance to GPT-3.5 rests on unspecified details of the evaluation protocol (prompting strategy, judge model or human raters, and any agreement metrics). Without these, the result cannot be independently verified and is load-bearing for the chat-model contribution.
Authors: We agree that full specification of the evaluation protocol is necessary for independent verification of the open-ended results. In the revised manuscript we will expand the relevant section to detail the prompting strategy, the judge model employed, the involvement of human raters (if any), and quantitative agreement metrics such as inter-rater reliability scores. These additions will directly support the claim of superior performance relative to GPT-3.5.
Revision: yes
-
Referee: [§5] Benchmark results (tables in §5): while decontamination steps are described, the paper does not report the fraction of test-set overlap removed or provide before/after scores; this leaves open the possibility that domain-specific gains (code/math) partly reflect data leakage rather than model capability.
Authors: We acknowledge the value of quantifying the decontamination impact. We will revise the methods and results sections to report the fraction of test-set overlap removed for each benchmark category. However, providing complete before/after benchmark scores would require retraining the 67B model on the full 2-trillion-token corpus without decontamination, which is computationally prohibitive. We will instead clarify that the described decontamination procedure was applied uniformly and that performance advantages appear consistently across diverse benchmarks.
Revision: partial
- Provision of before/after benchmark scores comparing models trained with and without decontamination, due to the prohibitive computational cost of retraining at 2-trillion-token scale.
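An overlap fraction of the kind discussed in this exchange could be estimated without any retraining. The sketch below flags a test item when it shares at least one word-level n-gram with the training corpus. The whitespace tokenization, the default n = 10, and all names and example strings are assumptions for illustration; the paper's actual decontamination rule may differ.

```python
def ngrams(text, n):
    """Set of word-level n-grams in a lower-cased, whitespace-split text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def contaminated_fraction(test_items, corpus_docs, n=10):
    """Fraction of test items sharing at least one n-gram with the corpus."""
    corpus_grams = set()
    for doc in corpus_docs:
        corpus_grams |= ngrams(doc, n)
    if not test_items:
        return 0.0
    hits = sum(1 for item in test_items if ngrams(item, n) & corpus_grams)
    return hits / len(test_items)

# Hypothetical toy data; n=3 is used only because the strings are short.
test_items = [
    "what is two plus two equals four",
    "name the capital city of france",
]
corpus_docs = [
    "a forum post says two plus two equals four every time",
    "unrelated text about cooking pasta at home",
]
fraction = contaminated_fraction(test_items, corpus_docs, n=3)
print(f"flagged fraction = {fraction:.2f}")
```

Holding the whole corpus n-gram set in memory is only workable for small samples; at trillion-token scale a hashed or sharded index would be needed instead.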
Circularity Check
No significant circularity; empirical benchmark results are self-contained
The paper's core claims consist of observed performance deltas on external benchmarks (code, math, reasoning, open-ended chat) after training a 67B model on an expanding 2T-token corpus followed by SFT+DPO. Scaling-law experiments are described as guiding dataset and model choices but do not reduce any reported result to a fitted parameter renamed as a prediction; the evaluation protocols, decontamination steps, and few-shot settings are stated explicitly and independently of the final scores. No self-definitional equations, load-bearing self-citations, or ansatz smuggling appear in the derivation chain.
Axiom & Free-Parameter Ledger
free parameters (2)
- model scale
- pre-training data volume
axioms (2)
- domain assumption: Scaling laws reliably predict performance gains with increased model size and data
- domain assumption: SFT followed by DPO produces aligned chat models that generalize on benchmarks
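For readers unfamiliar with the DPO step named in the second axiom, the per-pair objective from the Direct Preference Optimization formulation is the negative log-sigmoid of a scaled log-ratio margin between a chosen and a rejected response. The sketch below is a plain-Python illustration. The function name, the default beta = 0.1, and the log-probability values are hypothetical and are not taken from the paper's training setup.

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Per-pair DPO loss from summed sequence log-probabilities."""
    # Implicit reward margin: how much more the policy favours the chosen
    # response over the rejected one, relative to the reference model.
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # -log(sigmoid(margin)), written as softplus(-margin) for stability.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

# Hypothetical log-probabilities for one preference pair.
neutral = dpo_loss(-10.0, -10.0, -10.0, -10.0)   # policy equals reference
improved = dpo_loss(-5.0, -15.0, -10.0, -10.0)   # policy prefers chosen
print(f"neutral = {neutral:.4f}, improved = {improved:.4f}")
```

The loss starts at log 2 when the policy matches the reference and falls as the policy shifts probability toward the preferred response.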
Lean theorems connected to this paper
-
IndisputableMonolith.Foundation.DAlembert.Inevitability · bilinear_family_forced
unclear: Relation between the paper passage and the cited Recognition theorem.
Our evaluation results demonstrate that DeepSeek LLM 67B surpasses LLaMA-2 70B on various benchmarks, particularly in the domains of code, mathematics, and reasoning.
-
IndisputableMonolith.Foundation.PhiForcing · phi_equation
unclear: Relation between the paper passage and the cited Recognition theorem.
Guided by the scaling laws, we introduce DeepSeek LLM, a project dedicated to advancing open-source language models with a long-term perspective.
-
IndisputableMonolith.Foundation.LedgerForcing · conservation_from_balance
unclear: Relation between the paper passage and the cited Recognition theorem.
To support the pre-training phase, we have developed a dataset that currently consists of 2 trillion tokens
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 60 Pith papers
-
Large Language Diffusion Models
LLaDA is a scalable diffusion-based language model that matches autoregressive LLMs like LLaMA3 8B on tasks and surpasses GPT-4o on reversal poem completion.
-
Structured-Sparse Attention for Entity Tracking with Subquadratic Sequence Complexity
Derives a blockwise resolvent-style attention operator that exploits structured sparsity for subquadratic O(n^{4/3}d) entity tracking while matching dense accuracy.
-
Peak-Detector: Explainable Peak Detection via Instruction-Tuned Large Language Models in Physiological Sign
Peak-Detector uses instruction-tuned LLMs and a condensed peak-representation of time-series data to achieve robust cross-modal peak detection with self-generated explanations across ECG, PPG, BCG, and BSG signals.
-
IndicMedDialog: A Parallel Multi-Turn Medical Dialogue Dataset for Accessible Healthcare in Indic Languages
A parallel multi-turn medical dialogue dataset spanning English and nine Indic languages is created from synthetic consultations to enable personalized AI healthcare interactions.
-
Causal Bias Detection in Generative Artificial Intelligence
Develops a causal framework unifying generative AI fairness with standard ML, with new decompositions, identification conditions, and estimators demonstrated on LLM race and gender bias.
-
ENEC: A Lossless AI Model Compression Method Enabling Fast Inference on Ascend NPUs
ENEC delivers 3.43X higher throughput than DietGPU and 1.12X better compression ratio than nvCOMP for lossless model weight compression on Ascend NPUs, yielding up to 6.3X end-to-end inference speedup.
-
The Stepwise Informativeness Assumption: Why are Entropy Dynamics and Reasoning Correlated in LLMs?
The Stepwise Informativeness Assumption explains the correlation between LLM entropy dynamics and reasoning correctness by positing that correct traces accumulate answer-relevant information stepwise during generation.
-
Obliviator Reveals the Cost of Nonlinear Guardedness in Concept Erasure
Obliviator introduces an iterative kernel-based optimization for nonlinear concept erasure that quantifies the utility cost of guarding against nonlinear adversaries and outperforms prior methods on trade-off curves.
-
UniGeoSeg: Towards Unified Open-World Segmentation for Geospatial Scenes
UniGeoSeg releases the first million-scale dataset for instruction-driven remote sensing segmentation and a unified model that achieves state-of-the-art results with strong zero-shot generalization.
-
Assessing Coherency and Consistency of Code Execution Reasoning by Large Language Models
LLMs achieve 81% coherent execution simulation on HumanEval but show mostly random or weak consistency across tests, with frontier models relying on natural language shortcuts instead of true program analysis.
-
Seeing Isn't Orienting: A Cognitively Grounded Benchmark Reveals Systematic Orientation Failures in MLLMs
DORI benchmark shows top vision-language models reach only 54.2% accuracy on coarse orientation tasks and 33% on granular judgments, with sharp drops on reference-frame shifts and compound rotations.
-
Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation
Janus decouples visual encoding into task-specific pathways inside a single autoregressive transformer to unify multimodal understanding and generation while outperforming earlier unified models.
-
Task Prompt Vectors: Effective Initialization through Multi-Task Soft-Prompt Transfer
Task prompt vectors, formed by subtracting random initialization from tuned soft prompts, support low-resource initialization and arithmetic combination across tasks on 12 NLU datasets while remaining independent of i...
-
Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation
Scaled vanilla autoregressive models based on Llama achieve 2.18 FID on ImageNet 256x256 image generation, beating popular diffusion models without visual inductive biases.
-
DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model
DeepSeek-V2 delivers top-tier open-source LLM performance using only 21B active parameters by compressing the KV cache 93.3% and cutting training costs 42.5% via MLA and DeepSeekMoE.
-
CodeMind: Evaluating Large Language Models for Code Reasoning
CodeMind evaluates ten LLMs on four benchmarks using three new code reasoning tasks, finding performance varies by model size and drops with complexity while showing no correlation with bug repair ability.
-
Quantifying Hyperparameter Transfer and the Importance of Embedding Layer Learning Rate
A framework quantifies hyperparameter transfer via scaling-law fit quality, extrapolation robustness, and loss penalty, with ablations showing that μP's advantage over standard parameterization stems from maximizing t...
-
Reading Calibrated Uncertainty from Language Model Trajectories
Geometric features from per-layer MLP update trajectories fed to a sparse linear probe outperform maximum softmax probability for uncertainty quantification under selective abstention, with gains up to 21 AURC points.
-
Contextualized Code Pretraining for Code Generation
Introduces contextualized code pretraining with caller-callee pairs from static analysis to train CallerGen models that outperform baselines on the new CallerEval benchmark.
-
SEED: Targeted Data Selection by Weighted Independent Set
SEED models data selection as Weighted Independent Set on a similarity graph, using node value calibration and local scale normalization to produce compact high-quality training subsets that outperform prior methods o...
-
Know When To Fold 'Em: Token-Efficient LLM Synthetic Data Generation via Multi-Stage In-Flight Rejection
MSIFR stops faulty LLM generations early via staged rule-based checks, reducing token consumption 11-78% with no accuracy loss.
-
PEML: Parameter-efficient Multi-Task Learning with Optimized Continuous Prompts
PEML co-optimizes continuous prompts and low-rank adaptations to deliver up to 6.67% average accuracy gains over existing multi-task PEFT methods on GLUE, SuperGLUE, and other benchmarks.
-
SAGE: Scalable Automated Robustness Augmentation for LLM Knowledge Evaluation
SAGE trains a rubric-based verifier and an RL-optimized generator on seed human data to scalably augment LLM knowledge benchmarks, matching human-annotated quality on HellaSwag at lower cost and generalizing to MMLU.
-
Causal Bias Detection in Generative Artificial Intelligence
A causal framework unifies fairness analysis across generative AI and standard ML by deriving decompositions that separate biases along causal pathways and differences between real-world and model mechanisms.
-
Navigating the Sea of LLM Evaluation: Investigating Bias in Toxicity Benchmarks
Toxicity benchmarks for LLMs produce inconsistent results when task type, input domain, or model changes, revealing intrinsic evaluation biases.
-
Training continuously-coupled reconfigurable photonic chips with quantum machine learning
A black-box machine learning technique trains continuously-coupled photonic waveguide arrays to implement target unitaries using limited single- and two-photon measurements without requiring detailed internal models.
-
Predicting Large Model Test Losses with a Noisy Quadratic System
A noisy quadratic system predicts large model test losses from N, B, K and outperforms Chinchilla's model for extrapolation up to 1000x compute.
-
DSPE: An Energy-Efficient Edge Processor for DeepSeek Inference with MerkleTree-based Incremental Pruning, Multi-Stage Boothing Lookup and Dynamic Adaptive Posit Processing
DSPE is an edge processor that achieves 109.4 TFLOPS/W for DeepSeek inference using Merkle tree-based incremental pruning, multi-stage boothing lookup, and dynamic adaptive posit processing.
-
RELO: Reinforcement Learning to Localize for Visual Object Tracking
RELO formulates visual object tracking localization as a Markov decision process solved by reinforcement learning with combined IoU and AUC rewards, augmented by layer-aligned temporal token propagation, and reports 5...
-
RELO: Reinforcement Learning to Localize for Visual Object Tracking
RELO replaces handcrafted spatial priors with a reinforcement learning policy for target localization in visual tracking and reports 57.5% AUC on LaSOT_ext without template updates.
-
Why Does Agentic Safety Fail to Generalize Across Tasks?
Agentic safety fails to generalize across tasks because the task-to-safe-controller mapping has a higher Lipschitz constant than the task-to-controller mapping alone, as proven in linear-quadratic control and demonstr...
-
InfoLaw: Information Scaling Laws for Large Language Models with Quality-Weighted Mixture Data and Repetition
InfoLaw models pretraining as information accumulation where quality sets information density and repetition causes scale-dependent diminishing returns, predicting loss with low error on unseen mixtures and larger sca...
-
Rethinking LLM Ensembling from the Perspective of Mixture Models
ME reinterprets LLM ensembling as a mixture model by sampling a single model stochastically at each token step, matching the ensemble distribution while invoking only one model per step for substantial speed gains.
-
ReaGeo: Reasoning-Enhanced End-to-End Geocoding with LLMs
ReaGeo is an end-to-end LLM framework for geocoding that uses geohash text generation, Chain-of-Thought spatial reasoning, and distance-based RL to accurately predict points and regions from explicit and vague queries.
-
Universally Empowering Zeroth-Order Optimization via Adaptive Layer-wise Sampling
AdaLeZO uses a non-stationary multi-armed bandit to adaptively allocate perturbation budget across layers in zeroth-order optimization and applies inverse probability weighting to reduce variance while preserving unbi...
-
Towards Faster Language Model Inference Using Mixture-of-Experts Flow Matching
Mixture-of-experts flow matching enables non-autoregressive language models to achieve autoregressive-level quality in three sampling steps, delivering up to 1000x faster inference than diffusion models.
-
Dataset-Level Metrics Attenuate Non-Determinism: A Fine-Grained Non-Determinism Evaluation in Diffusion Language Models
Dataset-level metrics in diffusion language models mask substantial sample-level non-determinism that varies with model and system factors, which a new Factor Variance Attribution metric can decompose.
-
AFGNN: API Misuse Detection using Graph Neural Networks and Clustering
AFGNN detects API misuses in Java code more effectively than prior methods by representing usage as graphs and clustering learned embeddings from self-supervised training.
-
RAT+: Train Dense, Infer Sparse -- Recurrence Augmented Attention for Dilated Inference
RAT+ pretrains a dense recurrent-augmented attention model once and enables flexible switching to dilated or hybrid sparse attention at inference after short adaptation, with small accuracy loss at high dilation factors.
-
RAT+: Train Dense, Infer Sparse -- Recurrence Augmented Attention for Dilated Inference
RAT+ pretrains a single dense recurrent-augmented attention model that supports flexible dilated sparse inference after short adaptation, matching dense accuracy at moderate dilation and losing only 1-3 points at high...
-
SWaRL: Safeguard Code Watermarking via Reinforcement Learning
SWaRL trains code LLMs with RL using compiler correctness signals and a confidential verifier reward to embed robust, functionality-preserving watermarks that resist refactoring attacks.
-
Foundation Models for Discovery and Exploration in Chemical Space
MIST models up to 10x larger than prior work, fine-tuned on over 400 structure-property tasks, match or exceed SOTA on benchmarks and demonstrate zero-shot olfactory perception mapping consistent with hyperbolic geometry.
-
MaskPro: Linear-Space Probabilistic Learning for Strict (N:M)-Sparsity on LLMs
MaskPro learns categorical distributions over groups of M weights to generate exact (N:M) sparsity via N-way sampling without replacement and stabilizes training with a moving average tracker of loss residuals.
-
Mixture-of-Experts Can Surpass Dense LLMs Under Strictly Equal Resource
MoE models with activation rates in an optimal region outperform dense LLMs of identical total parameter count, training compute, and data budget, with the optimal region consistent across scales.
-
LLaDA-V: Large Language Diffusion Models with Visual Instruction Tuning
LLaDA-V is a diffusion-based multimodal large language model that reaches competitive or state-of-the-art results on visual instruction tasks while using a non-autoregressive architecture.
-
Extracting memorized pieces of (copyrighted) books from open-weight language models
A new extraction technique applied to 200 books and 14 LLMs finds that memorization of full books is rare except in specific high-capacity models where entire texts can be recovered verbatim.
-
MegaScale-Data: Scaling Dataloader for Multisource Large Foundation Model Training
MegaScale-Data is a distributed data loading system that disaggregates preprocessing and applies auto-partitioning to deliver 4.5x higher end-to-end training throughput and 13.5x lower CPU memory usage for multisource...
-
A Study of LLMs' Preferences for Libraries and Programming Languages
Empirical study of eight LLMs finds overuse of popular libraries like NumPy in up to 45% of unnecessary cases and strong default preference for Python even when suboptimal.
-
Muon is Scalable for LLM Training
Muon optimizer with weight decay and update scaling achieves ~2x efficiency over AdamW for large LLMs, shown via the Moonlight 3B/16B MoE model trained on 5.7T tokens.
-
HuatuoGPT-o1, Towards Medical Complex Reasoning with LLMs
HuatuoGPT-o1 achieves superior medical complex reasoning by using a verifier to curate reasoning trajectories for fine-tuning and then applying RL with verifier-based rewards.
-
Rewarding Progress: Scaling Automated Process Verifiers for LLM Reasoning
Process advantage verifiers trained to predict step-level progress under a distinct prover policy improve LLM reasoning accuracy by over 8% and sample efficiency by 5-6x over outcome reward models.
-
Optimization Hyper-parameter Laws for Large Language Models
Opt-Laws predicts LLM final training loss from LR schedules via SDE-derived convergence and escape features, with 94% Top-2 hit rate on held-out schedules and F1=0.92 for divergence detection.
-
Scaling Synthetic Data Creation with 1,000,000,000 Personas
A curated set of one billion personas enables scalable, diverse synthetic data generation for LLM training across reasoning, instructions, knowledge, NPCs, and tools.
-
DataComp-LM: In search of the next generation of training sets for language models
DCLM-Baseline dataset lets a 7B model reach 64% 5-shot MMLU accuracy after 2.6T tokens, beating prior open-data models by 6.6 points on MMLU with 40% less compute.
-
MiniCPM: Unveiling the Potential of Small Language Models with Scalable Training Strategies
MiniCPM 1.2B and 2.4B models reach parity with 7B-13B LLMs via model wind-tunnel scaling and a WSD scheduler that yields a higher optimal data-to-model ratio than Chinchilla scaling.
-
Are We on the Right Way for Evaluating Large Vision-Language Models?
Current LVLM benchmarks overestimate capabilities because many questions can be answered without images due to design flaws or data leakage; MMStar is a human-curated set of 1,500 vision-indispensable samples across 6...
-
LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code
LiveCodeBench collects 400 recent contest problems to create a contamination-free benchmark evaluating LLMs on code generation and related capabilities like self-repair and execution.
-
DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models
DeepSeekMath 7B reaches 51.7% on MATH via continued pretraining on curated web math data and Group Relative Policy Optimization.
-
MoE-LLaVA: Mixture of Experts for Large Vision-Language Models
MoE-LLaVA applies mixture-of-experts sparsity to LVLMs via MoE-Tuning, delivering LLaVA-1.5-7B level visual understanding and better hallucination resistance with only ~3B active parameters.
-
Adversarial Reframing: A Framework for Targeted Generation in Language Models
THREAT uses coordinated LLMs in an iterative optimization loop to generate jailbreak prompts that achieve higher success rates and lower detection rates than previous methods across tested models and datasets.
Reference graph
Works this paper leans on
-
[2]
Anthropic. Introducing Claude , 2023. URL https://www.anthropic.com/index/introducing-claude
work page 2023
-
[6]
T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. M. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, and D. Amod...
work page 2020
-
[7]
M. Chen, J. Tworek, H. Jun, Q. Yuan, H. P. de Oliveira Pinto, J. Kaplan, H. Edwards, Y. Burda, N. Joseph, G. Brockman, A. Ray, R. Puri, G. Krueger, M. Petrov, H. Khlaaf, G. Sastry, P. Mishkin, B. Chan, S. Gray, N. Ryder, M. Pavlov, A. Power, L. Kaiser, M. Bavarian, C. Winter, P. Tillet, F. P. Such, D. Cummings, M. Plappert, F. Chantzis, E. Barnes, A. Herb...
work page internal anchor Pith review Pith/arXiv arXiv 2021
- [10]
-
[12]
T. Dao. Flash A ttention-2: Faster attention with better parallelism and work partitioning. 2023
work page 2023
-
[13]
T. Dao, D. Y. Fu, S. Ermon, A. Rudra, and C. R \'e . Flash A ttention: Fast and memory-efficient exact attention with IO -awareness. In Advances in Neural Information Processing Systems, 2022
work page 2022
-
[14]
Z. Du, Y. Qian, X. Liu, M. Ding, J. Qiu, Z. Yang, and J. Tang. Glm: General language model pretraining with autoregressive blank infilling. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 320--335, 2022
work page 2022
-
[16]
L. Gao, S. Biderman, S. Black, L. Golding, T. Hoppe, C. Foster, J. Phang, H. He, A. Thite, N. Nabeshima, et al. The Pile : An 800GB dataset of diverse text for language modeling. arXiv preprint arXiv:2101.00027, 2020
work page internal anchor Pith review Pith/arXiv arXiv 2020
-
[17]
An important next step on our AI journey, 2023
Google. An important next step on our AI journey, 2023. URL https://blog.google/technology/ai/bard-google-ai-search-updates/
work page 2023
-
[24]
High-flyer. Hai-llm: 高效且轻量的大模型训练工具, 2023. URL https://www.high-flyer.cn/en/blog/hai-llm
work page 2023
-
[26]
C-eval: A multi-level multi-discipline chinese evaluation suite for foundation models
Y. Huang, Y. Bai, Z. Zhu, J. Zhang, J. Zhang, T. Su, J. Liu, C. Lv, Y. Zhang, J. Lei, et al. C-Eval : A multi-level multi-discipline chinese evaluation suite for foundation models. arXiv preprint arXiv:2305.08322, 2023
-
[27]
Tokenizers : Fast state-of-the-art tokenizers optimized for research and production, 2019
Huggingface Team . Tokenizers : Fast state-of-the-art tokenizers optimized for research and production, 2019. URL https://github.com/huggingface/tokenizers
work page 2019
-
[28]
F. i, M. Suzgun, M. Freitag, X. Wang, S. Srivats, S. Vosoughi, H. W. Chung, Y. Tay, S. Ruder, D. Zhou, D. Das, and J. Wei. Language models are multilingual chain-of-thought reasoners. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023 . OpenReview.net, 2023. URL https://openreview.net/pdf?id=fR3wGCk-IXp
work page 2023
- [29]
-
[33]
V. A. Korthikanti, J. Casper, S. Lym, L. McAfee, M. Andersch, M. Shoeybi, and B. Catanzaro. Reducing activation recomputation in large transformer models. Proceedings of Machine Learning and Systems, 5, 2023
work page 2023
-
[35]
W. Kwon, Z. Li, S. Zhuang, Y. Sheng, L. Zheng, C. H. Yu, J. E. Gonzalez, H. Zhang, and I. Stoica. Efficient memory management for large language model serving with pagedattention. In Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles, 2023
work page 2023
-
[37]
H. Li, Y. Zhang, F. Koto, Y. Yang, H. Zhao, Y. Gong, N. Duan, and T. Baldwin. CMMLU : Measuring massive multitask language understanding in Chinese . arXiv preprint arXiv:2306.09212, 2023
work page internal anchor Pith review arXiv 2023
-
[38]
W. Li, F. Qi, M. Sun, X. Yi, and J. Zhang. Ccpm: A chinese classical poetry matching dataset, 2021
work page 2021
-
[43]
T. Mihaylov, P. Clark, T. Khot, and A. Sabharwal. Can a suit of armor conduct electricity? a new dataset for open book question answering, 2018
work page 2018
-
[44]
D. Narayanan, M. Shoeybi, J. Casper, P. LeGresley, M. Patwary, V. Korthikanti, D. Vainbrand, P. Kashinkunti, J. Bernauer, B. Catanzaro, et al. Efficient large-scale language model training on gpu clusters using megatron-lm. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pages 1--15, 2021
work page 2021
-
[45]
OpenAI. Introducing ChatGPT, 2022. URL https://openai.com/blog/chatgpt
work page 2022
-
[46]
OpenAI. GPT4 technical report. arXiv preprint arXiv:2303.08774, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
- [47]
-
[49]
A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever, et al. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019
work page 2019
-
[50]
R. Rafailov, A. Sharma, E. Mitchell, S. Ermon, C. D. Manning, and C. Finn. Direct preference optimization: Your language model is secretly a reward model. 2023
work page 2023
-
[51]
S. Rajbhandari, J. Rasley, O. Ruwase, and Y. He. Zero: Memory optimizations toward training trillion parameter models. In SC20: International Conference for High Performance Computing, Networking, Storage and Analysis, pages 1--16. IEEE, 2020
work page 2020
-
[52]
K. Sakaguchi, R. L. Bras, C. Bhagavatula, and Y. Choi. Winogrande: An adversarial winograd schema challenge at scale, 2019
work page 2019
-
[53]
C. J. Shallue, J. Lee, J. Antognini, J. Sohl-Dickstein, R. Frostig, and G. E. Dahl. Measuring the effects of data parallelism on neural network training. Journal of Machine Learning Research, 20(112):1--49, 2019
work page 2019
-
[58]
J. Su, M. Ahmed, Y. Lu, S. Pan, W. Bo, and Y. Liu. Roformer: Enhanced transformer with rotary position embedding. Neurocomputing, 568:127063, 2024
work page 2024
-
[59]
K. Sun, D. Yu, D. Yu, and C. Cardie. Investigating prior knowledge for challenging chinese machine reading comprehension, 2019
work page 2019
-
[63]
A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017
work page 2017
-
[65]
J. Wei, X. Wang, D. Schuurmans, M. Bosma, B. Ichter, F. Xia, E. H. Chi, Q. V. Le, and D. Zhou. Chain-of-thought prompting elicits reasoning in large language models. In NeurIPS, 2022. URL http://papers.nips.cc/paper_files/paper/2022/hash/9d5609613524ecf4f15af0f7b31abca4-Abstract-Conference.html
work page 2022
-
[66]
T. Wei, J. Luan, W. Liu, S. Dong, and B. Wang. Cmath: Can your language model pass chinese elementary school math test?, 2023
work page 2023
-
[68]
A. Yang, B. Xiao, B. Wang, B. Zhang, C. Yin, C. Lv, D. Pan, D. Wang, D. Yan, F. Yang, F. Deng, F. Wang, F. Liu, G. Ai, G. Dong, H. Zhao, H. Xu, H. Sun, H. Zhang, H. Liu, J. Ji, J. Xie, J. Dai, K. Fang, L. Su, L. Song, L. Liu, L. Ru, L. Ma, M. Wang, M. Liu, M. Lin, N. Nie, P. Guo, R. Sun, T. Zhang, T. Li, T. Li, W. Cheng, W. Chen, X. Zeng, X. Wang, X. Chen...
work page 2023
-
[71]
B. Zhang and R. Sennrich. Root mean square layer normalization. Advances in Neural Information Processing Systems, 32, 2019
work page 2019
- [72]
-
[74]
L. Zheng, W.-L. Chiang, Y. Sheng, S. Zhuang, Z. Wu, Y. Zhuang, Z. Lin, Z. Li, D. Li, E. P. Xing, H. Zhang, J. E. Gonzalez, and I. Stoica. Judging llm-as-a-judge with mt-bench and chatbot arena. 2023
work page 2023
-
[77]
Language models are multilingual chain-of-thought reasoners
Freda Shi and Mirac Suzgun and Markus Freitag and Xuezhi Wang and Suraj Srivats and Soroush Vosoughi and Hyung Won Chung and Yi Tay and Sebastian Ruder and Denny Zhou and Dipanjan Das and Jason Wei. Language models are multilingual chain-of-thought reasoners. The Eleventh International Conference on Learning Representations, 2023
work page 2023
-
[78]
Roformer: Enhanced transformer with rotary position embedding. Neurocomputing, 2024
work page 2024
-
[79]
GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints
GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints. arXiv preprint arXiv:2305.13245
work page internal anchor Pith review arXiv
-
[80]
WizardMath: Empowering Mathematical Reasoning for Large Language Models via Reinforced Evol-Instruct
WizardMath: Empowering mathematical reasoning for large language models via reinforced evol-instruct. arXiv preprint arXiv:2308.09583
work page internal anchor Pith review arXiv
-
[81]
ToRA: A Tool-Integrated Reasoning Agent for Mathematical Problem Solving
Zhibin Gou and Zhihong Shao and Yeyun Gong and Yelong Shen and Yujiu Yang and Minlie Huang and Nan Duan and Weizhu Chen. ToRA: A Tool-Integrated Reasoning Agent for Mathematical Problem Solving. CoRR, 2023. doi:10.48550/ARXIV.2309.17452
work page internal anchor Pith review doi:10.48550/arxiv.2309.17452 2023
-
[82]
Wenhu Chen and Xueguang Ma and Xinyi Wang and William W. Cohen. Program of Thoughts Prompting: Disentangling Computation from Reasoning for Numerical Reasoning Tasks. CoRR, 2022. doi:10.48550/ARXIV.2211.12588
work page internal anchor Pith review doi:10.48550/arxiv.2211.12588 2022
-
[83]
PAL: Program-aided Language Models
Luyu Gao and Aman Madaan and Shuyan Zhou and Uri Alon and Pengfei Liu and Yiming Yang and Jamie Callan and Graham Neubig. PAL: Program-aided Language Models. International Conference on Machine Learning, 2023
work page 2023
-
[84]
Jason Wei and Xuezhi Wang and Dale Schuurmans and Maarten Bosma and Brian Ichter and Fei Xia and Ed H. Chi and Quoc V. Le and Denny Zhou. Chain-of-thought prompting elicits reasoning in large language models. NeurIPS, 2022
-
[85]
Swaroop Mishra and Matthew Finlayson and Pan Lu and Leonard Tang and Sean Welleck and Chitta Baral and Tanmay Rajpurohit and Oyvind Tafjord and Ashish Sabharwal and Peter Clark and Ashwin Kalyan. Lila: A Unified Benchmark for Mathematical Reasoning. Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, 2022. doi:10.18653/V1/2022.EMNLP-MAIN.392
-
[86]
MAmmoTH: Building Math Generalist Models through Hybrid Instruction Tuning
Xiang Yue and Xingwei Qu and Ge Zhang and Yao Fu and Wenhao Huang and Huan Sun and Yu Su and Wenhu Chen. MAmmoTH: Building Math Generalist Models through Hybrid Instruction Tuning. CoRR, 2023. doi:10.48550/ARXIV.2309.05653
work page internal anchor Pith review doi:10.48550/arxiv.2309.05653 2023
-
[87]
MetaMath: Bootstrap Your Own Mathematical Questions for Large Language Models
Longhui Yu and Weisen Jiang and Han Shi and Jincheng Yu and Zhengying Liu and Yu Zhang and James T. Kwok and Zhenguo Li and Adrian Weller and Weiyang Liu. MetaMath: Bootstrap Your Own Mathematical Questions for Large Language Models. CoRR, 2023. doi:10.48550/ARXIV.2309.12284
work page internal anchor Pith review doi:10.48550/arxiv.2309.12284 2023
-
[88]
TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension
Joshi, Mandar and Choi, Eunsol and Weld, Daniel and Zettlemoyer, Luke. TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension. Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2017. doi:10.18653/v1/P17-1147
- [89]
-
[90]
Language models are unsupervised multitask learners. OpenAI blog
- [91]
-
[92]
HAI-LLM: An efficient and lightweight training tool for large models
-
[93]
Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism
Megatron-LM: Training multi-billion parameter language models using model parallelism. arXiv preprint arXiv:1909.08053
work page internal anchor Pith review Pith/arXiv arXiv 1909
-
[94]
Efficient large-scale language model training on GPU clusters using Megatron-LM. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis
-
[95]
Reducing activation recomputation in large transformer models
Reducing activation recomputation in large transformer models. Proceedings of Machine Learning and Systems
-
[96]
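FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness
Dao, Tri and Fu, Daniel Y. and Ermon, Stefano and Rudra, Atri and Ré, Christopher. FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness. Advances in Neural Information Processing Systems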
-
[97]
Dao, Tri. FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning
-
[98]
Efficient Memory Management for Large Language Model Serving with PagedAttention
Efficient Memory Management for Large Language Model Serving with PagedAttention. Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles
-
[99]
Attention is all you need
Attention is all you need. Advances in neural information processing systems
-
[100]
Zero: Memory optimizations toward training trillion parameter models. SC20: International Conference for High Performance Computing, Networking, Storage and Analysis, 2020
work page 2020
-
[101]
CCPM: A Chinese Classical Poetry Matching Dataset, 2021
work page 2021
-
[102]
Can a Suit of Armor Conduct Electricity? A New Dataset for Open Book Question Answering, 2018
work page 2018
- [103]
- [104]
-
[105]
Investigating Prior Knowledge for Challenging Chinese Machine Reading Comprehension, 2019
work page 2019
-
[106]
A Span-Extraction Dataset for Chinese Machine Reading Comprehension
Cui, Yiming and Liu, Ting and Che, Wanxiang and Xiao, Li and Chen, Zhipeng and Ma, Wentao and Wang, Shijin and Hu, Guoping. A Span-Extraction Dataset for Chinese Machine Reading Comprehension. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (E...
-
[107]
WinoGrande: An Adversarial Winograd Schema Challenge at Scale, 2019
work page 2019
-
[108]
CMATH: Can Your Language Model Pass Chinese Elementary School Math Test?, 2023
work page 2023
-
[109]
Measuring Massive Multitask Language Understanding
Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300
work page internal anchor Pith review Pith/arXiv arXiv 2009
-
[110]
Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them
Challenging BIG-Bench tasks and whether chain-of-thought can solve them. arXiv preprint arXiv:2210.09261
work page internal anchor Pith review arXiv
-
[111]
Program Synthesis with Large Language Models
Program synthesis with large language models. arXiv preprint arXiv:2108.07732
work page internal anchor Pith review Pith/arXiv arXiv
-
[112]
Proceedings of the 28th International Conference on Computational Linguistics,
Liang Xu and Hai Hu and Xuanwei Zhang and Lu Li and Chenjie Cao and Yudong Li and Yechen Xu and Kai Sun and Dian Yu and Cong Yu and Yin Tian and Qianqian Dong and Weitang Liu and Bo Shi and Yiming Cui and Junyi Li and Jun Zeng and Rongzhao Wang and Weijian Xie and Yanting Li and Yina Patterson and Zuoyu Tian and Yiwen Zhang and He Zhou and Shaoweihua Liu ...
-
[113]
Li, Haonan and Zhang, Yixuan and Koto, Fajri and Yang, Yifei and Zhao, Hai and Gong, Yeyun and Duan, Nan and Baldwin, Timothy. CMMLU: Measuring massive multitask language understanding in Chinese. arXiv preprint arXiv:2306.09212, 2023
-
[114]
Chujie Zheng and Minlie Huang and Aixin Sun. ChID: A Large-scale Chinese IDiom Dataset for Cloze Test. Proceedings of the 57th Conference of the Association for Computational Linguistics, 2019. doi:10.18653/V1/P19-1075
-
[115]
Guokun Lai and Qizhe Xie and Hanxiao Liu and Yiming Yang and Eduard H. Hovy. RACE: Large-scale ReAding Comprehension Dataset From Examinations. Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, 2017. doi:10.18653/V1/D17-1082
-
[116]
DROP: A Reading Comprehension Benchmark Requiring Discrete Reasoning Over Paragraphs
Dheeru Dua and Yizhong Wang and Pradeep Dasigi and Gabriel Stanovsky and Sameer Singh and Matt Gardner. DROP: A Reading Comprehension Benchmark Requiring Discrete Reasoning Over Paragraphs. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2019. doi:10.18653/V1/N19-1246
-
[117]
Huang, Yuzhen and Bai, Yuzhuo and Zhu, Zhihao and Zhang, Junlei and Zhang, Jinghan and Su, Tangjun and Liu, Junteng and Lv, Chuancheng and Zhang, Yikai and Lei, Jiayi and others. C-Eval: A Multi-Level Multi-Discipline Chinese Evaluation Suite for Foundation Models
-
[118]
LLaMA: Open and Efficient Foundation Language Models
Touvron, Hugo and Lavril, Thibaut and Izacard, Gautier and Martinet, Xavier and Lachaux, Marie-Anne and Lacroix, Timothée and others. LLaMA: Open and Efficient Foundation Language Models. arXiv preprint arXiv:2302.13971
work page internal anchor Pith review Pith/arXiv arXiv
-
[119]
Llama 2: Open Foundation and Fine-Tuned Chat Models
Hugo Touvron and Louis Martin and Kevin Stone and Peter Albert and Amjad Almahairi and Yasmine Babaei and Nikolay Bashlykov and Soumya Batra and Prajjwal Bhargava and Shruti Bhosale and Dan Bikel and Lukas Blecher and Cristian Canton and others. Llama 2: Open Foundation and Fine-Tuned Chat Models, 2023. doi:10.48550/arXiv.2307.09288
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2307.09288 2023