Recognition: 3 theorem links · Lean Theorem
DeepSeek LLM: Scaling Open-Source Language Models with Longtermism
Pith reviewed 2026-05-11 06:03 UTC · model grok-4.3
The pith
DeepSeek LLM 67B surpasses LLaMA-2 70B on code, mathematics and reasoning benchmarks, with its chat version exceeding GPT-3.5 in open-ended evaluations.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Guided by our distinctive findings on scaling laws, we train DeepSeek LLM base models in 7B and 67B configurations on a dataset that currently consists of 2 trillion tokens and is continuously expanding. We further apply supervised fine-tuning and direct preference optimization to produce DeepSeek Chat models. Evaluation shows that DeepSeek LLM 67B surpasses LLaMA-2 70B across various benchmarks with particular strength in code, mathematics and reasoning, while open-ended evaluations indicate that DeepSeek LLM 67B Chat exhibits superior performance compared to GPT-3.5.
What carries the argument
Distinctive findings on scaling laws that guide effective training in 7B and 67B sizes, implemented through a continuously expanding 2 trillion token dataset plus supervised fine-tuning and direct preference optimization.
If this is right
- DeepSeek LLM 67B records higher scores than LLaMA-2 70B on standard benchmarks, especially those involving code, mathematics and reasoning.
- The 67B chat model achieves better results than GPT-3.5 when evaluated on open-ended tasks.
- The same scaling approach with ongoing data growth can be applied to produce further improvements in open-source models at these sizes.
- Long-term expansion of the training dataset supports continued progress without requiring changes to the core training configuration.
Where Pith is reading between the lines
- If the identified scaling patterns persist, further growth of the token dataset beyond the current 2 trillion could yield additional performance lifts in the same model sizes.
- Open projects following this data-first, long-horizon route may gradually close capability gaps with closed models on reasoning-heavy tasks.
- Re-running the comparisons on entirely new benchmark suites would test whether the observed advantages generalize beyond the reported set.
- The emphasis on sustained data collection could encourage similar multi-year efforts in other open-source language-model initiatives.
Load-bearing premise
The selected benchmarks and open-ended evaluations measure genuine model capability without undisclosed overlap in training data or advantages in methodology.
What would settle it
Independent re-testing on a fresh set of benchmarks withheld from the original evaluation that shows DeepSeek LLM 67B no longer outperforming LLaMA-2 70B or its chat version no longer exceeding GPT-3.5.
Original abstract
The rapid development of open-source large language models (LLMs) has been truly remarkable. However, the scaling law described in previous literature presents varying conclusions, which casts a dark cloud over scaling LLMs. We delve into the study of scaling laws and present our distinctive findings that facilitate scaling of large scale models in two commonly used open-source configurations, 7B and 67B. Guided by the scaling laws, we introduce DeepSeek LLM, a project dedicated to advancing open-source language models with a long-term perspective. To support the pre-training phase, we have developed a dataset that currently consists of 2 trillion tokens and is continuously expanding. We further conduct supervised fine-tuning (SFT) and Direct Preference Optimization (DPO) on DeepSeek LLM Base models, resulting in the creation of DeepSeek Chat models. Our evaluation results demonstrate that DeepSeek LLM 67B surpasses LLaMA-2 70B on various benchmarks, particularly in the domains of code, mathematics, and reasoning. Furthermore, open-ended evaluations reveal that DeepSeek LLM 67B Chat exhibits superior performance compared to GPT-3.5.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces DeepSeek LLM, an open-source project focused on long-term scaling of LLMs. It reports empirical studies of scaling laws for 7B and 67B models, describes pre-training a base model on a 2-trillion-token dataset that continues to grow, and applies supervised fine-tuning (SFT) followed by Direct Preference Optimization (DPO) to create chat variants. The central empirical claims are that DeepSeek LLM 67B outperforms LLaMA-2 70B on code, mathematics, and reasoning benchmarks, and that the 67B Chat model shows superior performance to GPT-3.5 in open-ended evaluations.
Significance. If the benchmark results hold under scrutiny, the work is significant for advancing reproducible open-source LLMs by releasing competitive 67B-scale models trained with explicit long-term data scaling. The inclusion of scaling-law experiments, dataset construction details, and decontamination protocols in the methods section provides a useful reference for the community and supports the reported performance deltas.
major comments (2)
- [Evaluation] Open-ended evaluation section: the claim that DeepSeek LLM 67B Chat exhibits superior performance to GPT-3.5 rests on unspecified details of the evaluation protocol (prompting strategy, judge model or human raters, and any agreement metrics). Without these, the result cannot be independently verified and is load-bearing for the chat-model contribution.
- [§5] Benchmark results (tables in §5): while decontamination steps are described, the paper does not report the fraction of test-set overlap removed or provide before/after scores; this leaves open the possibility that domain-specific gains (code/math) partly reflect data leakage rather than model capability.
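The "fraction of test-set overlap removed" requested in the second major comment can be made concrete with a simple n-gram check. The sketch below is only an illustration: the function names, the whitespace tokenization, and the default n-gram length are assumptions made for this example, not the decontamination procedure used in the paper.

```python
from typing import Iterable, List, Set, Tuple


def ngrams(text: str, n: int) -> Set[Tuple[str, ...]]:
    """Return the set of lowercase whitespace-token n-grams in `text`."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_fraction(train_docs: Iterable[str],
                     test_items: List[str],
                     n: int = 10) -> float:
    """Fraction of test items sharing at least one n-gram with the training text.

    Hypothetical helper: a real pipeline would use the model's tokenizer and
    a memory-efficient index rather than an in-memory set.
    """
    train_grams: Set[Tuple[str, ...]] = set()
    for doc in train_docs:
        train_grams |= ngrams(doc, n)
    if not test_items:
        return 0.0
    flagged = sum(1 for item in test_items if ngrams(item, n) & train_grams)
    return flagged / len(test_items)


if __name__ == "__main__":
    train = ["the quick brown fox jumps over the lazy dog"]
    test = ["a quick brown fox appears", "nothing shared here at all"]
    print(overlap_fraction(train, test, n=3))
```

Reporting this fraction per benchmark category, together with the chosen n, would let readers judge how much of the code and mathematics gains could be attributed to leakage.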
minor comments (3)
- [Scaling Laws] Figure captions for scaling curves should explicitly list the fitted exponents and any confidence intervals; current plots are difficult to reproduce from the text alone.
- [Abstract] The abstract uses 'longtermism' without definition; a one-sentence gloss would improve accessibility for readers outside the immediate subfield.
- [Evaluation] Several benchmark tables lack standard deviations or number of runs; adding these would strengthen the statistical interpretation of the reported deltas over LLaMA-2.
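The first minor comment asks for fitted exponents and their uncertainty. One common way to obtain them is a least-squares fit in log-log space, sketched below. The function name, the synthetic data points, and the choice of an ordinary least-squares fit are assumptions made for illustration; the paper's actual fitting procedure is not described on this page.

```python
import math
from typing import Sequence, Tuple


def fit_power_law(x: Sequence[float],
                  y: Sequence[float]) -> Tuple[float, float, float]:
    """Fit y = a * x**b by least squares on (log x, log y).

    Returns (a, b, standard error of b). Hypothetical helper for illustration.
    """
    if len(x) != len(y) or len(x) < 3:
        raise ValueError("need at least three paired points")
    lx = [math.log(v) for v in x]
    ly = [math.log(v) for v in y]
    n = len(lx)
    mx = sum(lx) / n
    my = sum(ly) / n
    sxx = sum((u - mx) ** 2 for u in lx)
    sxy = sum((u - mx) * (v - my) for u, v in zip(lx, ly))
    b = sxy / sxx                      # slope in log-log space = exponent
    log_a = my - b * mx                # intercept in log-log space
    residual = sum((v - (log_a + b * u)) ** 2 for u, v in zip(lx, ly))
    se_b = math.sqrt(residual / (n - 2) / sxx)
    return math.exp(log_a), b, se_b


if __name__ == "__main__":
    # Synthetic, noise-free points (not values from the paper).
    compute = [1e17, 1e18, 1e19, 1e20]
    loss = [2.0 * c ** -0.05 for c in compute]
    a, b, se_b = fit_power_law(compute, loss)
    print(f"a={a:.4f}, b={b:.4f}, se(b)={se_b:.2e}")
```

An approximate confidence interval for the exponent follows from `b` plus or minus a multiple of `se_b`; listing both in each figure caption would address the reproducibility concern.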
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and positive overall assessment. We address each major comment below, indicating where revisions will be made to enhance verifiability and transparency.
Point-by-point responses
-
Referee: [Evaluation] Open-ended evaluation section: the claim that DeepSeek LLM 67B Chat exhibits superior performance to GPT-3.5 rests on unspecified details of the evaluation protocol (prompting strategy, judge model or human raters, and any agreement metrics). Without these, the result cannot be independently verified and is load-bearing for the chat-model contribution.
Authors: We agree that full specification of the evaluation protocol is necessary for independent verification of the open-ended results. In the revised manuscript we will expand the relevant section to detail the prompting strategy, the judge model employed, the involvement of human raters (if any), and quantitative agreement metrics such as inter-rater reliability scores. These additions will directly support the claim of superior performance relative to GPT-3.5.
Revision: yes
-
Referee: [§5] Benchmark results (tables in §5): while decontamination steps are described, the paper does not report the fraction of test-set overlap removed or provide before/after scores; this leaves open the possibility that domain-specific gains (code/math) partly reflect data leakage rather than model capability.
Authors: We acknowledge the value of quantifying the decontamination impact. We will revise the methods and results sections to report the fraction of test-set overlap removed for each benchmark category. However, providing complete before/after benchmark scores would require retraining the 67B model on the full 2-trillion-token corpus without decontamination, which is computationally prohibitive. We will instead clarify that the described decontamination procedure was applied uniformly and that performance advantages appear consistently across diverse benchmarks.
Revision: partial
- Provision of before/after benchmark scores comparing models trained with and without decontamination, due to the prohibitive computational cost of retraining at 2-trillion-token scale.
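The first response above mentions "quantitative agreement metrics such as inter-rater reliability scores". One standard choice for two raters is Cohen's kappa, sketched below. The function name and the win/loss/tie labels are hypothetical and chosen only for this example; the rebuttal does not say which metric would be reported.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(rater_a: Sequence[Hashable],
                 rater_b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two raters labelling the same items.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and
    p_e is the agreement expected by chance from each rater's label frequencies.
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty set of items")
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    count_a = Counter(rater_a)
    count_b = Counter(rater_b)
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if expected == 1.0:
        # Both raters used a single identical label; agreement is trivially perfect.
        return 1.0
    return (observed - expected) / (1.0 - expected)


if __name__ == "__main__":
    # Hypothetical judgments on four model-vs-baseline comparisons.
    judge_model = ["win", "win", "loss", "tie"]
    human_rater = ["win", "loss", "loss", "tie"]
    print(round(cohens_kappa(judge_model, human_rater), 4))
```

Reporting kappa between the judge model and human raters, alongside the raw agreement rate, would make the open-ended comparison against GPT-3.5 easier to verify.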
Circularity Check
No significant circularity; empirical benchmark results are self-contained
Full rationale
The paper's core claims consist of observed performance deltas on external benchmarks (code, math, reasoning, open-ended chat) after training a 67B model on an expanding 2T-token corpus followed by SFT+DPO. Scaling-law experiments are described as guiding dataset and model choices but do not reduce any reported result to a fitted parameter renamed as a prediction; the evaluation protocols, decontamination steps, and few-shot settings are stated explicitly and independently of the final scores. No self-definitional equations, load-bearing self-citations, or ansatz smuggling appear in the derivation chain.
Axiom & Free-Parameter Ledger
free parameters (2)
- model scale
- pre-training data volume
axioms (2)
- domain assumption: Scaling laws reliably predict performance gains with increased model size and data
- domain assumption: SFT followed by DPO produces aligned chat models that generalize on benchmarks
Lean theorems connected to this paper
-
IndisputableMonolith.Foundation.DAlembert.Inevitability bilinear_family_forced
Relation between the paper passage and the cited Recognition theorem: unclear
Our evaluation results demonstrate that DeepSeek LLM 67B surpasses LLaMA-2 70B on various benchmarks, particularly in the domains of code, mathematics, and reasoning.
-
IndisputableMonolith.Foundation.PhiForcing phi_equation
Relation between the paper passage and the cited Recognition theorem: unclear
Guided by the scaling laws, we introduce DeepSeek LLM, a project dedicated to advancing open-source language models with a long-term perspective.
-
IndisputableMonolith.Foundation.LedgerForcing conservation_from_balance
Relation between the paper passage and the cited Recognition theorem: unclear
To support the pre-training phase, we have developed a dataset that currently consists of 2 trillion tokens
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 44 Pith papers
-
Large Language Diffusion Models
LLaDA is a scalable diffusion-based language model that matches autoregressive LLMs like LLaMA3 8B on tasks and surpasses GPT-4o on reversal poem completion.
-
IndicMedDialog: A Parallel Multi-Turn Medical Dialogue Dataset for Accessible Healthcare in Indic Languages
A parallel multi-turn medical dialogue dataset spanning English and nine Indic languages is created from synthetic consultations to enable personalized AI healthcare interactions.
-
ENEC: A Lossless AI Model Compression Method Enabling Fast Inference on Ascend NPUs
ENEC delivers 3.43X higher throughput than DietGPU and 1.12X better compression ratio than nvCOMP for lossless model weight compression on Ascend NPUs, yielding up to 6.3X end-to-end inference speedup.
-
Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation
Scaled vanilla autoregressive models based on Llama achieve 2.18 FID on ImageNet 256x256 image generation, beating popular diffusion models without visual inductive biases.
-
DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model
DeepSeek-V2 delivers top-tier open-source LLM performance using only 21B active parameters by compressing the KV cache 93.3% and cutting training costs 42.5% via MLA and DeepSeekMoE.
-
Know When To Fold 'Em: Token-Efficient LLM Synthetic Data Generation via Multi-Stage In-Flight Rejection
MSIFR stops faulty LLM generations early via staged rule-based checks, reducing token consumption 11-78% with no accuracy loss.
-
PEML: Parameter-efficient Multi-Task Learning with Optimized Continuous Prompts
PEML co-optimizes continuous prompts and low-rank adaptations to deliver up to 6.67% average accuracy gains over existing multi-task PEFT methods on GLUE, SuperGLUE, and other benchmarks.
-
SAGE: Scalable Automated Robustness Augmentation for LLM Knowledge Evaluation
SAGE trains a rubric-based verifier and an RL-optimized generator on seed human data to scalably augment LLM knowledge benchmarks, matching human-annotated quality on HellaSwag at lower cost and generalizing to MMLU.
-
Causal Bias Detection in Generative Artifical Intelligence
A causal framework unifies fairness analysis across generative AI and standard ML by deriving decompositions that separate biases along causal pathways and differences between real-world and model mechanisms.
-
Navigating the Sea of LLM Evaluation: Investigating Bias in Toxicity Benchmarks
Toxicity benchmarks for LLMs produce inconsistent results when task type, input domain, or model changes, revealing intrinsic evaluation biases.
-
Training continuously-coupled reconfigurable photonic chips with quantum machine learning
A black-box machine learning technique trains continuously-coupled photonic waveguide arrays to implement target unitaries using limited single- and two-photon measurements without requiring detailed internal models.
-
Predicting Large Model Test Losses with a Noisy Quadratic System
A noisy quadratic system predicts large model test losses from N, B, K and outperforms Chinchilla's model for extrapolation up to 1000x compute.
-
DSPE: An Energy-Efficient Edge Processor for DeepSeek Inference with MerkleTree-based Incremental Pruning, Multi-Stage Boothing Lookup and Dynamic Adaptive Posit Processing
DSPE is an edge processor that achieves 109.4 TFLOPS/W for DeepSeek inference using Merkle tree-based incremental pruning, multi-stage boothing lookup, and dynamic adaptive posit processing.
-
RELO: Reinforcement Learning to Localize for Visual Object Tracking
RELO replaces handcrafted spatial priors with a reinforcement learning policy for target localization in visual tracking and reports 57.5% AUC on LaSOText without template updates.
-
Why Does Agentic Safety Fail to Generalize Across Tasks?
Agentic safety fails to generalize across tasks because the task-to-safe-controller mapping has a higher Lipschitz constant than the task-to-controller mapping alone, as proven in linear-quadratic control and demonstr...
-
InfoLaw: Information Scaling Laws for Large Language Models with Quality-Weighted Mixture Data and Repetition
InfoLaw models pretraining as information accumulation where quality sets information density and repetition causes scale-dependent diminishing returns, predicting loss with low error on unseen mixtures and larger sca...
-
Rethinking LLM Ensembling from the Perspective of Mixture Models
ME reinterprets LLM ensembling as a mixture model by sampling a single model stochastically at each token step, matching the ensemble distribution while invoking only one model per step for substantial speed gains.
-
ReaGeo: Reasoning-Enhanced End-to-End Geocoding with LLMs
ReaGeo is an end-to-end LLM framework for geocoding that uses geohash text generation, Chain-of-Thought spatial reasoning, and distance-based RL to accurately predict points and regions from explicit and vague queries.
-
Universally Empowering Zeroth-Order Optimization via Adaptive Layer-wise Sampling
AdaLeZO uses a non-stationary multi-armed bandit to adaptively allocate perturbation budget across layers in zeroth-order optimization and applies inverse probability weighting to reduce variance while preserving unbi...
-
Towards Faster Language Model Inference Using Mixture-of-Experts Flow Matching
Mixture-of-experts flow matching enables non-autoregressive language models to achieve autoregressive-level quality in three sampling steps, delivering up to 1000x faster inference than diffusion models.
-
Dataset-Level Metrics Attenuate Non-Determinism: A Fine-Grained Non-Determinism Evaluation in Diffusion Language Models
Dataset-level metrics in diffusion language models mask substantial sample-level non-determinism that varies with model and system factors, which a new Factor Variance Attribution metric can decompose.
-
AFGNN: API Misuse Detection using Graph Neural Networks and Clustering
AFGNN detects API misuses in Java code more effectively than prior methods by representing usage as graphs and clustering learned embeddings from self-supervised training.
-
Muon is Scalable for LLM Training
Muon optimizer with weight decay and update scaling achieves ~2x efficiency over AdamW for large LLMs, shown via the Moonlight 3B/16B MoE model trained on 5.7T tokens.
-
MiniCPM: Unveiling the Potential of Small Language Models with Scalable Training Strategies
MiniCPM 1.2B and 2.4B models reach parity with 7B-13B LLMs via model wind-tunnel scaling and a WSD scheduler that yields a higher optimal data-to-model ratio than Chinchilla scaling.
-
Are We on the Right Way for Evaluating Large Vision-Language Models?
Current LVLM benchmarks overestimate capabilities because many questions can be answered without images due to design flaws or data leakage; MMStar is a human-curated set of 1,500 vision-indispensable samples across 6...
-
LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code
LiveCodeBench collects 400 recent contest problems to create a contamination-free benchmark evaluating LLMs on code generation and related capabilities like self-repair and execution.
-
Handling and Interpreting Missing Modalities in Patient Clinical Trajectories via Autoregressive Sequence Modeling
Autoregressive transformer modeling with missingness-aware contrastive pre-training outperforms baselines on MIMIC-IV and eICU benchmarks and mitigates divergent behavior from removed modalities in clinical trajectories.
-
Why Do Vision Language Models Struggle To Recognize Human Emotions?
VLMs fail at dynamic facial expression recognition because web-scale pretraining exacerbates long-tailed class bias and sparse frame sampling misses micro-expressions; a multi-stage context enrichment strategy using l...
-
Reconstruction of a 3D wireframe from a single line drawing via generative depth estimation
A latent diffusion model conditioned on line drawings estimates dense depth to reconstruct 3D wireframes, reporting 5.3% average depth error after training on over one million pairs.
-
The Cognitive Circuit Breaker: A Systems Engineering Framework for Intrinsic AI Reliability
The Cognitive Circuit Breaker detects LLM hallucinations by computing the Cognitive Dissonance Delta between semantic confidence and latent certainty from hidden states, adding negligible overhead.
-
RefineRAG: Word-Level Poisoning Attacks via Retriever-Guided Text Refinement
RefineRAG achieves 90% attack success on NQ by generating toxic seeds then optimizing them via retriever-in-the-loop word refinement, outperforming prior methods on effectiveness and naturalness.
-
Tug-of-War within A Decade: Conflict Resolution in Vulnerability Analysis via Teacher-Guided Retrieval-Augmented Generations
CRVA-TGRAG combines parent-document segmentation, ensemble retrieval, and teacher-guided fine-tuning to mitigate knowledge conflicts and improve accuracy in LLM-based CVE vulnerability analysis.
-
Towards Reasoning Era: A Survey of Long Chain-of-Thought for Reasoning Large Language Models
The paper unifies perspectives on Long CoT in reasoning LLMs by introducing a taxonomy, detailing characteristics of deep reasoning and reflection, and discussing emergence phenomena and future directions.
-
DeepSeek-Coder: When the Large Language Model Meets Programming -- The Rise of Code Intelligence
DeepSeek-Coder open-source models trained on 2T code tokens with fill-in-the-blank pretraining achieve SOTA results among open models and surpass closed-source Codex and GPT-3.5 on code benchmarks.
-
DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models
DeepSeekMoE 2B matches GShard 2.9B performance and approaches a dense 2B model; the 16B version matches LLaMA2-7B at 40% compute by using fine-grained expert segmentation plus shared experts.
-
Reinforcement Learning for Compositional Generalization with Outcome-Level Optimization
Outcome-level RL with binary or composite rewards improves compositional generalization over supervised fine-tuning by avoiding overfitting to frequent training patterns.
-
Agentic Application in Power Grid Static Analysis: Automatic Code Generation and Error Correction
An LLM agent with static pre-check, dynamic feedback, and semantic validation generates MATPOWER code from natural language for power grid analysis at 82.38% fidelity.
-
Identifying Topological Invariants of Non-Hermitian Systems via Domain-Adaptive Multimodal Model for Mathematics
A multimodal model with Qwen Math backbone identifies topological invariants of non-Hermitian systems from eigenvalues and eigenvectors in momentum space.
-
Data Mixing for Large Language Models Pretraining: A Survey and Outlook
A survey that taxonomizes data mixing strategies for LLM pretraining into static rule-based, learning-based, and dynamic adaptive families while highlighting transferability challenges and evaluation gaps.
-
How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites
InternVL 1.5 narrows the performance gap to proprietary multimodal models via a stronger transferable vision encoder, dynamic high-resolution tiling, and curated English-Chinese training data.
-
DeepSeek-VL: Towards Real-World Vision-Language Understanding
DeepSeek-VL develops open-source 1.3B and 7B vision-language models that achieve competitive or state-of-the-art results on real-world visual-language benchmarks through diverse data curation, a hybrid vision encoder,...
-
TinyLlama: An Open-Source Small Language Model
TinyLlama is a 1.1B-parameter open-source language model pretrained on 1 trillion tokens that outperforms other open-source models of similar size on downstream tasks.
-
Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling
Scaling data, model size, and training optimization on the Janus architecture yields better multimodal understanding and more stable, instruction-following text-to-image generation.
-
A Survey on Large Language Models for Code Generation
A systematic literature review that organizes recent work on LLMs for code generation into a taxonomy covering data curation, model advances, evaluations, ethics, environmental impact, and applications, with benchmark...
Reference graph
Works this paper leans on
-
[2]
Anthropic. Introducing Claude , 2023. URL https://www.anthropic.com/index/introducing-claude
work page 2023
-
[6]
T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. M. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, and D. Amod...
work page 2020
-
[7]
M. Chen, J. Tworek, H. Jun, Q. Yuan, H. P. de Oliveira Pinto, J. Kaplan, H. Edwards, Y. Burda, N. Joseph, G. Brockman, A. Ray, R. Puri, G. Krueger, M. Petrov, H. Khlaaf, G. Sastry, P. Mishkin, B. Chan, S. Gray, N. Ryder, M. Pavlov, A. Power, L. Kaiser, M. Bavarian, C. Winter, P. Tillet, F. P. Such, D. Cummings, M. Plappert, F. Chantzis, E. Barnes, A. Herb...
work page internal anchor Pith review Pith/arXiv arXiv 2021
- [10]
-
[12]
T. Dao. Flash A ttention-2: Faster attention with better parallelism and work partitioning. 2023
work page 2023
-
[13]
T. Dao, D. Y. Fu, S. Ermon, A. Rudra, and C. R \'e . Flash A ttention: Fast and memory-efficient exact attention with IO -awareness. In Advances in Neural Information Processing Systems, 2022
work page 2022
-
[14]
Z. Du, Y. Qian, X. Liu, M. Ding, J. Qiu, Z. Yang, and J. Tang. Glm: General language model pretraining with autoregressive blank infilling. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 320--335, 2022
work page 2022
-
[16]
L. Gao, S. Biderman, S. Black, L. Golding, T. Hoppe, C. Foster, J. Phang, H. He, A. Thite, N. Nabeshima, et al. The Pile : An 800GB dataset of diverse text for language modeling. arXiv preprint arXiv:2101.00027, 2020
work page internal anchor Pith review Pith/arXiv arXiv 2020
-
[17]
An important next step on our AI journey, 2023
Google. An important next step on our AI journey, 2023. URL https://blog.google/technology/ai/bard-google-ai-search-updates/
work page 2023
-
[24]
High-flyer. Hai-llm: 高效且轻量的大模型训练工具, 2023. URL https://www.high-flyer.cn/en/blog/hai-llm
work page 2023
-
[26]
C-eval: A multi-level multi-discipline chinese evaluation suite for foundation models
Y. Huang, Y. Bai, Z. Zhu, J. Zhang, J. Zhang, T. Su, J. Liu, C. Lv, Y. Zhang, J. Lei, et al. C-Eval : A multi-level multi-discipline chinese evaluation suite for foundation models. arXiv preprint arXiv:2305.08322, 2023
-
[27]
Tokenizers : Fast state-of-the-art tokenizers optimized for research and production, 2019
Huggingface Team . Tokenizers : Fast state-of-the-art tokenizers optimized for research and production, 2019. URL https://github.com/huggingface/tokenizers
work page 2019
-
[28]
F. i, M. Suzgun, M. Freitag, X. Wang, S. Srivats, S. Vosoughi, H. W. Chung, Y. Tay, S. Ruder, D. Zhou, D. Das, and J. Wei. Language models are multilingual chain-of-thought reasoners. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023 . OpenReview.net, 2023. URL https://openreview.net/pdf?id=fR3wGCk-IXp
work page 2023
- [29]
-
[33]
V. A. Korthikanti, J. Casper, S. Lym, L. McAfee, M. Andersch, M. Shoeybi, and B. Catanzaro. Reducing activation recomputation in large transformer models. Proceedings of Machine Learning and Systems, 5, 2023
work page 2023
-
[35]
W. Kwon, Z. Li, S. Zhuang, Y. Sheng, L. Zheng, C. H. Yu, J. E. Gonzalez, H. Zhang, and I. Stoica. Efficient memory management for large language model serving with pagedattention. In Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles, 2023
work page 2023
- [37]
-
[38]
W. Li, F. Qi, M. Sun, X. Yi, and J. Zhang. Ccpm: A chinese classical poetry matching dataset, 2021
work page 2021
-
[43]
T. Mihaylov, P. Clark, T. Khot, and A. Sabharwal. Can a suit of armor conduct electricity? a new dataset for open book question answering, 2018
work page 2018
-
[44]
D. Narayanan, M. Shoeybi, J. Casper, P. LeGresley, M. Patwary, V. Korthikanti, D. Vainbrand, P. Kashinkunti, J. Bernauer, B. Catanzaro, et al. Efficient large-scale language model training on gpu clusters using megatron-lm. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pages 1--15, 2021
work page 2021
-
[45]
OpenAI. Introducing ChatGPT , 2022. URL https://openai.com/blog/chatgpt
work page 2022
-
[46]
OpenAI. GPT4 technical report. arXiv preprint arXiv:2303.08774, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
- [47]
-
[49]
A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever, et al. Language models are unsupervised multitask learners. OpenAI blog, 1 0 (8): 0 9, 2019
work page 2019
-
[50]
R. Rafailov, A. Sharma, E. Mitchell, S. Ermon, C. D. Manning, and C. Finn. Direct preference optimization: Your language model is secretly a reward model. 2023
work page 2023
-
[51]
S. Rajbhandari, J. Rasley, O. Ruwase, and Y. He. Zero: Memory optimizations toward training trillion parameter models. In SC20: International Conference for High Performance Computing, Networking, Storage and Analysis, pages 1--16. IEEE, 2020
work page 2020
-
[52]
K. Sakaguchi, R. L. Bras, C. Bhagavatula, and Y. Choi. Winogrande: An adversarial winograd schema challenge at scale, 2019
work page 2019
-
[53]
C. J. Shallue, J. Lee, J. Antognini, J. Sohl-Dickstein, R. Frostig, and G. E. Dahl. Measuring the effects of data parallelism on neural network training. Journal of Machine Learning Research, 20 0 (112): 0 1--49, 2019
work page 2019
-
[58]
J. Su, M. Ahmed, Y. Lu, S. Pan, W. Bo, and Y. Liu. Roformer: Enhanced transformer with rotary position embedding. Neurocomputing, 568: 0 127063, 2024
work page 2024
-
[59]
K. Sun, D. Yu, D. Yu, and C. Cardie. Investigating prior knowledge for challenging chinese machine reading comprehension, 2019
work page 2019
-
[63]
A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, . Kaiser, and I. Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017
work page 2017
-
[65]
J. Wei, X. Wang, D. Schuurmans, M. Bosma, B. Ichter, F. Xia, E. H. Chi, Q. V. Le, and D. Zhou. Chain-of-thought prompting elicits reasoning in large language models. In NeurIPS, 2022. URL http://papers.nips.cc/paper\_files/paper/2022/hash/9d5609613524ecf4f15af0f7b31abca4-Abstract-Conference.html
work page 2022
-
[66]
T. Wei, J. Luan, W. Liu, S. Dong, and B. Wang. Cmath: Can your language model pass chinese elementary school math test?, 2023
work page 2023
-
[68]
A. Yang, B. Xiao, B. Wang, B. Zhang, C. Yin, C. Lv, D. Pan, D. Wang, D. Yan, F. Yang, F. Deng, F. Wang, F. Liu, G. Ai, G. Dong, H. Zhao, H. Xu, H. Sun, H. Zhang, H. Liu, J. Ji, J. Xie, J. Dai, K. Fang, L. Su, L. Song, L. Liu, L. Ru, L. Ma, M. Wang, M. Liu, M. Lin, N. Nie, P. Guo, R. Sun, T. Zhang, T. Li, T. Li, W. Cheng, W. Chen, X. Zeng, X. Wang, X. Chen...
work page 2023
-
[71]
B. Zhang and R. Sennrich. Root mean square layer normalization. Advances in Neural Information Processing Systems, 32, 2019
work page 2019
- [72]
-
[74]
L. Zheng, W.-L. Chiang, Y. Sheng, S. Zhuang, Z. Wu, Y. Zhuang, Z. Lin, Z. Li, D. Li, E. P. Xing, H. Zhang, J. E. Gonzalez, and I. Stoica. Judging llm-as-a-judge with mt-bench and chatbot arena. 2023
work page 2023
-
[77]
The Eleventh International Conference on Learning Representations,
Freda i and Mirac Suzgun and Markus Freitag and Xuezhi Wang and Suraj Srivats and Soroush Vosoughi and Hyung Won Chung and Yi Tay and Sebastian Ruder and Denny Zhou and Dipanjan Das and Jason Wei , title =. The Eleventh International Conference on Learning Representations,. 2023 , url =
work page 2023
-
[78]
Roformer: Enhanced transformer with rotary position embedding , author=. Neurocomputing , volume=. 2024 , publisher=
work page 2024
-
[79]
GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints
GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints , author=. arXiv preprint arXiv:2305.13245 , year=
work page internal anchor Pith review arXiv
-
[80]
Wizardmath: Empowering mathematical reasoning for large language models via reinforced evol-instruct
Wizardmath: Empowering mathematical reasoning for large language models via reinforced evol-instruct , author=. arXiv preprint arXiv:2308.09583 , year=
-
[81]
Tora: A tool-integrated reasoning agent for mathematical problem solving
Zhibin Gou and Zhihong Shao and Yeyun Gong and Yelong Shen and Yujiu Yang and Minlie Huang and Nan Duan and Weizhu Chen , title =. CoRR , volume =. 2023 , url =. doi:10.48550/ARXIV.2309.17452 , eprinttype =. 2309.17452 , timestamp =
-
[82]
Wenhu Chen, Xueguang Ma, Xinyi Wang, and William W. Cohen. CoRR, 2022. doi:10.48550/ARXIV.2211.12588.
-
[83]
Luyu Gao, Aman Madaan, Shuyan Zhou, Uri Alon, Pengfei Liu, Yiming Yang, Jamie Callan, and Graham Neubig. International Conference on Machine Learning, 2023.
-
[84]
Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed H. Chi, Quoc V. Le, and Denny Zhou. NeurIPS.
-
[85]
Swaroop Mishra, Matthew Finlayson, Pan Lu, Leonard Tang, Sean Welleck, Chitta Baral, Tanmay Rajpurohit, Oyvind Tafjord, Ashish Sabharwal, Peter Clark, and Ashwin Kalyan. Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, 2022. doi:10.18653/V1/2022.EMNLP-MAIN.392.
-
[86]
Xiang Yue, Xingwei Qu, Ge Zhang, Yao Fu, Wenhao Huang, Huan Sun, Yu Su, and Wenhu Chen. arXiv preprint arXiv:2309.05653, 2023. doi:10.48550/ARXIV.2309.05653.
-
[87]
Longhui Yu, Weisen Jiang, Han Shi, Jincheng Yu, Zhengying Liu, Yu Zhang, James T. Kwok, Zhenguo Li, Adrian Weller, and Weiyang Liu. MetaMath: Bootstrap Your Own Mathematical Questions for Large Language Models. CoRR, 2023. doi:10.48550/ARXIV.2309.12284.
-
[88]
Mandar Joshi, Eunsol Choi, Daniel Weld, and Luke Zettlemoyer. TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension. Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2017. doi:10.18653/v1/P17-1147.
-
[90]
Language models are unsupervised multitask learners. OpenAI blog.
-
[92]
HAI-LLM: An efficient and lightweight training tool for large models.
-
[93]
Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism. arXiv preprint arXiv:1909.08053.
-
[94]
Efficient large-scale language model training on GPU clusters using Megatron-LM. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis.
-
[95]
Reducing activation recomputation in large transformer models. Proceedings of Machine Learning and Systems.
-
[96]
Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, and R... Flash... Advances in Neural Information Processing Systems.
-
[97]
Tri Dao. Flash...
-
[98]
Efficient Memory Management for Large Language Model Serving with PagedAttention. Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles.
-
[99]
Attention is all you need. Advances in Neural Information Processing Systems.
-
[100]
ZeRO: Memory optimizations toward training trillion parameter models. SC20: International Conference for High Performance Computing, Networking, Storage and Analysis, 2020.
-
[101]
CCPM: A Chinese Classical Poetry Matching Dataset. 2021.
-
[102]
Can a Suit of Armor Conduct Electricity? A New Dataset for Open Book Question Answering. 2018.
-
[105]
Investigating Prior Knowledge for Challenging Chinese Machine Reading Comprehension. 2019.
-
[106]
Yiming Cui, Ting Liu, Wanxiang Che, Li Xiao, Zhipeng Chen, Wentao Ma, Shijin Wang, and Guoping Hu. A Span-Extraction Dataset for Chinese Machine Reading Comprehension. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (E...
-
[107]
WinoGrande: An Adversarial Winograd Schema Challenge at Scale. 2019.
-
[108]
CMATH: Can Your Language Model Pass Chinese Elementary School Math Test? 2023.
-
[109]
Measuring Massive Multitask Language Understanding. arXiv preprint arXiv:2009.03300.
-
[110]
Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them. arXiv preprint arXiv:2210.09261.
-
[111]
Program Synthesis with Large Language Models. arXiv preprint arXiv:2108.07732.
-
[112]
Liang Xu, Hai Hu, Xuanwei Zhang, Lu Li, Chenjie Cao, Yudong Li, Yechen Xu, Kai Sun, Dian Yu, Cong Yu, Yin Tian, Qianqian Dong, Weitang Liu, Bo Shi, Yiming Cui, Junyi Li, Jun Zeng, Rongzhao Wang, Weijian Xie, Yanting Li, Yina Patterson, Zuoyu Tian, Yiwen Zhang, He Zhou, Shaoweihua Liu ... Proceedings of the 28th International Conference on Computational Linguistics.
-
[113]
Haonan Li, Yixuan Zhang, Fajri Koto, Yifei Yang, Hai Zhao, Yeyun Gong, Nan Duan, and Timothy Baldwin.
-
[114]
Chujie Zheng, Minlie Huang, and Aixin Sun. ChID: ... Proceedings of the 57th Conference of the Association for Computational Linguistics, 2019. doi:10.18653/V1/P19-1075.
-
[115]
Guokun Lai, Qizhe Xie, Hanxiao Liu, Yiming Yang, and Eduard H. Hovy. RACE: Large-scale ReAding comprehension dataset from examinations. Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, 2017. doi:10.18653/V1/D17-1082.
-
[116]
Dheeru Dua, Yizhong Wang, Pradeep Dasigi, Gabriel Stanovsky, Sameer Singh, and Matt Gardner. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2019. doi:10.18653/V1/N19-1246.
-
[117]
Yuzhen Huang, Yuzhuo Bai, Zhihao Zhu, Junlei Zhang, Jinghan Zhang, Tangjun Su, Junteng Liu, Chuancheng Lv, Yikai Zhang, Jiayi Lei, et al.
-
[118]
Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, et al. LLaMA: Open and Efficient Foundation Language Models. arXiv preprint arXiv:2302.13971.
-
[119]
Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton... Llama 2: Open Foundation and Fine-Tuned Chat Models. 2023. doi:10.48550/arXiv.2307.09288.