pith. machine review for the scientific record.

arXiv: 2401.02954 · v1 · submitted 2024-01-05 · cs.CL · cs.AI · cs.LG

Recognition: 3 theorem links

DeepSeek LLM: Scaling Open-Source Language Models with Longtermism

Pith reviewed 2026-05-11 06:03 UTC · model grok-4.3

keywords: scaling laws · large language models · open-source models · pre-training dataset · supervised fine-tuning · direct preference optimization · benchmark evaluation · model performance

The pith

DeepSeek LLM 67B surpasses LLaMA-2 70B on code, mathematics and reasoning benchmarks, with its chat version exceeding GPT-3.5 in open-ended evaluations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines scaling laws for large language models and reports findings that guide training at two common open-source sizes, 7 billion and 67 billion parameters. The DeepSeek LLM base models are pre-trained on a dataset that currently holds 2 trillion tokens and is designed to keep growing. The 67B base model outperforms LLaMA-2 70B on standard benchmarks, especially in code, mathematics and reasoning. Supervised fine-tuning followed by direct preference optimization then produces the chat versions, and the 67B chat model outperforms GPT-3.5 in open-ended tests. A sympathetic reader would care because the work shows how open projects can pursue steady, long-horizon scaling to narrow the gap with proprietary systems using publicly described methods and data growth.

Core claim

Guided by our distinctive findings on scaling laws, we train DeepSeek LLM base models in 7B and 67B configurations on a dataset that currently consists of 2 trillion tokens and is continuously expanding. We further apply supervised fine-tuning and direct preference optimization to produce DeepSeek Chat models. Evaluation shows that DeepSeek LLM 67B surpasses LLaMA-2 70B across various benchmarks with particular strength in code, mathematics and reasoning, while open-ended evaluations indicate that DeepSeek LLM 67B Chat exhibits superior performance compared to GPT-3.5.

What carries the argument

Distinctive findings on scaling laws that guide effective training in 7B and 67B sizes, implemented through a continuously expanding 2 trillion token dataset plus supervised fine-tuning and direct preference optimization.

If this is right

  • DeepSeek LLM 67B records higher scores than LLaMA-2 70B on standard benchmarks, especially those involving code, mathematics and reasoning.
  • The 67B chat model achieves better results than GPT-3.5 when evaluated on open-ended tasks.
  • The same scaling approach with ongoing data growth can be applied to produce further improvements in open-source models at these sizes.
  • Long-term expansion of the training dataset supports continued progress without requiring changes to the core training configuration.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the identified scaling patterns persist, further growth of the token dataset beyond the current 2 trillion could yield additional performance lifts in the same model sizes.
  • Open projects following this data-first, long-horizon route may gradually close capability gaps with closed models on reasoning-heavy tasks.
  • Re-running the comparisons on entirely new benchmark suites would test whether the observed advantages generalize beyond the reported set.
  • The emphasis on sustained data collection could encourage similar multi-year efforts in other open-source language-model initiatives.

Load-bearing premise

The selected benchmarks and open-ended evaluations measure genuine model capability without undisclosed overlap in training data or advantages in methodology.

What would settle it

Independent re-testing on a fresh set of benchmarks withheld from the original evaluation that shows DeepSeek LLM 67B no longer outperforming LLaMA-2 70B or its chat version no longer exceeding GPT-3.5.

Original abstract

The rapid development of open-source large language models (LLMs) has been truly remarkable. However, the scaling law described in previous literature presents varying conclusions, which casts a dark cloud over scaling LLMs. We delve into the study of scaling laws and present our distinctive findings that facilitate scaling of large scale models in two commonly used open-source configurations, 7B and 67B. Guided by the scaling laws, we introduce DeepSeek LLM, a project dedicated to advancing open-source language models with a long-term perspective. To support the pre-training phase, we have developed a dataset that currently consists of 2 trillion tokens and is continuously expanding. We further conduct supervised fine-tuning (SFT) and Direct Preference Optimization (DPO) on DeepSeek LLM Base models, resulting in the creation of DeepSeek Chat models. Our evaluation results demonstrate that DeepSeek LLM 67B surpasses LLaMA-2 70B on various benchmarks, particularly in the domains of code, mathematics, and reasoning. Furthermore, open-ended evaluations reveal that DeepSeek LLM 67B Chat exhibits superior performance compared to GPT-3.5.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 3 minor

Summary. The paper introduces DeepSeek LLM, an open-source project focused on long-term scaling of LLMs. It reports empirical studies of scaling laws for 7B and 67B models, describes pre-training a base model on a 2-trillion-token dataset that continues to grow, and applies supervised fine-tuning (SFT) followed by Direct Preference Optimization (DPO) to create chat variants. The central empirical claims are that DeepSeek LLM 67B outperforms LLaMA-2 70B on code, mathematics, and reasoning benchmarks, and that the 67B Chat model shows superior performance to GPT-3.5 in open-ended evaluations.

Significance. If the benchmark results hold under scrutiny, the work is significant for advancing reproducible open-source LLMs by releasing competitive 67B-scale models trained with explicit long-term data scaling. The inclusion of scaling-law experiments, dataset construction details, and decontamination protocols in the methods section provides a useful reference for the community and supports the reported performance deltas.

major comments (2)
  1. [Evaluation] Open-ended evaluation section: the claim that DeepSeek LLM 67B Chat exhibits superior performance to GPT-3.5 rests on unspecified details of the evaluation protocol (prompting strategy, judge model or human raters, and any agreement metrics). Without these, the result cannot be independently verified and is load-bearing for the chat-model contribution.
  2. [§5] Benchmark results (tables in §5): while decontamination steps are described, the paper does not report the fraction of test-set overlap removed or provide before/after scores; this leaves open the possibility that domain-specific gains (code/math) partly reflect data leakage rather than model capability.
minor comments (3)
  1. [Scaling Laws] Figure captions for scaling curves should explicitly list the fitted exponents and any confidence intervals; current plots are difficult to reproduce from the text alone.
  2. [Abstract] The abstract uses 'longtermism' without definition; a one-sentence gloss would improve accessibility for readers outside the immediate subfield.
  3. [Evaluation] Several benchmark tables lack standard deviations or number of runs; adding these would strengthen the statistical interpretation of the reported deltas over LLaMA-2.
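
The first minor comment asks for fitted exponents with confidence intervals. As a rough sketch of what that reporting could look like, the snippet below fits a power law y = a * x**b by least squares in log-log space and returns a standard error for the exponent. The function name `fit_power_law` and the data points are hypothetical and synthetic; they are not the paper's measurements, and the paper's own fitting procedure may differ.

```python
import math

def fit_power_law(xs, ys):
    """Fit y = a * x**b by ordinary least squares on (log x, log y).

    Returns (a, b, stderr_b), where stderr_b is the standard error of b.
    """
    if len(xs) != len(ys) or len(xs) < 3:
        raise ValueError("need at least three (x, y) pairs of equal length")
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(lx)
    mean_x = sum(lx) / n
    mean_y = sum(ly) / n
    sxx = sum((x - mean_x) ** 2 for x in lx)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(lx, ly))
    b = sxy / sxx
    log_a = mean_y - b * mean_x
    # Residual variance uses n - 2 degrees of freedom (two fitted parameters).
    rss = sum((y - (log_a + b * x)) ** 2 for x, y in zip(lx, ly))
    stderr_b = math.sqrt(rss / (n - 2) / sxx)
    return math.exp(log_a), b, stderr_b

# Synthetic (compute, loss) points, made up for illustration only.
compute = [1e17, 1e18, 1e19, 1e20]
loss = [5.0 * c ** -0.05 for c in compute]
a, b, se = fit_power_law(compute, loss)
print(f"exponent = {b:.4f} +/- {1.96 * se:.4f}")
```

The 1.96 multiplier assumes a normal approximation; with only a handful of points, a Student-t quantile or a bootstrap over runs would give a more honest interval.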

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the constructive feedback and positive overall assessment. We address each major comment below, indicating where revisions will be made to enhance verifiability and transparency.

Point-by-point responses
  1. Referee: [Evaluation] Open-ended evaluation section: the claim that DeepSeek LLM 67B Chat exhibits superior performance to GPT-3.5 rests on unspecified details of the evaluation protocol (prompting strategy, judge model or human raters, and any agreement metrics). Without these, the result cannot be independently verified and is load-bearing for the chat-model contribution.

    Authors: We agree that full specification of the evaluation protocol is necessary for independent verification of the open-ended results. In the revised manuscript we will expand the relevant section to detail the prompting strategy, the judge model employed, the involvement of human raters (if any), and quantitative agreement metrics such as inter-rater reliability scores. These additions will directly support the claim of superior performance relative to GPT-3.5.

    Revision: yes

  2. Referee: [§5] Benchmark results (tables in §5): while decontamination steps are described, the paper does not report the fraction of test-set overlap removed or provide before/after scores; this leaves open the possibility that domain-specific gains (code/math) partly reflect data leakage rather than model capability.

    Authors: We acknowledge the value of quantifying the decontamination impact. We will revise the methods and results sections to report the fraction of test-set overlap removed for each benchmark category. However, providing complete before/after benchmark scores would require retraining the 67B model on the full 2-trillion-token corpus without decontamination, which is computationally prohibitive. We will instead clarify that the described decontamination procedure was applied uniformly and that performance advantages appear consistently across diverse benchmarks.

    Revision: partial
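
The first response promises quantitative agreement metrics. One common choice is Cohen's kappa, which corrects raw agreement between two raters for the agreement expected by chance. The sketch below is a minimal illustration under that assumption. The label values and the function name `cohens_kappa` are hypothetical, and nothing here reflects the protocol the authors actually used.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: share of items where both raters chose the same label.
    observed = sum(x == y for x, y in zip(labels_a, labels_b)) / n
    # Chance agreement from each rater's marginal label frequencies.
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one and the same label throughout
    return (observed - expected) / (1.0 - expected)

# Hypothetical pairwise verdicts ("A" wins, "B" wins, or "tie") from two judges.
judge_1 = ["A", "A", "B", "B", "tie", "A"]
judge_2 = ["A", "B", "B", "B", "tie", "A"]
kappa = cohens_kappa(judge_1, judge_2)
print(f"kappa = {kappa:.3f}")
```

Kappa near 1 indicates agreement well above chance, while values near 0 mean the raters agree about as often as random labelling would.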

standing simulated objections not resolved
  • Provision of before/after benchmark scores comparing models trained with and without decontamination, due to the prohibitive computational cost of retraining at 2-trillion-token scale.
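
The overlap fraction the authors agree to report can be measured without any retraining. The sketch below assumes a simple word-level n-gram match between benchmark items and training documents. The tokenization, the window size `n`, and the function names are illustrative assumptions, not the paper's decontamination procedure.

```python
def ngrams(text, n):
    """Set of word-level n-grams in a lowercased, whitespace-tokenized text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def contaminated_fraction(test_items, corpus_docs, n=8):
    """Share of test items sharing at least one n-gram with the corpus."""
    if not test_items:
        return 0.0
    corpus_grams = set()
    for doc in corpus_docs:
        corpus_grams |= ngrams(doc, n)
    flagged = sum(1 for item in test_items if ngrams(item, n) & corpus_grams)
    return flagged / len(test_items)

# Toy data, made up for illustration only.
corpus = ["the quick brown fox jumps over the lazy dog"]
benchmark = ["a quick brown fox appears", "completely unrelated question here"]
fraction = contaminated_fraction(benchmark, corpus, n=3)
print(f"flagged fraction = {fraction:.2f}")
```

Reporting this fraction per benchmark category would show how much material the filter touched, even though it cannot replace the before/after scores that remain unresolved.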

Circularity Check

0 steps flagged

No significant circularity; empirical benchmark results are self-contained

Full rationale

The paper's core claims consist of observed performance deltas on external benchmarks (code, math, reasoning, open-ended chat) after training a 67B model on an expanding 2T-token corpus followed by SFT+DPO. Scaling-law experiments are described as guiding dataset and model choices but do not reduce any reported result to a fitted parameter renamed as a prediction; the evaluation protocols, decontamination steps, and few-shot settings are stated explicitly and independently of the final scores. No self-definitional equations, load-bearing self-citations, or ansatz smuggling appear in the derivation chain.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

The central claims rest on empirical training runs and benchmark comparisons rather than new theoretical derivations; the main unstated inputs are standard assumptions about scaling laws and the effectiveness of SFT/DPO.

free parameters (2)
  • model scale
    7B and 67B sizes selected after scaling-law study
  • pre-training data volume
    2 trillion tokens assembled for the reported runs
axioms (2)
  • domain assumption: Scaling laws reliably predict performance gains with increased model size and data
    Paper states it delved into scaling laws to guide the 7B/67B choices
  • domain assumption: SFT followed by DPO produces aligned chat models that generalize on benchmarks
    Used to create the Chat variants whose superiority is claimed
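
For readers unfamiliar with the second axiom, the sketch below computes the standard per-example DPO objective from Rafailov et al. (2023): the negative log-sigmoid of a scaled difference between the policy's and the reference model's log-probability ratios for a preferred and a rejected response. The function name `dpo_loss`, the argument names, and `beta=0.1` are illustrative assumptions. This page does not state the hyperparameters the authors used.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-example DPO loss from sequence log-probabilities."""
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(margin)) written as softplus(-margin) for numerical stability.
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

# Hypothetical log-probabilities, made up for illustration only.
neutral = dpo_loss(-10.0, -10.0, -10.0, -10.0)   # policy equals reference
improved = dpo_loss(-5.0, -10.0, -10.0, -10.0)   # policy favours the chosen answer
print(neutral, improved)
```

When the policy matches the reference the loss equals log 2, and it falls as the policy shifts probability toward the preferred response relative to the reference.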

pith-pipeline@v0.9.0 · 5848 in / 1451 out tokens · 62866 ms · 2026-05-11T06:03:01.583359+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Tag meanings

  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 44 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Large Language Diffusion Models

    cs.CL 2025-02 unverdicted novelty 8.0

    LLaDA is a scalable diffusion-based language model that matches autoregressive LLMs like LLaMA3 8B on tasks and surpasses GPT-4o on reversal poem completion.

  2. IndicMedDialog: A Parallel Multi-Turn Medical Dialogue Dataset for Accessible Healthcare in Indic Languages

    cs.CL 2026-05 unverdicted novelty 7.0

    A parallel multi-turn medical dialogue dataset spanning English and nine Indic languages is created from synthetic consultations to enable personalized AI healthcare interactions.

  3. ENEC: A Lossless AI Model Compression Method Enabling Fast Inference on Ascend NPUs

    cs.AR 2026-03 unverdicted novelty 7.0

    ENEC delivers 3.43X higher throughput than DietGPU and 1.12X better compression ratio than nvCOMP for lossless model weight compression on Ascend NPUs, yielding up to 6.3X end-to-end inference speedup.

  4. Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation

    cs.CV 2024-06 conditional novelty 7.0

    Scaled vanilla autoregressive models based on Llama achieve 2.18 FID on ImageNet 256x256 image generation, beating popular diffusion models without visual inductive biases.

  5. DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model

    cs.CL 2024-05 unverdicted novelty 7.0

    DeepSeek-V2 delivers top-tier open-source LLM performance using only 21B active parameters by compressing the KV cache 93.3% and cutting training costs 42.5% via MLA and DeepSeekMoE.

  6. Know When To Fold 'Em: Token-Efficient LLM Synthetic Data Generation via Multi-Stage In-Flight Rejection

    cs.AI 2026-05 unverdicted novelty 6.0

    MSIFR stops faulty LLM generations early via staged rule-based checks, reducing token consumption 11-78% with no accuracy loss.

  7. PEML: Parameter-efficient Multi-Task Learning with Optimized Continuous Prompts

    cs.CL 2026-05 unverdicted novelty 6.0

    PEML co-optimizes continuous prompts and low-rank adaptations to deliver up to 6.67% average accuracy gains over existing multi-task PEFT methods on GLUE, SuperGLUE, and other benchmarks.

  8. SAGE: Scalable Automated Robustness Augmentation for LLM Knowledge Evaluation

    cs.CL 2026-05 unverdicted novelty 6.0

    SAGE trains a rubric-based verifier and an RL-optimized generator on seed human data to scalably augment LLM knowledge benchmarks, matching human-annotated quality on HellaSwag at lower cost and generalizing to MMLU.

  9. Causal Bias Detection in Generative Artifical Intelligence

    cs.AI 2026-05 unverdicted novelty 6.0

    A causal framework unifies fairness analysis across generative AI and standard ML by deriving decompositions that separate biases along causal pathways and differences between real-world and model mechanisms.

  10. Navigating the Sea of LLM Evaluation: Investigating Bias in Toxicity Benchmarks

    cs.AI 2026-05 unverdicted novelty 6.0

    Toxicity benchmarks for LLMs produce inconsistent results when task type, input domain, or model changes, revealing intrinsic evaluation biases.

  11. Training continuously-coupled reconfigurable photonic chips with quantum machine learning

    quant-ph 2026-05 unverdicted novelty 6.0

    A black-box machine learning technique trains continuously-coupled photonic waveguide arrays to implement target unitaries using limited single- and two-photon measurements without requiring detailed internal models.

  12. Predicting Large Model Test Losses with a Noisy Quadratic System

    cs.LG 2026-05 unverdicted novelty 6.0

    A noisy quadratic system predicts large model test losses from N, B, K and outperforms Chinchilla's model for extrapolation up to 1000x compute.

  13. DSPE: An Energy-Efficient Edge Processor for DeepSeek Inference with MerkleTree-based Incremental Pruning, Multi-Stage Boothing Lookup and Dynamic Adaptive Posit Processing

    cs.AR 2026-05 unverdicted novelty 6.0

    DSPE is an edge processor that achieves 109.4 TFLOPS/W for DeepSeek inference using Merkle tree-based incremental pruning, multi-stage boothing lookup, and dynamic adaptive posit processing.

  14. RELO: Reinforcement Learning to Localize for Visual Object Tracking

    cs.CV 2026-05 unverdicted novelty 6.0

    RELO replaces handcrafted spatial priors with a reinforcement learning policy for target localization in visual tracking and reports 57.5% AUC on LaSOText without template updates.

  15. Why Does Agentic Safety Fail to Generalize Across Tasks?

    cs.LG 2026-05 conditional novelty 6.0

    Agentic safety fails to generalize across tasks because the task-to-safe-controller mapping has a higher Lipschitz constant than the task-to-controller mapping alone, as proven in linear-quadratic control and demonstr...

  16. InfoLaw: Information Scaling Laws for Large Language Models with Quality-Weighted Mixture Data and Repetition

    cs.CL 2026-05 unverdicted novelty 6.0

    InfoLaw models pretraining as information accumulation where quality sets information density and repetition causes scale-dependent diminishing returns, predicting loss with low error on unseen mixtures and larger sca...

  17. Rethinking LLM Ensembling from the Perspective of Mixture Models

    cs.LG 2026-05 unverdicted novelty 6.0

    ME reinterprets LLM ensembling as a mixture model by sampling a single model stochastically at each token step, matching the ensemble distribution while invoking only one model per step for substantial speed gains.

  18. ReaGeo: Reasoning-Enhanced End-to-End Geocoding with LLMs

    cs.AI 2026-04 unverdicted novelty 6.0

    ReaGeo is an end-to-end LLM framework for geocoding that uses geohash text generation, Chain-of-Thought spatial reasoning, and distance-based RL to accurately predict points and regions from explicit and vague queries.

  19. Universally Empowering Zeroth-Order Optimization via Adaptive Layer-wise Sampling

    cs.LG 2026-04 unverdicted novelty 6.0

    AdaLeZO uses a non-stationary multi-armed bandit to adaptively allocate perturbation budget across layers in zeroth-order optimization and applies inverse probability weighting to reduce variance while preserving unbi...

  20. Towards Faster Language Model Inference Using Mixture-of-Experts Flow Matching

    cs.AI 2026-04 unverdicted novelty 6.0

    Mixture-of-experts flow matching enables non-autoregressive language models to achieve autoregressive-level quality in three sampling steps, delivering up to 1000x faster inference than diffusion models.

  21. Dataset-Level Metrics Attenuate Non-Determinism: A Fine-Grained Non-Determinism Evaluation in Diffusion Language Models

    cs.LG 2026-04 unverdicted novelty 6.0

    Dataset-level metrics in diffusion language models mask substantial sample-level non-determinism that varies with model and system factors, which a new Factor Variance Attribution metric can decompose.

  22. AFGNN: API Misuse Detection using Graph Neural Networks and Clustering

    cs.SE 2026-04 unverdicted novelty 6.0

    AFGNN detects API misuses in Java code more effectively than prior methods by representing usage as graphs and clustering learned embeddings from self-supervised training.

  23. Muon is Scalable for LLM Training

    cs.LG 2025-02 unverdicted novelty 6.0

    Muon optimizer with weight decay and update scaling achieves ~2x efficiency over AdamW for large LLMs, shown via the Moonlight 3B/16B MoE model trained on 5.7T tokens.

  24. MiniCPM: Unveiling the Potential of Small Language Models with Scalable Training Strategies

    cs.CL 2024-04 conditional novelty 6.0

    MiniCPM 1.2B and 2.4B models reach parity with 7B-13B LLMs via model wind-tunnel scaling and a WSD scheduler that yields a higher optimal data-to-model ratio than Chinchilla scaling.

  25. Are We on the Right Way for Evaluating Large Vision-Language Models?

    cs.CV 2024-03 conditional novelty 6.0

    Current LVLM benchmarks overestimate capabilities because many questions can be answered without images due to design flaws or data leakage; MMStar is a human-curated set of 1,500 vision-indispensable samples across 6...

  26. LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code

    cs.SE 2024-03 unverdicted novelty 6.0

    LiveCodeBench collects 400 recent contest problems to create a contamination-free benchmark evaluating LLMs on code generation and related capabilities like self-repair and execution.

  27. Handling and Interpreting Missing Modalities in Patient Clinical Trajectories via Autoregressive Sequence Modeling

    cs.LG 2026-04 unverdicted novelty 5.0

    Autoregressive transformer modeling with missingness-aware contrastive pre-training outperforms baselines on MIMIC-IV and eICU benchmarks and mitigates divergent behavior from removed modalities in clinical trajectories.

  28. Why Do Vision Language Models Struggle To Recognize Human Emotions?

    cs.CV 2026-04 unverdicted novelty 5.0

    VLMs fail at dynamic facial expression recognition because web-scale pretraining exacerbates long-tailed class bias and sparse frame sampling misses micro-expressions; a multi-stage context enrichment strategy using l...

  29. Reconstruction of a 3D wireframe from a single line drawing via generative depth estimation

    cs.CV 2026-04 unverdicted novelty 5.0

    A latent diffusion model conditioned on line drawings estimates dense depth to reconstruct 3D wireframes, reporting 5.3% average depth error after training on over one million pairs.

  30. The Cognitive Circuit Breaker: A Systems Engineering Framework for Intrinsic AI Reliability

    cs.SE 2026-04 unverdicted novelty 5.0

    The Cognitive Circuit Breaker detects LLM hallucinations by computing the Cognitive Dissonance Delta between semantic confidence and latent certainty from hidden states, adding negligible overhead.

  31. RefineRAG: Word-Level Poisoning Attacks via Retriever-Guided Text Refinement

    cs.CR 2026-04 unverdicted novelty 5.0

    RefineRAG achieves 90% attack success on NQ by generating toxic seeds then optimizing them via retriever-in-the-loop word refinement, outperforming prior methods on effectiveness and naturalness.

  32. Tug-of-War within A Decade: Conflict Resolution in Vulnerability Analysis via Teacher-Guided Retrieval-Augmented Generations

    cs.CL 2026-03 unverdicted novelty 5.0

    CRVA-TGRAG combines parent-document segmentation, ensemble retrieval, and teacher-guided fine-tuning to mitigate knowledge conflicts and improve accuracy in LLM-based CVE vulnerability analysis.

  33. Towards Reasoning Era: A Survey of Long Chain-of-Thought for Reasoning Large Language Models

    cs.AI 2025-03 unverdicted novelty 5.0

    The paper unifies perspectives on Long CoT in reasoning LLMs by introducing a taxonomy, detailing characteristics of deep reasoning and reflection, and discussing emergence phenomena and future directions.

  34. DeepSeek-Coder: When the Large Language Model Meets Programming -- The Rise of Code Intelligence

    cs.SE 2024-01 unverdicted novelty 5.0

    DeepSeek-Coder open-source models trained on 2T code tokens with fill-in-the-blank pretraining achieve SOTA results among open models and surpass closed-source Codex and GPT-3.5 on code benchmarks.

  35. DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models

    cs.CL 2024-01 unverdicted novelty 5.0

    DeepSeekMoE 2B matches GShard 2.9B performance and approaches a dense 2B model; the 16B version matches LLaMA2-7B at 40% compute by using fine-grained expert segmentation plus shared experts.

  36. Reinforcement Learning for Compositional Generalization with Outcome-Level Optimization

    cs.LG 2026-05 unverdicted novelty 4.0

    Outcome-level RL with binary or composite rewards improves compositional generalization over supervised fine-tuning by avoiding overfitting to frequent training patterns.

  37. Agentic Application in Power Grid Static Analysis: Automatic Code Generation and Error Correction

    eess.SY 2026-04 unverdicted novelty 4.0

    An LLM agent with static pre-check, dynamic feedback, and semantic validation generates MATPOWER code from natural language for power grid analysis at 82.38% fidelity.

  38. Identifying Topological Invariants of Non-Hermitian Systems via Domain-Adaptive Multimodal Model for Mathematics

    cond-mat.other 2026-04 unverdicted novelty 4.0

    A multimodal model with Qwen Math backbone identifies topological invariants of non-Hermitian systems from eigenvalues and eigenvectors in momentum space.

  39. Data Mixing for Large Language Models Pretraining: A Survey and Outlook

    cs.CL 2026-03 accept novelty 4.0

    A survey that taxonomizes data mixing strategies for LLM pretraining into static rule-based, learning-based, and dynamic adaptive families while highlighting transferability challenges and evaluation gaps.

  40. How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites

    cs.CV 2024-04 unverdicted novelty 4.0

    InternVL 1.5 narrows the performance gap to proprietary multimodal models via a stronger transferable vision encoder, dynamic high-resolution tiling, and curated English-Chinese training data.

  41. DeepSeek-VL: Towards Real-World Vision-Language Understanding

    cs.AI 2024-03 unverdicted novelty 4.0

    DeepSeek-VL develops open-source 1.3B and 7B vision-language models that achieve competitive or state-of-the-art results on real-world visual-language benchmarks through diverse data curation, a hybrid vision encoder,...

  42. TinyLlama: An Open-Source Small Language Model

    cs.CL 2024-01 accept novelty 4.0

    TinyLlama is a 1.1B-parameter open-source language model pretrained on 1 trillion tokens that outperforms other open-source models of similar size on downstream tasks.

  43. Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling

    cs.AI 2025-01 conditional novelty 3.0

    Scaling data, model size, and training optimization on the Janus architecture yields better multimodal understanding and more stable, instruction-following text-to-image generation.

  44. A Survey on Large Language Models for Code Generation

    cs.CL 2024-06 unverdicted novelty 3.0

    A systematic literature review that organizes recent work on LLMs for code generation into a taxonomy covering data curation, model advances, evaluations, ethics, environmental impact, and applications, with benchmark...

Reference graph

Works this paper leans on

128 extracted references · 128 canonical work pages · cited by 44 Pith papers · 31 internal anchors

  1. [2]

    Introducing Claude , 2023

    Anthropic. Introducing Claude , 2023. URL https://www.anthropic.com/index/introducing-claude

  2. [6]

    T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. M. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, and D. Amod...

  3. [7]

    M. Chen, J. Tworek, H. Jun, Q. Yuan, H. P. de Oliveira Pinto, J. Kaplan, H. Edwards, Y. Burda, N. Joseph, G. Brockman, A. Ray, R. Puri, G. Krueger, M. Petrov, H. Khlaaf, G. Sastry, P. Mishkin, B. Chan, S. Gray, N. Ryder, M. Pavlov, A. Power, L. Kaiser, M. Bavarian, C. Winter, P. Tillet, F. P. Such, D. Cummings, M. Plappert, F. Chantzis, E. Barnes, A. Herb...

  4. [10]

    Computer

    T. Computer. Redpajama: an open dataset for training large language models, 2023. URL https://github.com/togethercomputer/RedPajama-Data

  5. [12]

    T. Dao. Flash A ttention-2: Faster attention with better parallelism and work partitioning. 2023

  6. [13]

    T. Dao, D. Y. Fu, S. Ermon, A. Rudra, and C. R \'e . Flash A ttention: Fast and memory-efficient exact attention with IO -awareness. In Advances in Neural Information Processing Systems, 2022

  7. [14]

    Z. Du, Y. Qian, X. Liu, M. Ding, J. Qiu, Z. Yang, and J. Tang. Glm: General language model pretraining with autoregressive blank infilling. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 320--335, 2022

  8. [16]

    L. Gao, S. Biderman, S. Black, L. Golding, T. Hoppe, C. Foster, J. Phang, H. He, A. Thite, N. Nabeshima, et al. The Pile : An 800GB dataset of diverse text for language modeling. arXiv preprint arXiv:2101.00027, 2020

  9. [17]

    An important next step on our AI journey, 2023

    Google. An important next step on our AI journey, 2023. URL https://blog.google/technology/ai/bard-google-ai-search-updates/

  10. [24]

    Hai-llm: 高效且轻量的大模型训练工具, 2023

    High-flyer. Hai-llm: 高效且轻量的大模型训练工具, 2023. URL https://www.high-flyer.cn/en/blog/hai-llm

  11. [26]

    C-eval: A multi-level multi-discipline chinese evaluation suite for foundation models

    Y. Huang, Y. Bai, Z. Zhu, J. Zhang, J. Zhang, T. Su, J. Liu, C. Lv, Y. Zhang, J. Lei, et al. C-Eval : A multi-level multi-discipline chinese evaluation suite for foundation models. arXiv preprint arXiv:2305.08322, 2023

  12. [27]

    Tokenizers : Fast state-of-the-art tokenizers optimized for research and production, 2019

    Huggingface Team . Tokenizers : Fast state-of-the-art tokenizers optimized for research and production, 2019. URL https://github.com/huggingface/tokenizers

  13. [28]

    F. i, M. Suzgun, M. Freitag, X. Wang, S. Srivats, S. Vosoughi, H. W. Chung, Y. Tay, S. Ruder, D. Zhou, D. Das, and J. Wei. Language models are multilingual chain-of-thought reasoners. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023 . OpenReview.net, 2023. URL https://openreview.net/pdf?id=fR3wGCk-IXp

  14. [29]

    Ivison, Y

    H. Ivison, Y. Wang, V. Pyatkin, N. Lambert, M. Peters, P. Dasigi, J. Jang, D. Wadden, N. A. Smith, I. Beltagy, and H. Hajishirzi. Camels in a changing climate: Enhancing lm adaptation with tulu 2. 2023

  15. [33]

    V. A. Korthikanti, J. Casper, S. Lym, L. McAfee, M. Andersch, M. Shoeybi, and B. Catanzaro. Reducing activation recomputation in large transformer models. Proceedings of Machine Learning and Systems, 5, 2023

  16. [35]

    W. Kwon, Z. Li, S. Zhuang, Y. Sheng, L. Zheng, C. H. Yu, J. E. Gonzalez, H. Zhang, and I. Stoica. Efficient memory management for large language model serving with pagedattention. In Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles, 2023

  17. [37]

    H. Li, Y. Zhang, F. Koto, Y. Yang, H. Zhao, Y. Gong, N. Duan, and T. Baldwin. CMMLU : Measuring massive multitask language understanding in Chinese . arXiv preprint arXiv:2306.09212, 2023

  18. [38]

    W. Li, F. Qi, M. Sun, X. Yi, and J. Zhang. Ccpm: A chinese classical poetry matching dataset, 2021

  19. [43]

    Mihaylov, P

    T. Mihaylov, P. Clark, T. Khot, and A. Sabharwal. Can a suit of armor conduct electricity? a new dataset for open book question answering, 2018

  20. [44]

    D. Narayanan, M. Shoeybi, J. Casper, P. LeGresley, M. Patwary, V. Korthikanti, D. Vainbrand, P. Kashinkunti, J. Bernauer, B. Catanzaro, et al. Efficient large-scale language model training on gpu clusters using megatron-lm. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pages 1--15, 2021

  21. [45]

    Introducing ChatGPT, 2022

    OpenAI. Introducing ChatGPT, 2022. URL https://openai.com/blog/chatgpt

  22. [46]

    GPT-4 Technical Report

    OpenAI. GPT4 technical report. arXiv preprint arXiv:2303.08774, 2023

  23. [47]

    L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray, et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730--27744, 2022

  24. [49]

    A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever, et al. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019

  25. [50]

    R. Rafailov, A. Sharma, E. Mitchell, S. Ermon, C. D. Manning, and C. Finn. Direct preference optimization: Your language model is secretly a reward model. 2023

  26. [51]

    S. Rajbhandari, J. Rasley, O. Ruwase, and Y. He. Zero: Memory optimizations toward training trillion parameter models. In SC20: International Conference for High Performance Computing, Networking, Storage and Analysis, pages 1--16. IEEE, 2020

  27. [52]

    K. Sakaguchi, R. L. Bras, C. Bhagavatula, and Y. Choi. Winogrande: An adversarial winograd schema challenge at scale, 2019

  28. [53]

    C. J. Shallue, J. Lee, J. Antognini, J. Sohl-Dickstein, R. Frostig, and G. E. Dahl. Measuring the effects of data parallelism on neural network training. Journal of Machine Learning Research, 20(112):1--49, 2019

  29. [58]

    J. Su, M. Ahmed, Y. Lu, S. Pan, W. Bo, and Y. Liu. Roformer: Enhanced transformer with rotary position embedding. Neurocomputing, 568:127063, 2024

  30. [59]

    K. Sun, D. Yu, D. Yu, and C. Cardie. Investigating prior knowledge for challenging chinese machine reading comprehension, 2019

  31. [63]

    A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017

  32. [65]

    J. Wei, X. Wang, D. Schuurmans, M. Bosma, B. Ichter, F. Xia, E. H. Chi, Q. V. Le, and D. Zhou. Chain-of-thought prompting elicits reasoning in large language models. In NeurIPS, 2022. URL http://papers.nips.cc/paper_files/paper/2022/hash/9d5609613524ecf4f15af0f7b31abca4-Abstract-Conference.html

  33. [66]

    T. Wei, J. Luan, W. Liu, S. Dong, and B. Wang. Cmath: Can your language model pass chinese elementary school math test?, 2023

  34. [68]

    A. Yang, B. Xiao, B. Wang, B. Zhang, C. Yin, C. Lv, D. Pan, D. Wang, D. Yan, F. Yang, F. Deng, F. Wang, F. Liu, G. Ai, G. Dong, H. Zhao, H. Xu, H. Sun, H. Zhang, H. Liu, J. Ji, J. Xie, J. Dai, K. Fang, L. Su, L. Song, L. Liu, L. Ru, L. Ma, M. Wang, M. Liu, M. Lin, N. Nie, P. Guo, R. Sun, T. Zhang, T. Li, T. Li, W. Cheng, W. Chen, X. Zeng, X. Wang, X. Chen...

  35. [71]

    B. Zhang and R. Sennrich. Root mean square layer normalization. Advances in Neural Information Processing Systems, 32, 2019

  36. [72]

    G. Zhang, L. Li, Z. Nado, J. Martens, S. Sachdeva, G. Dahl, C. Shallue, and R. B. Grosse. Which algorithmic choices matter at which batch sizes? insights from a noisy quadratic model. Advances in neural information processing systems, 32, 2019

  37. [74]

    L. Zheng, W.-L. Chiang, Y. Sheng, S. Zhuang, Z. Wu, Y. Zhuang, Z. Lin, Z. Li, D. Li, E. P. Xing, H. Zhang, J. E. Gonzalez, and I. Stoica. Judging llm-as-a-judge with mt-bench and chatbot arena. 2023

  38. [77]

    Freda Shi, Mirac Suzgun, Markus Freitag, Xuezhi Wang, Suraj Srivats, Soroush Vosoughi, Hyung Won Chung, Yi Tay, Sebastian Ruder, Denny Zhou, Dipanjan Das, and Jason Wei. Language models are multilingual chain-of-thought reasoners. The Eleventh International Conference on Learning Representations, 2023

  39. [78]

    Roformer: Enhanced transformer with rotary position embedding. Neurocomputing, 2024

  40. [79]

    GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints

    GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints. arXiv preprint arXiv:2305.13245

  41. [80]

    Wizardmath: Empowering mathematical reasoning for large language models via reinforced evol-instruct

    Wizardmath: Empowering mathematical reasoning for large language models via reinforced evol-instruct. arXiv preprint arXiv:2308.09583

  42. [81]

    Tora: A tool-integrated reasoning agent for mathematical problem solving

    Zhibin Gou, Zhihong Shao, Yeyun Gong, Yelong Shen, Yujiu Yang, Minlie Huang, Nan Duan, and Weizhu Chen. Tora: A tool-integrated reasoning agent for mathematical problem solving. CoRR, 2023. doi:10.48550/ARXIV.2309.17452

  43. [82]

    Program of Thoughts Prompting: Disentangling Computation from Reasoning for Numerical Reasoning Tasks

    Wenhu Chen, Xueguang Ma, Xinyi Wang, and William W. Cohen. Program of Thoughts Prompting: Disentangling Computation from Reasoning for Numerical Reasoning Tasks. CoRR, 2022. doi:10.48550/ARXIV.2211.12588

  44. [83]

    Luyu Gao, Aman Madaan, Shuyan Zhou, Uri Alon, Pengfei Liu, Yiming Yang, Jamie Callan, and Graham Neubig. International Conference on Machine Learning, 2023

  45. [84]

    Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed H. Chi, Quoc V. Le, and Denny Zhou. Chain-of-thought prompting elicits reasoning in large language models. NeurIPS, 2022

  46. [85]

    Swaroop Mishra, Matthew Finlayson, Pan Lu, Leonard Tang, Sean Welleck, Chitta Baral, Tanmay Rajpurohit, Oyvind Tafjord, Ashish Sabharwal, Peter Clark, and Ashwin Kalyan. Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, 2022. doi:10.18653/V1/2022.EMNLP-MAIN.392

  47. [86]

    Xiang Yue, Xingwei Qu, Ge Zhang, Yao Fu, Wenhao Huang, Huan Sun, Yu Su, and Wenhu Chen. CoRR, 2023. doi:10.48550/ARXIV.2309.05653

  48. [87]

    MetaMath: Bootstrap Your Own Mathematical Questions for Large Language Models

    Longhui Yu, Weisen Jiang, Han Shi, Jincheng Yu, Zhengying Liu, Yu Zhang, James T. Kwok, Zhenguo Li, Adrian Weller, and Weiyang Liu. MetaMath: Bootstrap Your Own Mathematical Questions for Large Language Models. CoRR, 2023. doi:10.48550/ARXIV.2309.12284

  49. [88]

    TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension

    Mandar Joshi, Eunsol Choi, Daniel Weld, and Luke Zettlemoyer. TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension. Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2017. doi:10.18653/v1/P17-1147

  50. [89]

    Language Models are Few-Shot Learners, 2020

  51. [90]

    Language models are unsupervised multitask learners. OpenAI blog

  52. [91]

    OpenAI. Introducing

  53. [92]

    HAI-LLM: An efficient and lightweight training tool for large models (高效且轻量的大模型训练工具)

  54. [93]

    Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism

    Megatron-LM: Training multi-billion parameter language models using model parallelism. arXiv preprint arXiv:1909.08053

  55. [94]

    Efficient large-scale language model training on gpu clusters using megatron-lm. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis

  56. [95]

    Reducing activation recomputation in large transformer models. Proceedings of Machine Learning and Systems

  57. [96]

    Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, and R. Flash. Advances in Neural Information Processing Systems

  58. [97]

    Tri Dao. Flash

  59. [98]

    Efficient Memory Management for Large Language Model Serving with PagedAttention. Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles

  60. [99]

    Attention is all you need. Advances in neural information processing systems

  61. [100]

    Zero: Memory optimizations toward training trillion parameter models. SC20: International Conference for High Performance Computing, Networking, Storage and Analysis, 2020

  62. [101]

    CCPM: A Chinese Classical Poetry Matching Dataset, 2021

  63. [102]

    Can a Suit of Armor Conduct Electricity? A New Dataset for Open Book Question Answering, 2018

  64. [103]

    Anthropic. Introducing

  65. [104]

    Google. An important next step on our

  66. [105]

    Investigating Prior Knowledge for Challenging Chinese Machine Reading Comprehension, 2019

  67. [106]

    A Span-Extraction Dataset for Chinese Machine Reading Comprehension

    Yiming Cui, Ting Liu, Wanxiang Che, Li Xiao, Zhipeng Chen, Wentao Ma, Shijin Wang, and Guoping Hu. A Span-Extraction Dataset for Chinese Machine Reading Comprehension. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (E...

  68. [107]

    WinoGrande: An Adversarial Winograd Schema Challenge at Scale, 2019

  69. [108]

    CMATH: Can Your Language Model Pass Chinese Elementary School Math Test?, 2023

  70. [109]

    Measuring Massive Multitask Language Understanding

    Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300

  71. [110]

    Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them

    Challenging big-bench tasks and whether chain-of-thought can solve them. arXiv preprint arXiv:2210.09261

  72. [111]

    Program Synthesis with Large Language Models

    Program synthesis with large language models. arXiv preprint arXiv:2108.07732

  73. [112]

    Proceedings of the 28th International Conference on Computational Linguistics

    Liang Xu, Hai Hu, Xuanwei Zhang, Lu Li, Chenjie Cao, Yudong Li, Yechen Xu, Kai Sun, Dian Yu, Cong Yu, Yin Tian, Qianqian Dong, Weitang Liu, Bo Shi, Yiming Cui, Junyi Li, Jun Zeng, Rongzhao Wang, Weijian Xie, Yanting Li, Yina Patterson, Zuoyu Tian, Yiwen Zhang, He Zhou, Shaoweihua Liu ...

  74. [113]

    Haonan Li, Yixuan Zhang, Fajri Koto, Yifei Yang, Hai Zhao, Yeyun Gong, Nan Duan, and Timothy Baldwin. CMMLU: Measuring massive multitask language understanding in Chinese. arXiv preprint arXiv:2306.09212, 2023

  75. [114]

    Chujie Zheng, Minlie Huang, and Aixin Sun. ChID. Proceedings of the 57th Conference of the Association for Computational Linguistics, 2019. doi:10.18653/V1/P19-1075

  76. [115]

    RACE: Large-scale ReAding comprehension dataset from examinations

    Guokun Lai, Qizhe Xie, Hanxiao Liu, Yiming Yang, and Eduard H. Hovy. RACE: Large-scale ReAding comprehension dataset from examinations. Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, 2017. doi:10.18653/V1/D17-1082

  77. [116]

    Dheeru Dua, Yizhong Wang, Pradeep Dasigi, Gabriel Stanovsky, Sameer Singh, and Matt Gardner. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2019. doi:10.18653/V1/N19-1246

  78. [117]

    Yuzhen Huang, Yuzhuo Bai, Zhihao Zhu, Junlei Zhang, Jinghan Zhang, Tangjun Su, Junteng Liu, Chuancheng Lv, Yikai Zhang, Jiayi Lei, et al. C-Eval: A multi-level multi-discipline chinese evaluation suite for foundation models. arXiv preprint arXiv:2305.08322, 2023

  79. [118]

    LLaMA: Open and Efficient Foundation Language Models

    Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, et al. LLaMA: Open and Efficient Foundation Language Models. arXiv preprint arXiv:2302.13971

  80. [119]

    Llama 2: Open Foundation and Fine-Tuned Chat Models

    Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton, et al. Llama 2: Open Foundation and Fine-Tuned Chat Models, 2023. doi:10.48550/arXiv.2307.09288

Showing first 80 references.