arxiv: 2401.02954 · v1 · submitted 2024-01-05 · 💻 cs.CL · cs.AI · cs.LG

Recognition: 3 theorem links


DeepSeek LLM: Scaling Open-Source Language Models with Longtermism

DeepSeek-AI: Xiao Bi, Deli Chen, Guanting Chen, Shanhuang Chen, Damai Dai, Chengqi Deng, Honghui Ding, Kai Dong, Qiushi Du, Zhe Fu, Huazuo Gao, Kaige Gao, Wenjun Gao, Ruiqi Ge, Kang Guan, Daya Guo, Jianzhong Guo, Guangbo Hao, Zhewen Hao, Ying He, Wenjie Hu, Panpan Huang, Erhang Li, Guowei Li, Jiashi Li, Yao Li, Y.K. Li, Wenfeng Liang, Fangyun Lin, A.X. Liu, Bo Liu, Wen Liu, XiaoDong Liu, Xin Liu, Yiyuan Liu, Haoyu Lu, Shanghao Lu, Fuli Luo, Shirong Ma, Xiaotao Nie, Tian Pei, Yishi Piao, Junjie Qiu, Hui Qu, Tongzheng Ren, Zehui Ren, Chong Ruan, Zhangli Sha, Zhihong Shao, Junxiao Song, Xuecheng Su, Jingxiang Sun, Yaofeng Sun, Minghui Tang, Bingxuan Wang, Peiyi Wang, Shiyu Wang, Yaohui Wang, Yongji Wang, Tong Wu, Y. Wu, Xin Xie, Zhenda Xie, Ziwei Xie, Yiliang Xiong, Hanwei Xu, R.X. Xu, Yanhong Xu, Dejian Yang, Yuxiang You, Shuiping Yu, Xingkai Yu, B. Zhang, Haowei Zhang, Lecong Zhang, Liyue Zhang, Mingchuan Zhang, Minghua Zhang, Wentao Zhang, Yichao Zhang, Chenggang Zhao, Yao Zhao, Shangyan Zhou, Shunfeng Zhou, Qihao Zhu, Yuheng Zou


Pith reviewed 2026-05-11 06:03 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.LG

keywords: scaling laws, large language models, open-source models, pre-training dataset, supervised fine-tuning, direct preference optimization, benchmark evaluation, model performance


The pith

DeepSeek LLM 67B surpasses LLaMA-2 70B on code, mathematics and reasoning benchmarks, with its chat version exceeding GPT-3.5 in open-ended evaluations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines scaling laws for large language models and identifies patterns that support effective training at two common open-source sizes, 7 billion and 67 billion parameters. The DeepSeek LLM base models are pre-trained on a dataset that currently holds 2 trillion tokens and is designed to keep growing. According to the authors, the 67B base model outperforms LLaMA-2 70B on standard benchmarks, especially in code, mathematics and reasoning. Supervised fine-tuning followed by direct preference optimization then produces the chat versions, and the 67B chat model outperforms GPT-3.5 in open-ended tests. A sympathetic reader would care because the work shows how open projects can pursue steady, long-horizon scaling to narrow the gap with proprietary systems using publicly described methods and data growth.

Core claim

Guided by our distinctive findings on scaling laws, we train DeepSeek LLM base models in 7B and 67B configurations on a dataset that currently consists of 2 trillion tokens and is continuously expanding. We further apply supervised fine-tuning and direct preference optimization to produce DeepSeek Chat models. Evaluation shows that DeepSeek LLM 67B surpasses LLaMA-2 70B across various benchmarks with particular strength in code, mathematics and reasoning, while open-ended evaluations indicate that DeepSeek LLM 67B Chat exhibits superior performance compared to GPT-3.5.

What carries the argument

Distinctive findings on scaling laws that guide effective training in 7B and 67B sizes, implemented through a continuously expanding 2 trillion token dataset plus supervised fine-tuning and direct preference optimization.

If this is right

DeepSeek LLM 67B records higher scores than LLaMA-2 70B on standard benchmarks, especially those involving code, mathematics and reasoning.
The 67B chat model achieves better results than GPT-3.5 when evaluated on open-ended tasks.
The same scaling approach with ongoing data growth can be applied to produce further improvements in open-source models at these sizes.
Long-term expansion of the training dataset supports continued progress without requiring changes to the core training configuration.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the identified scaling patterns persist, further growth of the token dataset beyond the current 2 trillion could yield additional performance lifts in the same model sizes.
Open projects following this data-first, long-horizon route may gradually close capability gaps with closed models on reasoning-heavy tasks.
Re-running the comparisons on entirely new benchmark suites would test whether the observed advantages generalize beyond the reported set.
The emphasis on sustained data collection could encourage similar multi-year efforts in other open-source language-model initiatives.

Load-bearing premise

The selected benchmarks and open-ended evaluations measure genuine model capability without undisclosed overlap in training data or advantages in methodology.

What would settle it

Independent re-testing on a fresh set of benchmarks withheld from the original evaluation that shows DeepSeek LLM 67B no longer outperforming LLaMA-2 70B or its chat version no longer exceeding GPT-3.5.

Original abstract

The rapid development of open-source large language models (LLMs) has been truly remarkable. However, the scaling law described in previous literature presents varying conclusions, which casts a dark cloud over scaling LLMs. We delve into the study of scaling laws and present our distinctive findings that facilitate scaling of large scale models in two commonly used open-source configurations, 7B and 67B. Guided by the scaling laws, we introduce DeepSeek LLM, a project dedicated to advancing open-source language models with a long-term perspective. To support the pre-training phase, we have developed a dataset that currently consists of 2 trillion tokens and is continuously expanding. We further conduct supervised fine-tuning (SFT) and Direct Preference Optimization (DPO) on DeepSeek LLM Base models, resulting in the creation of DeepSeek Chat models. Our evaluation results demonstrate that DeepSeek LLM 67B surpasses LLaMA-2 70B on various benchmarks, particularly in the domains of code, mathematics, and reasoning. Furthermore, open-ended evaluations reveal that DeepSeek LLM 67B Chat exhibits superior performance compared to GPT-3.5.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

DeepSeek's 67B model is a worthwhile open-source release that delivers on its performance claims with sufficiently transparent methods.


The headline result of the DeepSeek LLM paper is that the 67B base model outperforms LLaMA-2 70B on code, mathematics, and reasoning benchmarks, and that the chat version does better than GPT-3.5 in open-ended evaluations. The authors trained 7B and 67B variants on a 2-trillion-token dataset, used scaling-law experiments to inform the process, and then applied SFT and DPO. What stands out is how the full manuscript details the dataset construction, the scaling-law fits for the two sizes, and the evaluation setup with few-shot prompts and decontamination. The reported performance deltas are consistent with the model scale and data volume, and the scaling curves show no internal contradictions.

The soft spots are limited. The scaling-law observations mostly line up with the existing literature rather than introducing a new paradigm, and the open-ended evaluations, while useful, can be sensitive to how the tests are run. More public ablations on data mixtures would strengthen the paper, but nothing load-bearing looks wrong.

The work is aimed at researchers and developers working with open-source LLMs, especially those focused on code and reasoning tasks. The training choices are clearly reasoned and the methods include reproducible elements, so it deserves to go through peer review, both for broader feedback and to highlight the model releases.

Referee Report

2 major / 3 minor

Summary. The paper introduces DeepSeek LLM, an open-source project focused on long-term scaling of LLMs. It reports empirical studies of scaling laws for 7B and 67B models, describes pre-training the base models on a 2-trillion-token dataset that continues to grow, and applies supervised fine-tuning (SFT) followed by Direct Preference Optimization (DPO) to create chat variants. The central empirical claims are that DeepSeek LLM 67B outperforms LLaMA-2 70B on code, mathematics, and reasoning benchmarks, and that the 67B Chat model shows superior performance to GPT-3.5 in open-ended evaluations.

Significance. If the benchmark results hold under scrutiny, the work is significant for advancing reproducible open-source LLMs by releasing competitive 67B-scale models trained with explicit long-term data scaling. The inclusion of scaling-law experiments, dataset construction details, and decontamination protocols in the methods section provides a useful reference for the community and supports the reported performance deltas.

major comments (2)

[Evaluation] Open-ended evaluation section: the claim that DeepSeek LLM 67B Chat exhibits superior performance to GPT-3.5 rests on unspecified details of the evaluation protocol (prompting strategy, judge model or human raters, and any agreement metrics). Without these, the result cannot be independently verified and is load-bearing for the chat-model contribution.
[§5] Benchmark results (tables in §5): while decontamination steps are described, the paper does not report the fraction of test-set overlap removed or provide before/after scores; this leaves open the possibility that domain-specific gains (code/math) partly reflect data leakage rather than model capability.
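
The overlap fraction requested in the second comment could be measured along the following lines. This is a minimal sketch, assuming whitespace tokenization and a word-level 10-gram match rule; the function names, the n-gram length, and the toy data are hypothetical and do not describe the paper's actual decontamination procedure.

```python
def ngrams(text, n):
    """Return the set of word-level n-grams in text (whitespace tokenization)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def contaminated_fraction(test_items, corpus_docs, n=10):
    """Fraction of test items sharing at least one n-gram with the corpus."""
    corpus_grams = set()
    for doc in corpus_docs:
        corpus_grams |= ngrams(doc, n)
    flagged = [item for item in test_items if ngrams(item, n) & corpus_grams]
    return len(flagged) / len(test_items) if test_items else 0.0


# Toy data (hypothetical): only the first test item overlaps the corpus.
corpus = ["the quick brown fox jumps over the lazy dog near the river bank today"]
tests = [
    "quick brown fox jumps over the lazy dog near the river bank",
    "an entirely different question about prime numbers and modular arithmetic here",
]
fraction = contaminated_fraction(tests, corpus, n=10)
print(fraction)
```

Reporting this fraction per benchmark category, alongside the scores, would let a reader judge how much of a code or math gain could plausibly come from leakage.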

minor comments (3)

[Scaling Laws] Figure captions for scaling curves should explicitly list the fitted exponents and any confidence intervals; current plots are difficult to reproduce from the text alone.
[Abstract] The abstract uses 'longtermism' without definition; a one-sentence gloss would improve accessibility for readers outside the immediate subfield.
[Evaluation] Several benchmark tables lack standard deviations or number of runs; adding these would strengthen the statistical interpretation of the reported deltas over LLaMA-2.
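
The fitted exponents that the first minor comment asks for are usually obtained from a linear fit in log-log space. The sketch below assumes a pure power law y = a * x^b and uses synthetic data points; the numbers are made up for illustration and are not taken from the paper.

```python
import math


def fit_power_law(xs, ys):
    """Least-squares fit of y = a * x**b via linear regression on (log x, log y)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(lx)
    mean_x = sum(lx) / n
    mean_y = sum(ly) / n
    sxx = sum((u - mean_x) ** 2 for u in lx)
    sxy = sum((u - mean_x) * (v - mean_y) for u, v in zip(lx, ly))
    b = sxy / sxx                      # fitted exponent
    a = math.exp(mean_y - b * mean_x)  # fitted coefficient
    return a, b


# Synthetic compute budgets and "optimal" values generated from y = 0.3 * x**0.5.
compute = [1e17, 1e18, 1e19, 1e20]
optimum = [0.3 * c ** 0.5 for c in compute]
a, b = fit_power_law(compute, optimum)
print(f"coefficient={a:.3f}, exponent={b:.3f}")
```

With real, noisy measurements the same fit could be repeated on resampled points to obtain the confidence intervals the comment mentions.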

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the constructive feedback and positive overall assessment. We address each major comment below, indicating where revisions will be made to enhance verifiability and transparency.


Referee: [Evaluation] Open-ended evaluation section: the claim that DeepSeek LLM 67B Chat exhibits superior performance to GPT-3.5 rests on unspecified details of the evaluation protocol (prompting strategy, judge model or human raters, and any agreement metrics). Without these, the result cannot be independently verified and is load-bearing for the chat-model contribution.

Authors: We agree that full specification of the evaluation protocol is necessary for independent verification of the open-ended results. In the revised manuscript we will expand the relevant section to detail the prompting strategy, the judge model employed, the involvement of human raters (if any), and quantitative agreement metrics such as inter-rater reliability scores. These additions will directly support the claim of superior performance relative to GPT-3.5.

revision: yes

Referee: [§5] Benchmark results (tables in §5): while decontamination steps are described, the paper does not report the fraction of test-set overlap removed or provide before/after scores; this leaves open the possibility that domain-specific gains (code/math) partly reflect data leakage rather than model capability.

Authors: We acknowledge the value of quantifying the decontamination impact. We will revise the methods and results sections to report the fraction of test-set overlap removed for each benchmark category. However, providing complete before/after benchmark scores would require retraining the 67B model on the full 2-trillion-token corpus without decontamination, which is computationally prohibitive. We will instead clarify that the described decontamination procedure was applied uniformly and that performance advantages appear consistently across diverse benchmarks.

revision: partial
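
One agreement metric of the kind promised in the first response is Cohen's kappa between two raters, or between a judge model and a human rater. The sketch below uses hypothetical labels; neither the paper nor the rebuttal states which metric, if any, was or will be used.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: share of items given the same label by both raters.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    # Chance agreement from each rater's marginal label frequencies.
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label
    return (observed - expected) / (1.0 - expected)


# Hypothetical pairwise verdicts: "A" = model A preferred, "B" = model B preferred.
rater_1 = ["A", "A", "B", "B", "A", "B", "A", "A"]
rater_2 = ["A", "B", "B", "B", "A", "B", "A", "B"]
kappa = cohens_kappa(rater_1, rater_2)
print(round(kappa, 3))
```

Kappa corrects raw agreement for the agreement expected by chance, which matters when one verdict dominates the label distribution.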

standing simulated objections not resolved

Provision of before/after benchmark scores comparing models trained with and without decontamination, due to the prohibitive computational cost of retraining at 2-trillion-token scale.

Circularity Check

0 steps flagged

No significant circularity; empirical benchmark results are self-contained

full rationale

The paper's core claims consist of observed performance deltas on external benchmarks (code, math, reasoning, open-ended chat) after training a 67B model on an expanding 2T-token corpus followed by SFT+DPO. Scaling-law experiments are described as guiding dataset and model choices but do not reduce any reported result to a fitted parameter renamed as a prediction; the evaluation protocols, decontamination steps, and few-shot settings are stated explicitly and independently of the final scores. No self-definitional equations, load-bearing self-citations, or ansatz smuggling appear in the derivation chain.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

The central claims rest on empirical training runs and benchmark comparisons rather than new theoretical derivations; the main unstated inputs are standard assumptions about scaling laws and the effectiveness of SFT/DPO.

free parameters (2)

model scale: 7B and 67B sizes selected after scaling-law study
pre-training data volume: 2 trillion tokens assembled for the reported runs
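
For a sense of scale, the two free parameters can be combined into a rough training-compute estimate. The sketch below assumes the widely used rule of thumb C ≈ 6 * N * D FLOPs (N parameters, D tokens) from the general scaling-law literature; it is not necessarily the compute measure the paper itself adopts, and the function name is hypothetical.

```python
def approx_training_flops(n_params, n_tokens):
    """Rule-of-thumb training compute: about 6 FLOPs per parameter per token."""
    return 6.0 * n_params * n_tokens


TOKENS = 2e12  # 2 trillion tokens, as stated in the abstract
estimates = {}
for name, params in [("7B", 7e9), ("67B", 67e9)]:
    estimates[name] = approx_training_flops(params, TOKENS)
    print(f"{name}: ~{estimates[name]:.2e} FLOPs")
```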

axioms (2)

domain assumption: Scaling laws reliably predict performance gains with increased model size and data. The paper states that it studied scaling laws to guide the 7B/67B choices.
domain assumption: SFT followed by DPO produces aligned chat models that generalize on benchmarks. This is used to create the Chat variants whose superiority is claimed.
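
For readers unfamiliar with the second step, the per-example DPO objective can be sketched as follows. This is the standard loss from the DPO literature, written with plain floats; the log-probabilities and the value of beta are hypothetical placeholders, and the paper's actual hyperparameters are not given in this summary.

```python
import math


def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Per-example DPO loss: -log sigmoid(beta * (chosen margin - rejected margin))."""
    chosen_margin = logp_chosen - ref_logp_chosen        # policy vs. reference, preferred answer
    rejected_margin = logp_rejected - ref_logp_rejected  # policy vs. reference, dispreferred answer
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(z)) == log(1 + exp(-z)), evaluated in a numerically stable form.
    if logits >= 0:
        return math.log1p(math.exp(-logits))
    return -logits + math.log1p(math.exp(logits))


# Hypothetical sequence log-probabilities.
loss_equal = dpo_loss(-10.0, -10.0, -10.0, -10.0)  # policy identical to the reference
loss_better = dpo_loss(-8.0, -12.0, -10.0, -10.0)  # policy favours the preferred answer
print(loss_equal, loss_better)
```

The loss falls as the policy raises the preferred answer's likelihood relative to the reference model and lowers that of the dispreferred answer.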

pith-pipeline@v0.9.0 · 5848 in / 1451 out tokens · 62866 ms · 2026-05-11T06:03:01.583359+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith.Foundation.DAlembert.Inevitability bilinear_family_forced
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: Our evaluation results demonstrate that DeepSeek LLM 67B surpasses LLaMA-2 70B on various benchmarks, particularly in the domains of code, mathematics, and reasoning.

IndisputableMonolith.Foundation.PhiForcing phi_equation
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: Guided by the scaling laws, we introduce DeepSeek LLM, a project dedicated to advancing open-source language models with a long-term perspective.

IndisputableMonolith.Foundation.LedgerForcing conservation_from_balance
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: To support the pre-training phase, we have developed a dataset that currently consists of 2 trillion tokens

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 44 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Large Language Diffusion Models
cs.CL 2025-02 unverdicted novelty 8.0

LLaDA is a scalable diffusion-based language model that matches autoregressive LLMs like LLaMA3 8B on tasks and surpasses GPT-4o on reversal poem completion.
IndicMedDialog: A Parallel Multi-Turn Medical Dialogue Dataset for Accessible Healthcare in Indic Languages
cs.CL 2026-05 unverdicted novelty 7.0

A parallel multi-turn medical dialogue dataset spanning English and nine Indic languages is created from synthetic consultations to enable personalized AI healthcare interactions.
ENEC: A Lossless AI Model Compression Method Enabling Fast Inference on Ascend NPUs
cs.AR 2026-03 unverdicted novelty 7.0

ENEC delivers 3.43X higher throughput than DietGPU and 1.12X better compression ratio than nvCOMP for lossless model weight compression on Ascend NPUs, yielding up to 6.3X end-to-end inference speedup.
Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation
cs.CV 2024-06 conditional novelty 7.0

Scaled vanilla autoregressive models based on Llama achieve 2.18 FID on ImageNet 256x256 image generation, beating popular diffusion models without visual inductive biases.
DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model
cs.CL 2024-05 unverdicted novelty 7.0

DeepSeek-V2 delivers top-tier open-source LLM performance using only 21B active parameters by compressing the KV cache 93.3% and cutting training costs 42.5% via MLA and DeepSeekMoE.
Know When To Fold 'Em: Token-Efficient LLM Synthetic Data Generation via Multi-Stage In-Flight Rejection
cs.AI 2026-05 unverdicted novelty 6.0

MSIFR stops faulty LLM generations early via staged rule-based checks, reducing token consumption 11-78% with no accuracy loss.
PEML: Parameter-efficient Multi-Task Learning with Optimized Continuous Prompts
cs.CL 2026-05 unverdicted novelty 6.0

PEML co-optimizes continuous prompts and low-rank adaptations to deliver up to 6.67% average accuracy gains over existing multi-task PEFT methods on GLUE, SuperGLUE, and other benchmarks.
SAGE: Scalable Automated Robustness Augmentation for LLM Knowledge Evaluation
cs.CL 2026-05 unverdicted novelty 6.0

SAGE trains a rubric-based verifier and an RL-optimized generator on seed human data to scalably augment LLM knowledge benchmarks, matching human-annotated quality on HellaSwag at lower cost and generalizing to MMLU.
Causal Bias Detection in Generative Artificial Intelligence
cs.AI 2026-05 unverdicted novelty 6.0

A causal framework unifies fairness analysis across generative AI and standard ML by deriving decompositions that separate biases along causal pathways and differences between real-world and model mechanisms.
Navigating the Sea of LLM Evaluation: Investigating Bias in Toxicity Benchmarks
cs.AI 2026-05 unverdicted novelty 6.0

Toxicity benchmarks for LLMs produce inconsistent results when task type, input domain, or model changes, revealing intrinsic evaluation biases.
Training continuously-coupled reconfigurable photonic chips with quantum machine learning
quant-ph 2026-05 unverdicted novelty 6.0

A black-box machine learning technique trains continuously-coupled photonic waveguide arrays to implement target unitaries using limited single- and two-photon measurements without requiring detailed internal models.
Predicting Large Model Test Losses with a Noisy Quadratic System
cs.LG 2026-05 unverdicted novelty 6.0

A noisy quadratic system predicts large model test losses from N, B, K and outperforms Chinchilla's model for extrapolation up to 1000x compute.
DSPE: An Energy-Efficient Edge Processor for DeepSeek Inference with MerkleTree-based Incremental Pruning, Multi-Stage Boothing Lookup and Dynamic Adaptive Posit Processing
cs.AR 2026-05 unverdicted novelty 6.0

DSPE is an edge processor that achieves 109.4 TFLOPS/W for DeepSeek inference using Merkle tree-based incremental pruning, multi-stage boothing lookup, and dynamic adaptive posit processing.
RELO: Reinforcement Learning to Localize for Visual Object Tracking
cs.CV 2026-05 unverdicted novelty 6.0

RELO replaces handcrafted spatial priors with a reinforcement learning policy for target localization in visual tracking and reports 57.5% AUC on LaSOText without template updates.
Why Does Agentic Safety Fail to Generalize Across Tasks?
cs.LG 2026-05 conditional novelty 6.0

Agentic safety fails to generalize across tasks because the task-to-safe-controller mapping has a higher Lipschitz constant than the task-to-controller mapping alone, as proven in linear-quadratic control and demonstr...
InfoLaw: Information Scaling Laws for Large Language Models with Quality-Weighted Mixture Data and Repetition
cs.CL 2026-05 unverdicted novelty 6.0

InfoLaw models pretraining as information accumulation where quality sets information density and repetition causes scale-dependent diminishing returns, predicting loss with low error on unseen mixtures and larger sca...
Rethinking LLM Ensembling from the Perspective of Mixture Models
cs.LG 2026-05 unverdicted novelty 6.0

ME reinterprets LLM ensembling as a mixture model by sampling a single model stochastically at each token step, matching the ensemble distribution while invoking only one model per step for substantial speed gains.
ReaGeo: Reasoning-Enhanced End-to-End Geocoding with LLMs
cs.AI 2026-04 unverdicted novelty 6.0

ReaGeo is an end-to-end LLM framework for geocoding that uses geohash text generation, Chain-of-Thought spatial reasoning, and distance-based RL to accurately predict points and regions from explicit and vague queries.
Universally Empowering Zeroth-Order Optimization via Adaptive Layer-wise Sampling
cs.LG 2026-04 unverdicted novelty 6.0

AdaLeZO uses a non-stationary multi-armed bandit to adaptively allocate perturbation budget across layers in zeroth-order optimization and applies inverse probability weighting to reduce variance while preserving unbi...
Towards Faster Language Model Inference Using Mixture-of-Experts Flow Matching
cs.AI 2026-04 unverdicted novelty 6.0

Mixture-of-experts flow matching enables non-autoregressive language models to achieve autoregressive-level quality in three sampling steps, delivering up to 1000x faster inference than diffusion models.
Dataset-Level Metrics Attenuate Non-Determinism: A Fine-Grained Non-Determinism Evaluation in Diffusion Language Models
cs.LG 2026-04 unverdicted novelty 6.0

Dataset-level metrics in diffusion language models mask substantial sample-level non-determinism that varies with model and system factors, which a new Factor Variance Attribution metric can decompose.
AFGNN: API Misuse Detection using Graph Neural Networks and Clustering
cs.SE 2026-04 unverdicted novelty 6.0

AFGNN detects API misuses in Java code more effectively than prior methods by representing usage as graphs and clustering learned embeddings from self-supervised training.
Muon is Scalable for LLM Training
cs.LG 2025-02 unverdicted novelty 6.0

Muon optimizer with weight decay and update scaling achieves ~2x efficiency over AdamW for large LLMs, shown via the Moonlight 3B/16B MoE model trained on 5.7T tokens.
MiniCPM: Unveiling the Potential of Small Language Models with Scalable Training Strategies
cs.CL 2024-04 conditional novelty 6.0

MiniCPM 1.2B and 2.4B models reach parity with 7B-13B LLMs via model wind-tunnel scaling and a WSD scheduler that yields a higher optimal data-to-model ratio than Chinchilla scaling.
Are We on the Right Way for Evaluating Large Vision-Language Models?
cs.CV 2024-03 conditional novelty 6.0

Current LVLM benchmarks overestimate capabilities because many questions can be answered without images due to design flaws or data leakage; MMStar is a human-curated set of 1,500 vision-indispensable samples across 6...
LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code
cs.SE 2024-03 unverdicted novelty 6.0

LiveCodeBench collects 400 recent contest problems to create a contamination-free benchmark evaluating LLMs on code generation and related capabilities like self-repair and execution.
Handling and Interpreting Missing Modalities in Patient Clinical Trajectories via Autoregressive Sequence Modeling
cs.LG 2026-04 unverdicted novelty 5.0

Autoregressive transformer modeling with missingness-aware contrastive pre-training outperforms baselines on MIMIC-IV and eICU benchmarks and mitigates divergent behavior from removed modalities in clinical trajectories.
Why Do Vision Language Models Struggle To Recognize Human Emotions?
cs.CV 2026-04 unverdicted novelty 5.0

VLMs fail at dynamic facial expression recognition because web-scale pretraining exacerbates long-tailed class bias and sparse frame sampling misses micro-expressions; a multi-stage context enrichment strategy using l...
Reconstruction of a 3D wireframe from a single line drawing via generative depth estimation
cs.CV 2026-04 unverdicted novelty 5.0

A latent diffusion model conditioned on line drawings estimates dense depth to reconstruct 3D wireframes, reporting 5.3% average depth error after training on over one million pairs.
The Cognitive Circuit Breaker: A Systems Engineering Framework for Intrinsic AI Reliability
cs.SE 2026-04 unverdicted novelty 5.0

The Cognitive Circuit Breaker detects LLM hallucinations by computing the Cognitive Dissonance Delta between semantic confidence and latent certainty from hidden states, adding negligible overhead.
RefineRAG: Word-Level Poisoning Attacks via Retriever-Guided Text Refinement
cs.CR 2026-04 unverdicted novelty 5.0

RefineRAG achieves 90% attack success on NQ by generating toxic seeds then optimizing them via retriever-in-the-loop word refinement, outperforming prior methods on effectiveness and naturalness.
Tug-of-War within A Decade: Conflict Resolution in Vulnerability Analysis via Teacher-Guided Retrieval-Augmented Generations
cs.CL 2026-03 unverdicted novelty 5.0

CRVA-TGRAG combines parent-document segmentation, ensemble retrieval, and teacher-guided fine-tuning to mitigate knowledge conflicts and improve accuracy in LLM-based CVE vulnerability analysis.
Towards Reasoning Era: A Survey of Long Chain-of-Thought for Reasoning Large Language Models
cs.AI 2025-03 unverdicted novelty 5.0

The paper unifies perspectives on Long CoT in reasoning LLMs by introducing a taxonomy, detailing characteristics of deep reasoning and reflection, and discussing emergence phenomena and future directions.
DeepSeek-Coder: When the Large Language Model Meets Programming -- The Rise of Code Intelligence
cs.SE 2024-01 unverdicted novelty 5.0

DeepSeek-Coder open-source models trained on 2T code tokens with fill-in-the-blank pretraining achieve SOTA results among open models and surpass closed-source Codex and GPT-3.5 on code benchmarks.
DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models
cs.CL 2024-01 unverdicted novelty 5.0

DeepSeekMoE 2B matches GShard 2.9B performance and approaches a dense 2B model; the 16B version matches LLaMA2-7B at 40% compute by using fine-grained expert segmentation plus shared experts.
Reinforcement Learning for Compositional Generalization with Outcome-Level Optimization
cs.LG 2026-05 unverdicted novelty 4.0

Outcome-level RL with binary or composite rewards improves compositional generalization over supervised fine-tuning by avoiding overfitting to frequent training patterns.
Agentic Application in Power Grid Static Analysis: Automatic Code Generation and Error Correction
eess.SY 2026-04 unverdicted novelty 4.0

An LLM agent with static pre-check, dynamic feedback, and semantic validation generates MATPOWER code from natural language for power grid analysis at 82.38% fidelity.
Identifying Topological Invariants of Non-Hermitian Systems via Domain-Adaptive Multimodal Model for Mathematics
cond-mat.other 2026-04 unverdicted novelty 4.0

A multimodal model with Qwen Math backbone identifies topological invariants of non-Hermitian systems from eigenvalues and eigenvectors in momentum space.
Data Mixing for Large Language Models Pretraining: A Survey and Outlook
cs.CL 2026-03 accept novelty 4.0

A survey that taxonomizes data mixing strategies for LLM pretraining into static rule-based, learning-based, and dynamic adaptive families while highlighting transferability challenges and evaluation gaps.
How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites
cs.CV 2024-04 unverdicted novelty 4.0

InternVL 1.5 narrows the performance gap to proprietary multimodal models via a stronger transferable vision encoder, dynamic high-resolution tiling, and curated English-Chinese training data.
DeepSeek-VL: Towards Real-World Vision-Language Understanding
cs.AI 2024-03 unverdicted novelty 4.0

DeepSeek-VL develops open-source 1.3B and 7B vision-language models that achieve competitive or state-of-the-art results on real-world visual-language benchmarks through diverse data curation, a hybrid vision encoder,...
TinyLlama: An Open-Source Small Language Model
cs.CL 2024-01 accept novelty 4.0

TinyLlama is a 1.1B-parameter open-source language model pretrained on 1 trillion tokens that outperforms other open-source models of similar size on downstream tasks.
Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling
cs.AI 2025-01 conditional novelty 3.0

Scaling data, model size, and training optimization on the Janus architecture yields better multimodal understanding and more stable, instruction-following text-to-image generation.
A Survey on Large Language Models for Code Generation
cs.CL 2024-06 unverdicted novelty 3.0

A systematic literature review that organizes recent work on LLMs for code generation into a taxonomy covering data curation, model advances, evaluations, ethics, environmental impact, and applications, with benchmark...

Reference graph

Works this paper leans on

128 extracted references · 128 canonical work pages · cited by 44 Pith papers · 31 internal anchors

[2]

Introducing Claude , 2023

Anthropic. Introducing Claude , 2023. URL https://www.anthropic.com/index/introducing-claude

work page 2023
[6]

T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. M. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, and D. Amod...

work page 2020
[7]

M. Chen, J. Tworek, H. Jun, Q. Yuan, H. P. de Oliveira Pinto, J. Kaplan, H. Edwards, Y. Burda, N. Joseph, G. Brockman, A. Ray, R. Puri, G. Krueger, M. Petrov, H. Khlaaf, G. Sastry, P. Mishkin, B. Chan, S. Gray, N. Ryder, M. Pavlov, A. Power, L. Kaiser, M. Bavarian, C. Winter, P. Tillet, F. P. Such, D. Cummings, M. Plappert, F. Chantzis, E. Barnes, A. Herb...

work page internal anchor Pith review Pith/arXiv arXiv 2021
[10]

Computer

T. Computer. Redpajama: an open dataset for training large language models, 2023. URL https://github.com/togethercomputer/RedPajama-Data

work page 2023
[12]

T. Dao. Flash A ttention-2: Faster attention with better parallelism and work partitioning. 2023

work page 2023
[13]

T. Dao, D. Y. Fu, S. Ermon, A. Rudra, and C. R \'e . Flash A ttention: Fast and memory-efficient exact attention with IO -awareness. In Advances in Neural Information Processing Systems, 2022

work page 2022
[14]

Z. Du, Y. Qian, X. Liu, M. Ding, J. Qiu, Z. Yang, and J. Tang. Glm: General language model pretraining with autoregressive blank infilling. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 320--335, 2022

work page 2022
[16]

L. Gao, S. Biderman, S. Black, L. Golding, T. Hoppe, C. Foster, J. Phang, H. He, A. Thite, N. Nabeshima, et al. The Pile: An 800GB dataset of diverse text for language modeling. arXiv preprint arXiv:2101.00027, 2020
[17]

Google. An important next step on our AI journey, 2023. URL https://blog.google/technology/ai/bard-google-ai-search-updates/
[24]

High-flyer. HAI-LLM: 高效且轻量的大模型训练工具 [An efficient and lightweight training tool for large models], 2023. URL https://www.high-flyer.cn/en/blog/hai-llm
[26]

Y. Huang, Y. Bai, Z. Zhu, J. Zhang, J. Zhang, T. Su, J. Liu, C. Lv, Y. Zhang, J. Lei, et al. C-Eval: A multi-level multi-discipline Chinese evaluation suite for foundation models. arXiv preprint arXiv:2305.08322, 2023
[27]

Huggingface Team. Tokenizers: Fast state-of-the-art tokenizers optimized for research and production, 2019. URL https://github.com/huggingface/tokenizers
[28]

F. Shi, M. Suzgun, M. Freitag, X. Wang, S. Srivats, S. Vosoughi, H. W. Chung, Y. Tay, S. Ruder, D. Zhou, D. Das, and J. Wei. Language models are multilingual chain-of-thought reasoners. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. OpenReview.net, 2023. URL https://openreview.net/pdf?id=fR3wGCk-IXp
[29]

H. Ivison, Y. Wang, V. Pyatkin, N. Lambert, M. Peters, P. Dasigi, J. Jang, D. Wadden, N. A. Smith, I. Beltagy, and H. Hajishirzi. Camels in a changing climate: Enhancing LM adaptation with Tulu 2. 2023
[33]

V. A. Korthikanti, J. Casper, S. Lym, L. McAfee, M. Andersch, M. Shoeybi, and B. Catanzaro. Reducing activation recomputation in large transformer models. Proceedings of Machine Learning and Systems, 5, 2023

[35]

W. Kwon, Z. Li, S. Zhuang, Y. Sheng, L. Zheng, C. H. Yu, J. E. Gonzalez, H. Zhang, and I. Stoica. Efficient memory management for large language model serving with PagedAttention. In Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles, 2023
[37]

H. Li, Y. Zhang, F. Koto, Y. Yang, H. Zhao, Y. Gong, N. Duan, and T. Baldwin. CMMLU: Measuring massive multitask language understanding in Chinese. arXiv preprint arXiv:2306.09212, 2023
[38]

W. Li, F. Qi, M. Sun, X. Yi, and J. Zhang. CCPM: A Chinese classical poetry matching dataset, 2021
[43]

T. Mihaylov, P. Clark, T. Khot, and A. Sabharwal. Can a suit of armor conduct electricity? A new dataset for open book question answering, 2018
[44]

D. Narayanan, M. Shoeybi, J. Casper, P. LeGresley, M. Patwary, V. Korthikanti, D. Vainbrand, P. Kashinkunti, J. Bernauer, B. Catanzaro, et al. Efficient large-scale language model training on GPU clusters using Megatron-LM. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pages 1–15, 2021
[45]

OpenAI. Introducing ChatGPT, 2022. URL https://openai.com/blog/chatgpt
[46]

OpenAI. GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2023
[47]

L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray, et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744, 2022
[49]

A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever, et al. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019
[50]

R. Rafailov, A. Sharma, E. Mitchell, S. Ermon, C. D. Manning, and C. Finn. Direct preference optimization: Your language model is secretly a reward model. 2023
[51]

S. Rajbhandari, J. Rasley, O. Ruwase, and Y. He. ZeRO: Memory optimizations toward training trillion parameter models. In SC20: International Conference for High Performance Computing, Networking, Storage and Analysis, pages 1–16. IEEE, 2020
[52]

K. Sakaguchi, R. L. Bras, C. Bhagavatula, and Y. Choi. WinoGrande: An adversarial Winograd schema challenge at scale, 2019
[53]

C. J. Shallue, J. Lee, J. Antognini, J. Sohl-Dickstein, R. Frostig, and G. E. Dahl. Measuring the effects of data parallelism on neural network training. Journal of Machine Learning Research, 20(112):1–49, 2019
[58]

J. Su, M. Ahmed, Y. Lu, S. Pan, W. Bo, and Y. Liu. RoFormer: Enhanced transformer with rotary position embedding. Neurocomputing, 568:127063, 2024
[59]

K. Sun, D. Yu, D. Yu, and C. Cardie. Investigating prior knowledge for challenging Chinese machine reading comprehension, 2019
[63]

A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin. Attention is all you need. Advances in Neural Information Processing Systems, 30, 2017
[65]

J. Wei, X. Wang, D. Schuurmans, M. Bosma, B. Ichter, F. Xia, E. H. Chi, Q. V. Le, and D. Zhou. Chain-of-thought prompting elicits reasoning in large language models. In NeurIPS, 2022. URL http://papers.nips.cc/paper_files/paper/2022/hash/9d5609613524ecf4f15af0f7b31abca4-Abstract-Conference.html
[66]

T. Wei, J. Luan, W. Liu, S. Dong, and B. Wang. CMATH: Can your language model pass Chinese elementary school math test?, 2023
[68]

A. Yang, B. Xiao, B. Wang, B. Zhang, C. Yin, C. Lv, D. Pan, D. Wang, D. Yan, F. Yang, F. Deng, F. Wang, F. Liu, G. Ai, G. Dong, H. Zhao, H. Xu, H. Sun, H. Zhang, H. Liu, J. Ji, J. Xie, J. Dai, K. Fang, L. Su, L. Song, L. Liu, L. Ru, L. Ma, M. Wang, M. Liu, M. Lin, N. Nie, P. Guo, R. Sun, T. Zhang, T. Li, T. Li, W. Cheng, W. Chen, X. Zeng, X. Wang, X. Chen...

[71]

B. Zhang and R. Sennrich. Root mean square layer normalization. Advances in Neural Information Processing Systems, 32, 2019
[72]

G. Zhang, L. Li, Z. Nado, J. Martens, S. Sachdeva, G. Dahl, C. Shallue, and R. B. Grosse. Which algorithmic choices matter at which batch sizes? Insights from a noisy quadratic model. Advances in Neural Information Processing Systems, 32, 2019
[74]

L. Zheng, W.-L. Chiang, Y. Sheng, S. Zhuang, Z. Wu, Y. Zhuang, Z. Lin, Z. Li, D. Li, E. P. Xing, H. Zhang, J. E. Gonzalez, and I. Stoica. Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. 2023
[77]

Freda Shi, Mirac Suzgun, Markus Freitag, Xuezhi Wang, Suraj Srivats, Soroush Vosoughi, Hyung Won Chung, Yi Tay, Sebastian Ruder, Denny Zhou, Dipanjan Das, and Jason Wei. Language models are multilingual chain-of-thought reasoners. In The Eleventh International Conference on Learning Representations, 2023
[78]

RoFormer: Enhanced transformer with rotary position embedding. Neurocomputing, 2024
[79]

GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints. arXiv preprint arXiv:2305.13245
[80]

WizardMath: Empowering mathematical reasoning for large language models via reinforced Evol-Instruct. arXiv preprint arXiv:2308.09583
[81]

Zhibin Gou, Zhihong Shao, Yeyun Gong, Yelong Shen, Yujiu Yang, Minlie Huang, Nan Duan, and Weizhu Chen. ToRA: A tool-integrated reasoning agent for mathematical problem solving. CoRR, 2023. doi:10.48550/ARXIV.2309.17452
[82]

Wenhu Chen, Xueguang Ma, Xinyi Wang, and William W. Cohen. Program of Thoughts Prompting: Disentangling Computation from Reasoning for Numerical Reasoning Tasks. CoRR, 2022. doi:10.48550/ARXIV.2211.12588
[83]

Luyu Gao, Aman Madaan, Shuyan Zhou, Uri Alon, Pengfei Liu, Yiming Yang, Jamie Callan, and Graham Neubig. In International Conference on Machine Learning, 2023
[84]

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed H. Chi, Quoc V. Le, and Denny Zhou. Chain-of-thought prompting elicits reasoning in large language models. In NeurIPS
[85]

Swaroop Mishra, Matthew Finlayson, Pan Lu, Leonard Tang, Sean Welleck, Chitta Baral, Tanmay Rajpurohit, Oyvind Tafjord, Ashish Sabharwal, Peter Clark, and Ashwin Kalyan. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, 2022. doi:10.18653/V1/2022.EMNLP-MAIN.392
[86]

Xiang Yue, Xingwei Qu, Ge Zhang, Yao Fu, Wenhao Huang, Huan Sun, Yu Su, and Wenhu Chen. CoRR, 2023. arXiv preprint arXiv:2309.05653. doi:10.48550/ARXIV.2309.05653
[87]

Longhui Yu, Weisen Jiang, Han Shi, Jincheng Yu, Zhengying Liu, Yu Zhang, James T. Kwok, Zhenguo Li, Adrian Weller, and Weiyang Liu. MetaMath: Bootstrap Your Own Mathematical Questions for Large Language Models. CoRR, 2023. doi:10.48550/ARXIV.2309.12284
[88]

Mandar Joshi, Eunsol Choi, Daniel Weld, and Luke Zettlemoyer. TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension. Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2017. doi:10.18653/v1/P17-1147
[89]

Language Models are Few-Shot Learners. 2020
[90]

Language models are unsupervised multitask learners. OpenAI blog
[91]

OpenAI. Introducing ChatGPT
[92]

HAI-LLM: 高效且轻量的大模型训练工具 [An efficient and lightweight training tool for large models]
[93]

Megatron-LM: Training multi-billion parameter language models using model parallelism. arXiv preprint arXiv:1909.08053
[94]

Efficient large-scale language model training on GPU clusters using Megatron-LM. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis
[95]

Reducing activation recomputation in large transformer models. Proceedings of Machine Learning and Systems
[96]

Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. FlashAttention: Fast and memory-efficient exact attention with IO-awareness. Advances in Neural Information Processing Systems
[97]

Tri Dao. FlashAttention-2: Faster attention with better parallelism and work partitioning
[98]

Efficient Memory Management for Large Language Model Serving with PagedAttention. In Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles
[99]

Attention is all you need. Advances in Neural Information Processing Systems
[100]

ZeRO: Memory optimizations toward training trillion parameter models. In SC20: International Conference for High Performance Computing, Networking, Storage and Analysis, 2020
[101]

CCPM: A Chinese Classical Poetry Matching Dataset. 2021
[102]

Can a Suit of Armor Conduct Electricity? A New Dataset for Open Book Question Answering. 2018
[103]

Anthropic. Introducing Claude
[104]

Google. An important next step on our AI journey
[105]

Investigating Prior Knowledge for Challenging Chinese Machine Reading Comprehension. 2019
[106]

Yiming Cui, Ting Liu, Wanxiang Che, Li Xiao, Zhipeng Chen, Wentao Ma, Shijin Wang, and Guoping Hu. A Span-Extraction Dataset for Chinese Machine Reading Comprehension. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (E... doi:10.18653/v1/d19-1600, 2019
[107]

WinoGrande: An Adversarial Winograd Schema Challenge at Scale. 2019
[108]

CMATH: Can Your Language Model Pass Chinese Elementary School Math Test? 2023
[109]

Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300
[110]

Challenging BIG-Bench tasks and whether chain-of-thought can solve them. arXiv preprint arXiv:2210.09261
[111]

Program synthesis with large language models. arXiv preprint arXiv:2108.07732
[112]

Liang Xu, Hai Hu, Xuanwei Zhang, Lu Li, Chenjie Cao, Yudong Li, Yechen Xu, Kai Sun, Dian Yu, Cong Yu, Yin Tian, Qianqian Dong, Weitang Liu, Bo Shi, Yiming Cui, Junyi Li, Jun Zeng, Rongzhao Wang, Weijian Xie, Yanting Li, Yina Patterson, Zuoyu Tian, Yiwen Zhang, He Zhou, Shaoweihua Liu ... Proceedings of the 28th International Conference on Computational Linguistics, 2020. doi:10.18653/v1/2020.coling-main.419
[113]

Haonan Li, Yixuan Zhang, Fajri Koto, Yifei Yang, Hai Zhao, Yeyun Gong, Nan Duan, and Timothy Baldwin. CMMLU: Measuring massive multitask language understanding in Chinese
[114]

Chujie Zheng, Minlie Huang, and Aixin Sun. ChID. In Proceedings of the 57th Conference of the Association for Computational Linguistics, 2019. doi:10.18653/V1/P19-1075
[115]

Guokun Lai, Qizhe Xie, Hanxiao Liu, Yiming Yang, and Eduard H. Hovy. RACE: Large-scale ReAding comprehension dataset from examinations. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, 2017. doi:10.18653/V1/D17-1082
[116]

Dheeru Dua, Yizhong Wang, Pradeep Dasigi, Gabriel Stanovsky, Sameer Singh, and Matt Gardner. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2019. doi:10.18653/V1/N19-1246
[117]

Yuzhen Huang, Yuzhuo Bai, Zhihao Zhu, Junlei Zhang, Jinghan Zhang, Tangjun Su, Junteng Liu, Chuancheng Lv, Yikai Zhang, Jiayi Lei, et al. C-Eval: A multi-level multi-discipline Chinese evaluation suite for foundation models
[118]

Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, et al. LLaMA: Open and Efficient Foundation Language Models. arXiv preprint arXiv:2302.13971
[119]

Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton Ferrer, et al. Llama 2: Open Foundation and Fine-Tuned Chat Models. 2023. doi:10.48550/arXiv.2307.09288
