DeepSeek LLM: Scaling Open-Source Language Models with Longtermism

A.X. Liu; Bingxuan Wang; Bo Liu; B. Zhang; Chenggang Zhao; Chengqi Deng; Chong Ruan; Damai Dai; Daya Guo; DeepSeek-AI: Xiao Bi

arXiv: 2401.02954 · v1 · submitted 2024-01-05 · cs.CL · cs.AI · cs.LG


DeepSeek-AI: Xiao Bi, Deli Chen, Guanting Chen, Shanhuang Chen, Damai Dai, Chengqi Deng, Honghui Ding, Kai Dong


Qiushi Du, Zhe Fu, Huazuo Gao, Kaige Gao, Wenjun Gao, Ruiqi Ge, Kang Guan, Daya Guo, Jianzhong Guo, Guangbo Hao, Zhewen Hao, Ying He, Wenjie Hu, Panpan Huang, Erhang Li, Guowei Li, Jiashi Li, Yao Li, Y.K. Li, Wenfeng Liang, Fangyun Lin, A.X. Liu, Bo Liu, Wen Liu, XiaoDong Liu, Xin Liu, Yiyuan Liu, Haoyu Lu, Shanghao Lu, Fuli Luo, Shirong Ma, Xiaotao Nie, Tian Pei, Yishi Piao, Junjie Qiu, Hui Qu, Tongzheng Ren, Zehui Ren, Chong Ruan, Zhangli Sha, Zhihong Shao, Junxiao Song, Xuecheng Su, Jingxiang Sun, Yaofeng Sun, Minghui Tang, Bingxuan Wang, Peiyi Wang, Shiyu Wang, Yaohui Wang, Yongji Wang, Tong Wu, Y. Wu, Xin Xie, Zhenda Xie, Ziwei Xie, Yiliang Xiong, Hanwei Xu, R.X. Xu, Yanhong Xu, Dejian Yang, Yuxiang You, Shuiping Yu, Xingkai Yu, B. Zhang, Haowei Zhang, Lecong Zhang, Liyue Zhang, Mingchuan Zhang, Minghua Zhang, Wentao Zhang, Yichao Zhang, Chenggang Zhao, Yao Zhao, Shangyan Zhou, Shunfeng Zhou, Qihao Zhu, Yuheng Zou


Pith reviewed 2026-05-11 06:03 UTC · model grok-4.3

classification: cs.CL · cs.AI · cs.LG

keywords: scaling laws, large language models, open-source models, pre-training dataset, supervised fine-tuning, direct preference optimization, benchmark evaluation, model performance


The pith

DeepSeek LLM 67B surpasses LLaMA-2 70B on code, mathematics and reasoning benchmarks, with its chat version exceeding GPT-3.5 in open-ended evaluations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines scaling laws for large language models and identifies patterns that support effective training at the common open-source sizes of 7 billion and 67 billion parameters. A dataset beginning at 2 trillion tokens and designed to keep growing is used to pre-train the DeepSeek LLM base models. Supervised fine-tuning followed by direct preference optimization then produces chat versions whose performance exceeds that of LLaMA-2 70B on standard benchmarks, especially in code, mathematics and reasoning, while the 67B chat model also outperforms GPT-3.5 in open-ended tests. A sympathetic reader would care because the work shows how open projects can pursue steady, long-horizon scaling to narrow the gap with proprietary systems using publicly described methods and data growth.

Core claim

Guided by our distinctive findings on scaling laws, we train DeepSeek LLM base models in 7B and 67B configurations on a dataset that currently consists of 2 trillion tokens and is continuously expanding. We further apply supervised fine-tuning and direct preference optimization to produce DeepSeek Chat models. Evaluation shows that DeepSeek LLM 67B surpasses LLaMA-2 70B across various benchmarks with particular strength in code, mathematics and reasoning, while open-ended evaluations indicate that DeepSeek LLM 67B Chat exhibits superior performance compared to GPT-3.5.

What carries the argument

Distinctive findings on scaling laws that guide effective training in 7B and 67B sizes, implemented through a continuously expanding 2 trillion token dataset plus supervised fine-tuning and direct preference optimization.

If this is right

DeepSeek LLM 67B records higher scores than LLaMA-2 70B on standard benchmarks, especially those involving code, mathematics and reasoning.
The 67B chat model achieves better results than GPT-3.5 when evaluated on open-ended tasks.
The same scaling approach with ongoing data growth can be applied to produce further improvements in open-source models at these sizes.
Long-term expansion of the training dataset supports continued progress without requiring changes to the core training configuration.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the identified scaling patterns persist, further growth of the token dataset beyond the current 2 trillion could yield additional performance lifts in the same model sizes.
Open projects following this data-first, long-horizon route may gradually close capability gaps with closed models on reasoning-heavy tasks.
Re-running the comparisons on entirely new benchmark suites would test whether the observed advantages generalize beyond the reported set.
The emphasis on sustained data collection could encourage similar multi-year efforts in other open-source language-model initiatives.

Load-bearing premise

The selected benchmarks and open-ended evaluations measure genuine model capability without undisclosed overlap in training data or advantages in methodology.

What would settle it

Independent re-testing on a fresh set of benchmarks withheld from the original evaluation that shows DeepSeek LLM 67B no longer outperforming LLaMA-2 70B or its chat version no longer exceeding GPT-3.5.

Original abstract

The rapid development of open-source large language models (LLMs) has been truly remarkable. However, the scaling law described in previous literature presents varying conclusions, which casts a dark cloud over scaling LLMs. We delve into the study of scaling laws and present our distinctive findings that facilitate scaling of large scale models in two commonly used open-source configurations, 7B and 67B. Guided by the scaling laws, we introduce DeepSeek LLM, a project dedicated to advancing open-source language models with a long-term perspective. To support the pre-training phase, we have developed a dataset that currently consists of 2 trillion tokens and is continuously expanding. We further conduct supervised fine-tuning (SFT) and Direct Preference Optimization (DPO) on DeepSeek LLM Base models, resulting in the creation of DeepSeek Chat models. Our evaluation results demonstrate that DeepSeek LLM 67B surpasses LLaMA-2 70B on various benchmarks, particularly in the domains of code, mathematics, and reasoning. Furthermore, open-ended evaluations reveal that DeepSeek LLM 67B Chat exhibits superior performance compared to GPT-3.5.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

DeepSeek's 67B model is a worthwhile open-source release that delivers on the performance claims with transparent enough methods.


The key result in this DeepSeek LLM paper is that the 67B base model outperforms LLaMA-2 70B on code, mathematics, and reasoning benchmarks, and the chat version does better than GPT-3.5 in open-ended evaluations. They trained both 7B and 67B variants on a 2 trillion token dataset and used scaling-law experiments to inform the process before doing SFT and DPO. What stands out is how the full manuscript details the dataset construction, the scaling-law fits for the two sizes, and the evaluation setup with few-shot prompts and decontamination. The reported performance deltas are consistent with the model scale and data volume, and the scaling curves don't show any internal contradictions.

The soft spots are limited. Their scaling-law observations mostly line up with what's already in the literature rather than introducing a new paradigm, and the open-ended evaluations, while useful, can be sensitive to how the tests are run. More public ablations on data mixtures would strengthen it, but nothing load-bearing looks wrong.

This work is aimed at researchers and developers working with open-source LLMs, especially those focused on code and reasoning tasks. It has clear thinking behind the training choices and reproducible elements in the methods, so it deserves to go through peer review for broader feedback and to highlight the model releases.

Referee Report

2 major / 3 minor

Summary. The paper introduces DeepSeek LLM, an open-source project focused on long-term scaling of LLMs. It reports empirical studies of scaling laws for 7B and 67B models, describes pre-training a base model on a 2-trillion-token dataset that continues to grow, and applies supervised fine-tuning (SFT) followed by Direct Preference Optimization (DPO) to create chat variants. The central empirical claims are that DeepSeek LLM 67B outperforms LLaMA-2 70B on code, mathematics, and reasoning benchmarks, and that the 67B Chat model shows superior performance to GPT-3.5 in open-ended evaluations.

Significance. If the benchmark results hold under scrutiny, the work is significant for advancing reproducible open-source LLMs by releasing competitive 67B-scale models trained with explicit long-term data scaling. The inclusion of scaling-law experiments, dataset construction details, and decontamination protocols in the methods section provides a useful reference for the community and supports the reported performance deltas.

major comments (2)

[Evaluation] Open-ended evaluation section: the claim that DeepSeek LLM 67B Chat exhibits superior performance to GPT-3.5 rests on unspecified details of the evaluation protocol (prompting strategy, judge model or human raters, and any agreement metrics). Without these, the result cannot be independently verified and is load-bearing for the chat-model contribution.
[§5] Benchmark results (tables in §5): while decontamination steps are described, the paper does not report the fraction of test-set overlap removed or provide before/after scores; this leaves open the possibility that domain-specific gains (code/math) partly reflect data leakage rather than model capability.
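
The overlap statistic requested in the second major comment could be computed along the following lines. This is a minimal sketch only: it assumes whitespace tokenization and exact n-gram matching, and the names `ngrams` and `contaminated_fraction` as well as the default `n=10` are hypothetical choices for illustration, not the decontamination procedure used in the paper.

```python
def ngrams(text, n):
    """Return the set of lowercase word n-grams in `text` (whitespace tokens)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def contaminated_fraction(test_items, training_docs, n=10):
    """Fraction of test items sharing at least one n-gram with the training docs.

    Assumption: exact n-gram overlap is used as the contamination signal.
    Real pipelines may normalize text or use fuzzy matching instead.
    """
    if not test_items:
        return 0.0
    train_grams = set()
    for doc in training_docs:
        train_grams |= ngrams(doc, n)
    flagged = [item for item in test_items if ngrams(item, n) & train_grams]
    return len(flagged) / len(test_items)
```

Reporting this fraction per benchmark category, alongside the scores, would let a reader judge how much of a code or math gain could plausibly come from leaked items. At corpus scale the in-memory set would need to be replaced by a hashed or sharded index.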

minor comments (3)

[Scaling Laws] Figure captions for scaling curves should explicitly list the fitted exponents and any confidence intervals; current plots are difficult to reproduce from the text alone.
[Abstract] The abstract uses 'longtermism' without definition; a one-sentence gloss would improve accessibility for readers outside the immediate subfield.
[Evaluation] Several benchmark tables lack standard deviations or number of runs; adding these would strengthen the statistical interpretation of the reported deltas over LLaMA-2.
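
To make the first minor comment concrete, a fitted exponent of the kind that could be listed in a figure caption can be recovered with a log-log least-squares fit. The sketch below assumes a pure power law of the form y = a * x^b, which is an illustrative functional form and not necessarily the one fitted in the paper; `fit_power_law` is a hypothetical helper name.

```python
import math


def fit_power_law(xs, ys):
    """Fit y = a * x**b by ordinary least squares in log-log space.

    Returns (a, b). Assumes all xs and ys are strictly positive and that
    at least two distinct x values are supplied.
    """
    log_x = [math.log(x) for x in xs]
    log_y = [math.log(y) for y in ys]
    n = len(log_x)
    mean_x = sum(log_x) / n
    mean_y = sum(log_y) / n
    sxx = sum((lx - mean_x) ** 2 for lx in log_x)
    sxy = sum((lx - mean_x) * (ly - mean_y) for lx, ly in zip(log_x, log_y))
    b = sxy / sxx                       # slope in log-log space is the exponent
    a = math.exp(mean_y - b * mean_x)   # intercept gives the coefficient
    return a, b
```

Applied to (compute, loss) pairs read off a scaling curve, the returned `b` is the exponent a caption would report. Confidence intervals could be added by bootstrapping the input points, which is omitted here for brevity.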

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the constructive feedback and positive overall assessment. We address each major comment below, indicating where revisions will be made to enhance verifiability and transparency.


Referee: [Evaluation] Open-ended evaluation section: the claim that DeepSeek LLM 67B Chat exhibits superior performance to GPT-3.5 rests on unspecified details of the evaluation protocol (prompting strategy, judge model or human raters, and any agreement metrics). Without these, the result cannot be independently verified and is load-bearing for the chat-model contribution.

Authors: We agree that full specification of the evaluation protocol is necessary for independent verification of the open-ended results. In the revised manuscript we will expand the relevant section to detail the prompting strategy, the judge model employed, the involvement of human raters (if any), and quantitative agreement metrics such as inter-rater reliability scores. These additions will directly support the claim of superior performance relative to GPT-3.5.

revision: yes

Referee: [§5] Benchmark results (tables in §5): while decontamination steps are described, the paper does not report the fraction of test-set overlap removed or provide before/after scores; this leaves open the possibility that domain-specific gains (code/math) partly reflect data leakage rather than model capability.

Authors: We acknowledge the value of quantifying the decontamination impact. We will revise the methods and results sections to report the fraction of test-set overlap removed for each benchmark category. However, providing complete before/after benchmark scores would require retraining the 67B model on the full 2-trillion-token corpus without decontamination, which is computationally prohibitive. We will instead clarify that the described decontamination procedure was applied uniformly and that performance advantages appear consistently across diverse benchmarks.

revision: partial
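
One agreement metric of the kind mentioned in the first response is Cohen's kappa between two raters, for example a judge model and a human annotator labelling the same pairwise comparisons. The sketch below is illustrative only: the label values and the function name `cohens_kappa` are assumptions, and nothing here reflects the protocol actually used for the open-ended evaluation.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labelling the same items.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is observed agreement and
    p_e is the agreement expected by chance from each rater's label rates.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both raters must label the same non-empty item list")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)
```

A value near 1 indicates agreement well beyond chance, while a value near 0 indicates agreement no better than chance, which would weaken any win-rate claim built on those labels.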

standing simulated objections not resolved

Provision of before/after benchmark scores comparing models trained with and without decontamination, due to the prohibitive computational cost of retraining at 2-trillion-token scale.

Circularity Check

0 steps flagged

No significant circularity; empirical benchmark results are self-contained


The paper's core claims consist of observed performance deltas on external benchmarks (code, math, reasoning, open-ended chat) after training a 67B model on an expanding 2T-token corpus followed by SFT+DPO. Scaling-law experiments are described as guiding dataset and model choices but do not reduce any reported result to a fitted parameter renamed as a prediction; the evaluation protocols, decontamination steps, and few-shot settings are stated explicitly and independently of the final scores. No self-definitional equations, load-bearing self-citations, or ansatz smuggling appear in the derivation chain.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

The central claims rest on empirical training runs and benchmark comparisons rather than new theoretical derivations; the main unstated inputs are standard assumptions about scaling laws and the effectiveness of SFT/DPO.

free parameters (2)

model scale: 7B and 67B sizes selected after scaling-law study
pre-training data volume: 2 trillion tokens assembled for the reported runs

axioms (2)

domain assumption: Scaling laws reliably predict performance gains with increased model size and data.
Paper states it delved into scaling laws to guide the 7B/67B choices.
domain assumption: SFT followed by DPO produces aligned chat models that generalize on benchmarks.
Used to create the Chat variants whose superiority is claimed.
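
For readers unfamiliar with the second axiom, the standard DPO objective for a single preference pair can be sketched as follows. The inputs are summed token log-probabilities of the chosen and rejected responses under the policy being trained and under a frozen reference model. The default `beta=0.1` and the function name `dpo_loss` are illustrative assumptions, not hyperparameters or code reported for DeepSeek Chat.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-pair DPO loss: -log(sigmoid(beta * margin)).

    margin compares how much more the policy prefers the chosen response
    over the rejected one, relative to the reference model.
    """
    margin = ((policy_chosen_logp - ref_chosen_logp)
              - (policy_rejected_logp - ref_rejected_logp))
    z = beta * margin
    # Numerically stable form of log(1 + exp(-z))
    if z >= 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))
```

When the policy equals the reference, the margin is zero and the loss is log 2. The loss falls as the policy shifts probability toward the chosen response relative to the reference.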

pith-pipeline@v0.9.0 · 5848 in / 1451 out tokens · 62866 ms · 2026-05-11T06:03:01.583359+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith.Foundation.DAlembert.Inevitability bilinear_family_forced (unclear)
Paper passage: Our evaluation results demonstrate that DeepSeek LLM 67B surpasses LLaMA-2 70B on various benchmarks, particularly in the domains of code, mathematics, and reasoning.

IndisputableMonolith.Foundation.PhiForcing phi_equation (unclear)
Paper passage: Guided by the scaling laws, we introduce DeepSeek LLM, a project dedicated to advancing open-source language models with a long-term perspective.

IndisputableMonolith.Foundation.LedgerForcing conservation_from_balance (unclear)
Paper passage: To support the pre-training phase, we have developed a dataset that currently consists of 2 trillion tokens

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Large Language Diffusion Models
cs.CL 2025-02 unverdicted novelty 8.0

LLaDA is a scalable diffusion-based language model that matches autoregressive LLMs like LLaMA3 8B on tasks and surpasses GPT-4o on reversal poem completion.
Structured-Sparse Attention for Entity Tracking with Subquadratic Sequence Complexity
cs.LG 2026-05 unverdicted novelty 7.0

Derives a blockwise resolvent-style attention operator that exploits structured sparsity for subquadratic O(n^{4/3}d) entity tracking while matching dense accuracy.
Peak-Detector: Explainable Peak Detection via Instruction-Tuned Large Language Models in Physiological Sign
cs.LG 2026-05 unverdicted novelty 7.0

Peak-Detector uses instruction-tuned LLMs and a condensed peak-representation of time-series data to achieve robust cross-modal peak detection with self-generated explanations across ECG, PPG, BCG, and BSG signals.
IndicMedDialog: A Parallel Multi-Turn Medical Dialogue Dataset for Accessible Healthcare in Indic Languages
cs.CL 2026-05 unverdicted novelty 7.0

A parallel multi-turn medical dialogue dataset spanning English and nine Indic languages is created from synthetic consultations to enable personalized AI healthcare interactions.
Causal Bias Detection in Generative Artificial Intelligence
cs.AI 2026-05 unverdicted novelty 7.0

Develops a causal framework unifying generative AI fairness with standard ML, with new decompositions, identification conditions, and estimators demonstrated on LLM race and gender bias.
ENEC: A Lossless AI Model Compression Method Enabling Fast Inference on Ascend NPUs
cs.AR 2026-03 unverdicted novelty 7.0

ENEC delivers 3.43X higher throughput than DietGPU and 1.12X better compression ratio than nvCOMP for lossless model weight compression on Ascend NPUs, yielding up to 6.3X end-to-end inference speedup.
The Stepwise Informativeness Assumption: Why are Entropy Dynamics and Reasoning Correlated in LLMs?
cs.CL 2026-03 unverdicted novelty 7.0

The Stepwise Informativeness Assumption explains the correlation between LLM entropy dynamics and reasoning correctness by positing that correct traces accumulate answer-relevant information stepwise during generation.
Obliviator Reveals the Cost of Nonlinear Guardedness in Concept Erasure
cs.LG 2026-03 unverdicted novelty 7.0

Obliviator introduces an iterative kernel-based optimization for nonlinear concept erasure that quantifies the utility cost of guarding against nonlinear adversaries and outperforms prior methods on trade-off curves.
UniGeoSeg: Towards Unified Open-World Segmentation for Geospatial Scenes
cs.CV 2025-11 conditional novelty 7.0

UniGeoSeg releases the first million-scale dataset for instruction-driven remote sensing segmentation and a unified model that achieves state-of-the-art results with strong zero-shot generalization.
Assessing Coherency and Consistency of Code Execution Reasoning by Large Language Models
cs.SE 2025-10 unverdicted novelty 7.0

LLMs achieve 81% coherent execution simulation on HumanEval but show mostly random or weak consistency across tests, with frontier models relying on natural language shortcuts instead of true program analysis.
Seeing Isn't Orienting: A Cognitively Grounded Benchmark Reveals Systematic Orientation Failures in MLLMs
cs.CV 2025-05 unverdicted novelty 7.0

DORI benchmark shows top vision-language models reach only 54.2% accuracy on coarse orientation tasks and 33% on granular judgments, with sharp drops on reference-frame shifts and compound rotations.
Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation
cs.CV 2024-10 unverdicted novelty 7.0

Janus decouples visual encoding into task-specific pathways inside a single autoregressive transformer to unify multimodal understanding and generation while outperforming earlier unified models.
Task Prompt Vectors: Effective Initialization through Multi-Task Soft-Prompt Transfer
cs.CL 2024-08 unverdicted novelty 7.0

Task prompt vectors, formed by subtracting random initialization from tuned soft prompts, support low-resource initialization and arithmetic combination across tasks on 12 NLU datasets while remaining independent of i...
Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation
cs.CV 2024-06 conditional novelty 7.0

Scaled vanilla autoregressive models based on Llama achieve 2.18 FID on ImageNet 256x256 image generation, beating popular diffusion models without visual inductive biases.
DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model
cs.CL 2024-05 unverdicted novelty 7.0

DeepSeek-V2 delivers top-tier open-source LLM performance using only 21B active parameters by compressing the KV cache 93.3% and cutting training costs 42.5% via MLA and DeepSeekMoE.
CodeMind: Evaluating Large Language Models for Code Reasoning
cs.SE 2024-02 unverdicted novelty 7.0

CodeMind evaluates ten LLMs on four benchmarks using three new code reasoning tasks, finding performance varies by model size and drops with complexity while showing no correlation with bug repair ability.
Quantifying Hyperparameter Transfer and the Importance of Embedding Layer Learning Rate
cs.LG 2026-05 unverdicted novelty 6.0

A framework quantifies hyperparameter transfer via scaling-law fit quality, extrapolation robustness, and loss penalty, with ablations showing that μP's advantage over standard parameterization stems from maximizing t...
Reading Calibrated Uncertainty from Language Model Trajectories
cs.LG 2026-05 unverdicted novelty 6.0

Geometric features from per-layer MLP update trajectories fed to a sparse linear probe outperform maximum softmax probability for uncertainty quantification under selective abstention, with gains up to 21 AURC points.
Contextualized Code Pretraining for Code Generation
cs.SE 2026-05 unverdicted novelty 6.0

Introduces contextualized code pretraining with caller-callee pairs from static analysis to train CallerGen models that outperform baselines on the new CallerEval benchmark.
SEED: Targeted Data Selection by Weighted Independent Set
cs.LG 2026-05 unverdicted novelty 6.0

SEED models data selection as Weighted Independent Set on a similarity graph, using node value calibration and local scale normalization to produce compact high-quality training subsets that outperform prior methods o...
Know When To Fold 'Em: Token-Efficient LLM Synthetic Data Generation via Multi-Stage In-Flight Rejection
cs.AI 2026-05 unverdicted novelty 6.0

MSIFR stops faulty LLM generations early via staged rule-based checks, reducing token consumption 11-78% with no accuracy loss.
PEML: Parameter-efficient Multi-Task Learning with Optimized Continuous Prompts
cs.CL 2026-05 unverdicted novelty 6.0

PEML co-optimizes continuous prompts and low-rank adaptations to deliver up to 6.67% average accuracy gains over existing multi-task PEFT methods on GLUE, SuperGLUE, and other benchmarks.
SAGE: Scalable Automated Robustness Augmentation for LLM Knowledge Evaluation
cs.CL 2026-05 unverdicted novelty 6.0

SAGE trains a rubric-based verifier and an RL-optimized generator on seed human data to scalably augment LLM knowledge benchmarks, matching human-annotated quality on HellaSwag at lower cost and generalizing to MMLU.
Causal Bias Detection in Generative Artificial Intelligence
cs.AI 2026-05 unverdicted novelty 6.0

A causal framework unifies fairness analysis across generative AI and standard ML by deriving decompositions that separate biases along causal pathways and differences between real-world and model mechanisms.
Navigating the Sea of LLM Evaluation: Investigating Bias in Toxicity Benchmarks
cs.AI 2026-05 unverdicted novelty 6.0

Toxicity benchmarks for LLMs produce inconsistent results when task type, input domain, or model changes, revealing intrinsic evaluation biases.
Training continuously-coupled reconfigurable photonic chips with quantum machine learning
quant-ph 2026-05 unverdicted novelty 6.0

A black-box machine learning technique trains continuously-coupled photonic waveguide arrays to implement target unitaries using limited single- and two-photon measurements without requiring detailed internal models.
Predicting Large Model Test Losses with a Noisy Quadratic System
cs.LG 2026-05 unverdicted novelty 6.0

A noisy quadratic system predicts large model test losses from N, B, K and outperforms Chinchilla's model for extrapolation up to 1000x compute.
DSPE: An Energy-Efficient Edge Processor for DeepSeek Inference with MerkleTree-based Incremental Pruning, Multi-Stage Boothing Lookup and Dynamic Adaptive Posit Processing
cs.AR 2026-05 unverdicted novelty 6.0

DSPE is an edge processor that achieves 109.4 TFLOPS/W for DeepSeek inference using Merkle tree-based incremental pruning, multi-stage boothing lookup, and dynamic adaptive posit processing.
RELO: Reinforcement Learning to Localize for Visual Object Tracking
cs.CV 2026-05 unverdicted novelty 6.0

RELO formulates visual object tracking localization as a Markov decision process solved by reinforcement learning with combined IoU and AUC rewards, augmented by layer-aligned temporal token propagation, and reports 5...
RELO: Reinforcement Learning to Localize for Visual Object Tracking
cs.CV 2026-05 unverdicted novelty 6.0

RELO replaces handcrafted spatial priors with a reinforcement learning policy for target localization in visual tracking and reports 57.5% AUC on LaSOText without template updates.
Why Does Agentic Safety Fail to Generalize Across Tasks?
cs.LG 2026-05 conditional novelty 6.0

Agentic safety fails to generalize across tasks because the task-to-safe-controller mapping has a higher Lipschitz constant than the task-to-controller mapping alone, as proven in linear-quadratic control and demonstr...
InfoLaw: Information Scaling Laws for Large Language Models with Quality-Weighted Mixture Data and Repetition
cs.CL 2026-05 unverdicted novelty 6.0

InfoLaw models pretraining as information accumulation where quality sets information density and repetition causes scale-dependent diminishing returns, predicting loss with low error on unseen mixtures and larger sca...
Rethinking LLM Ensembling from the Perspective of Mixture Models
cs.LG 2026-05 unverdicted novelty 6.0

ME reinterprets LLM ensembling as a mixture model by sampling a single model stochastically at each token step, matching the ensemble distribution while invoking only one model per step for substantial speed gains.
ReaGeo: Reasoning-Enhanced End-to-End Geocoding with LLMs
cs.AI 2026-04 unverdicted novelty 6.0

ReaGeo is an end-to-end LLM framework for geocoding that uses geohash text generation, Chain-of-Thought spatial reasoning, and distance-based RL to accurately predict points and regions from explicit and vague queries.
Universally Empowering Zeroth-Order Optimization via Adaptive Layer-wise Sampling
cs.LG 2026-04 unverdicted novelty 6.0

AdaLeZO uses a non-stationary multi-armed bandit to adaptively allocate perturbation budget across layers in zeroth-order optimization and applies inverse probability weighting to reduce variance while preserving unbi...
Towards Faster Language Model Inference Using Mixture-of-Experts Flow Matching
cs.AI 2026-04 unverdicted novelty 6.0

Mixture-of-experts flow matching enables non-autoregressive language models to achieve autoregressive-level quality in three sampling steps, delivering up to 1000x faster inference than diffusion models.
Dataset-Level Metrics Attenuate Non-Determinism: A Fine-Grained Non-Determinism Evaluation in Diffusion Language Models
cs.LG 2026-04 unverdicted novelty 6.0

Dataset-level metrics in diffusion language models mask substantial sample-level non-determinism that varies with model and system factors, which a new Factor Variance Attribution metric can decompose.
AFGNN: API Misuse Detection using Graph Neural Networks and Clustering
cs.SE 2026-04 unverdicted novelty 6.0

AFGNN detects API misuses in Java code more effectively than prior methods by representing usage as graphs and clustering learned embeddings from self-supervised training.
RAT+: Train Dense, Infer Sparse -- Recurrence Augmented Attention for Dilated Inference
cs.LG 2026-02 unverdicted novelty 6.0

RAT+ pretrains a dense recurrent-augmented attention model once and enables flexible switching to dilated or hybrid sparse attention at inference after short adaptation, with small accuracy loss at high dilation factors.
RAT+: Train Dense, Infer Sparse -- Recurrence Augmented Attention for Dilated Inference
cs.LG 2026-02 conditional novelty 6.0

RAT+ pretrains a single dense recurrent-augmented attention model that supports flexible dilated sparse inference after short adaptation, matching dense accuracy at moderate dilation and losing only 1-3 points at high...
SWaRL: Safeguard Code Watermarking via Reinforcement Learning
cs.CR 2026-01 unverdicted novelty 6.0

SWaRL trains code LLMs with RL using compiler correctness signals and a confidential verifier reward to embed robust, functionality-preserving watermarks that resist refactoring attacks.
Foundation Models for Discovery and Exploration in Chemical Space
physics.chem-ph 2025-10 unverdicted novelty 6.0

MIST models up to 10x larger than prior work, fine-tuned on over 400 structure-property tasks, match or exceed SOTA on benchmarks and demonstrate zero-shot olfactory perception mapping consistent with hyperbolic geometry.
MaskPro: Linear-Space Probabilistic Learning for Strict (N:M)-Sparsity on LLMs
cs.LG 2025-06 unverdicted novelty 6.0

MaskPro learns categorical distributions over groups of M weights to generate exact (N:M) sparsity via N-way sampling without replacement and stabilizes training with a moving average tracker of loss residuals.
Mixture-of-Experts Can Surpass Dense LLMs Under Strictly Equal Resource
cs.CL 2025-06 conditional novelty 6.0

MoE models with activation rates in an optimal region outperform dense LLMs of identical total parameter count, training compute, and data budget, with the optimal region consistent across scales.
LLaDA-V: Large Language Diffusion Models with Visual Instruction Tuning
cs.LG 2025-05 conditional novelty 6.0

LLaDA-V is a diffusion-based multimodal large language model that reaches competitive or state-of-the-art results on visual instruction tasks while using a non-autoregressive architecture.
Extracting memorized pieces of (copyrighted) books from open-weight language models
cs.CL 2025-05 conditional novelty 6.0

A new extraction technique applied to 200 books and 14 LLMs finds that memorization of full books is rare except in specific high-capacity models where entire texts can be recovered verbatim.
MegaScale-Data: Scaling Dataloader for Multisource Large Foundation Model Training
cs.DC 2025-04 unverdicted novelty 6.0

MegaScale-Data is a distributed data loading system that disaggregates preprocessing and applies auto-partitioning to deliver 4.5x higher end-to-end training throughput and 13.5x lower CPU memory usage for multisource...
A Study of LLMs' Preferences for Libraries and Programming Languages
cs.SE 2025-03 unverdicted novelty 6.0

Empirical study of eight LLMs finds overuse of popular libraries like NumPy in up to 45% of unnecessary cases and strong default preference for Python even when suboptimal.
Muon is Scalable for LLM Training
cs.LG 2025-02 unverdicted novelty 6.0

Muon optimizer with weight decay and update scaling achieves ~2x efficiency over AdamW for large LLMs, shown via the Moonlight 3B/16B MoE model trained on 5.7T tokens.
HuatuoGPT-o1, Towards Medical Complex Reasoning with LLMs
cs.CL 2024-12 unverdicted novelty 6.0

HuatuoGPT-o1 achieves superior medical complex reasoning by using a verifier to curate reasoning trajectories for fine-tuning and then applying RL with verifier-based rewards.
Rewarding Progress: Scaling Automated Process Verifiers for LLM Reasoning
cs.LG 2024-10 unverdicted novelty 6.0

Process advantage verifiers trained to predict step-level progress under a distinct prover policy improve LLM reasoning accuracy by over 8% and sample efficiency by 5-6x over outcome reward models.
Optimization Hyper-parameter Laws for Large Language Models
cs.LG 2024-09 unverdicted novelty 6.0

Opt-Laws predicts LLM final training loss from LR schedules via SDE-derived convergence and escape features, with 94% Top-2 hit rate on held-out schedules and F1=0.92 for divergence detection.
Scaling Synthetic Data Creation with 1,000,000,000 Personas
cs.CL 2024-06 unverdicted novelty 6.0

A curated set of one billion personas enables scalable, diverse synthetic data generation for LLM training across reasoning, instructions, knowledge, NPCs, and tools.
DataComp-LM: In search of the next generation of training sets for language models
cs.LG 2024-06 unverdicted novelty 6.0

DCLM-Baseline dataset lets a 7B model reach 64% 5-shot MMLU accuracy after 2.6T tokens, beating prior open-data models by 6.6 points on MMLU with 40% less compute.
MiniCPM: Unveiling the Potential of Small Language Models with Scalable Training Strategies
cs.CL 2024-04 conditional novelty 6.0

MiniCPM 1.2B and 2.4B models reach parity with 7B-13B LLMs via model wind-tunnel scaling and a WSD scheduler that yields a higher optimal data-to-model ratio than Chinchilla scaling.
Are We on the Right Way for Evaluating Large Vision-Language Models?
cs.CV 2024-03 conditional novelty 6.0

Current LVLM benchmarks overestimate capabilities because many questions can be answered without images due to design flaws or data leakage; MMStar is a human-curated set of 1,500 vision-indispensable samples across 6...
LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code
cs.SE 2024-03 unverdicted novelty 6.0

LiveCodeBench collects 400 recent contest problems to create a contamination-free benchmark evaluating LLMs on code generation and related capabilities like self-repair and execution.
DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models
cs.CL 2024-02 unverdicted novelty 6.0

DeepSeekMath 7B reaches 51.7% on MATH via continued pretraining on curated web math data and Group Relative Policy Optimization.
MoE-LLaVA: Mixture of Experts for Large Vision-Language Models
cs.CV 2024-01 conditional novelty 6.0

MoE-LLaVA applies mixture-of-experts sparsity to LVLMs via MoE-Tuning, delivering LLaVA-1.5-7B level visual understanding and better hallucination resistance with only ~3B active parameters.
Adversarial Reframing: A Framework for Targeted Generation in Language Models
cs.CR 2026-05 unverdicted novelty 5.0

THREAT uses coordinated LLMs in an iterative optimization loop to generate jailbreak prompts that achieve higher success rates and lower detection rates than previous methods across tested models and datasets.

Reference graph

Works this paper leans on

128 extracted references · 128 canonical work pages · cited by 86 Pith papers · 37 internal anchors

[2]

Introducing Claude, 2023

Anthropic. Introducing Claude, 2023. URL https://www.anthropic.com/index/introducing-claude

work page 2023
[6]

T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. M. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, and D. Amod...

work page 2020
[7]

M. Chen, J. Tworek, H. Jun, Q. Yuan, H. P. de Oliveira Pinto, J. Kaplan, H. Edwards, Y. Burda, N. Joseph, G. Brockman, A. Ray, R. Puri, G. Krueger, M. Petrov, H. Khlaaf, G. Sastry, P. Mishkin, B. Chan, S. Gray, N. Ryder, M. Pavlov, A. Power, L. Kaiser, M. Bavarian, C. Winter, P. Tillet, F. P. Such, D. Cummings, M. Plappert, F. Chantzis, E. Barnes, A. Herb...

work page internal anchor Pith review Pith/arXiv arXiv 2021
[10]

RedPajama: An open dataset for training large language models, 2023

Together Computer. RedPajama: An open dataset for training large language models, 2023. URL https://github.com/togethercomputer/RedPajama-Data

work page 2023
[12]

T. Dao. FlashAttention-2: Faster attention with better parallelism and work partitioning. 2023

work page 2023
[13]

T. Dao, D. Y. Fu, S. Ermon, A. Rudra, and C. Ré. FlashAttention: Fast and memory-efficient exact attention with IO-awareness. In Advances in Neural Information Processing Systems, 2022

work page 2022
[14]

Z. Du, Y. Qian, X. Liu, M. Ding, J. Qiu, Z. Yang, and J. Tang. GLM: General language model pretraining with autoregressive blank infilling. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 320–335, 2022

work page 2022
[16]

L. Gao, S. Biderman, S. Black, L. Golding, T. Hoppe, C. Foster, J. Phang, H. He, A. Thite, N. Nabeshima, et al. The Pile: An 800GB dataset of diverse text for language modeling. arXiv preprint arXiv:2101.00027, 2020

work page internal anchor Pith review Pith/arXiv arXiv 2020
[17]

An important next step on our AI journey, 2023

Google. An important next step on our AI journey, 2023. URL https://blog.google/technology/ai/bard-google-ai-search-updates/

work page 2023
[24]

Hai-llm: An efficient and lightweight training tool for large models, 2023

High-flyer. Hai-llm: An efficient and lightweight training tool for large models, 2023. URL https://www.high-flyer.cn/en/blog/hai-llm

work page 2023
[26]

C-eval: A multi-level multi-discipline chinese evaluation suite for foundation models

Y. Huang, Y. Bai, Z. Zhu, J. Zhang, J. Zhang, T. Su, J. Liu, C. Lv, Y. Zhang, J. Lei, et al. C-Eval: A multi-level multi-discipline chinese evaluation suite for foundation models. arXiv preprint arXiv:2305.08322, 2023

work page arXiv 2023
[27]

Tokenizers: Fast state-of-the-art tokenizers optimized for research and production, 2019

Huggingface Team. Tokenizers: Fast state-of-the-art tokenizers optimized for research and production, 2019. URL https://github.com/huggingface/tokenizers

work page 2019
[28]

F. Shi, M. Suzgun, M. Freitag, X. Wang, S. Srivats, S. Vosoughi, H. W. Chung, Y. Tay, S. Ruder, D. Zhou, D. Das, and J. Wei. Language models are multilingual chain-of-thought reasoners. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. OpenReview.net, 2023. URL https://openreview.net/pdf?id=fR3wGCk-IXp

work page 2023
[29]

Camels in a changing climate: Enhancing lm adaptation with tulu 2

H. Ivison, Y. Wang, V. Pyatkin, N. Lambert, M. Peters, P. Dasigi, J. Jang, D. Wadden, N. A. Smith, I. Beltagy, and H. Hajishirzi. Camels in a changing climate: Enhancing lm adaptation with tulu 2. 2023

work page 2023
[33]

V. A. Korthikanti, J. Casper, S. Lym, L. McAfee, M. Andersch, M. Shoeybi, and B. Catanzaro. Reducing activation recomputation in large transformer models. Proceedings of Machine Learning and Systems, 5, 2023

work page 2023
[35]

W. Kwon, Z. Li, S. Zhuang, Y. Sheng, L. Zheng, C. H. Yu, J. E. Gonzalez, H. Zhang, and I. Stoica. Efficient memory management for large language model serving with pagedattention. In Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles, 2023

work page 2023
[37]

H. Li, Y. Zhang, F. Koto, Y. Yang, H. Zhao, Y. Gong, N. Duan, and T. Baldwin. CMMLU: Measuring massive multitask language understanding in Chinese. arXiv preprint arXiv:2306.09212, 2023

work page internal anchor Pith review arXiv 2023
[38]

W. Li, F. Qi, M. Sun, X. Yi, and J. Zhang. Ccpm: A chinese classical poetry matching dataset, 2021

work page 2021
[43]

Can a suit of armor conduct electricity? a new dataset for open book question answering, 2018

T. Mihaylov, P. Clark, T. Khot, and A. Sabharwal. Can a suit of armor conduct electricity? a new dataset for open book question answering, 2018

work page 2018
[44]

Efficient large-scale language model training on gpu clusters using megatron-lm

D. Narayanan, M. Shoeybi, J. Casper, P. LeGresley, M. Patwary, V. Korthikanti, D. Vainbrand, P. Kashinkunti, J. Bernauer, B. Catanzaro, et al. Efficient large-scale language model training on gpu clusters using megatron-lm. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pages 1–15, 2021

work page 2021
[45]

Introducing ChatGPT, 2022

OpenAI. Introducing ChatGPT, 2022. URL https://openai.com/blog/chatgpt

work page 2022
[46]

GPT-4 Technical Report

OpenAI. GPT4 technical report. arXiv preprint arXiv:2303.08774, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[47]

Training language models to follow instructions with human feedback

L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray, et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744, 2022

work page 2022
[49]

Language models are unsupervised multitask learners

A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever, et al. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019

work page 2019
[50]

Direct preference optimization: Your language model is secretly a reward model

R. Rafailov, A. Sharma, E. Mitchell, S. Ermon, C. D. Manning, and C. Finn. Direct preference optimization: Your language model is secretly a reward model. 2023

work page 2023
[51]

Zero: Memory optimizations toward training trillion parameter models

S. Rajbhandari, J. Rasley, O. Ruwase, and Y. He. Zero: Memory optimizations toward training trillion parameter models. In SC20: International Conference for High Performance Computing, Networking, Storage and Analysis, pages 1–16. IEEE, 2020

work page 2020
[52]

Winogrande: An adversarial winograd schema challenge at scale, 2019

K. Sakaguchi, R. L. Bras, C. Bhagavatula, and Y. Choi. Winogrande: An adversarial winograd schema challenge at scale, 2019

work page 2019
[53]

C. J. Shallue, J. Lee, J. Antognini, J. Sohl-Dickstein, R. Frostig, and G. E. Dahl. Measuring the effects of data parallelism on neural network training. Journal of Machine Learning Research, 20(112):1–49, 2019

work page 2019
[58]

J. Su, M. Ahmed, Y. Lu, S. Pan, W. Bo, and Y. Liu. Roformer: Enhanced transformer with rotary position embedding. Neurocomputing, 568:127063, 2024

work page 2024
[59]

K. Sun, D. Yu, D. Yu, and C. Cardie. Investigating prior knowledge for challenging chinese machine reading comprehension, 2019

work page 2019
[63]

Attention is all you need

A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017

work page 2017
[65]

J. Wei, X. Wang, D. Schuurmans, M. Bosma, B. Ichter, F. Xia, E. H. Chi, Q. V. Le, and D. Zhou. Chain-of-thought prompting elicits reasoning in large language models. In NeurIPS, 2022. URL http://papers.nips.cc/paper_files/paper/2022/hash/9d5609613524ecf4f15af0f7b31abca4-Abstract-Conference.html

work page 2022
[66]

T. Wei, J. Luan, W. Liu, S. Dong, and B. Wang. Cmath: Can your language model pass chinese elementary school math test?, 2023

work page 2023
[68]

A. Yang, B. Xiao, B. Wang, B. Zhang, C. Yin, C. Lv, D. Pan, D. Wang, D. Yan, F. Yang, F. Deng, F. Wang, F. Liu, G. Ai, G. Dong, H. Zhao, H. Xu, H. Sun, H. Zhang, H. Liu, J. Ji, J. Xie, J. Dai, K. Fang, L. Su, L. Song, L. Liu, L. Ru, L. Ma, M. Wang, M. Liu, M. Lin, N. Nie, P. Guo, R. Sun, T. Zhang, T. Li, T. Li, W. Cheng, W. Chen, X. Zeng, X. Wang, X. Chen...

work page 2023
[71]

Root mean square layer normalization

B. Zhang and R. Sennrich. Root mean square layer normalization. Advances in Neural Information Processing Systems, 32, 2019

work page 2019
[72]

Which algorithmic choices matter at which batch sizes? insights from a noisy quadratic model

G. Zhang, L. Li, Z. Nado, J. Martens, S. Sachdeva, G. Dahl, C. Shallue, and R. B. Grosse. Which algorithmic choices matter at which batch sizes? insights from a noisy quadratic model. Advances in neural information processing systems, 32, 2019

work page 2019
[74]

Judging llm-as-a-judge with mt-bench and chatbot arena

L. Zheng, W.-L. Chiang, Y. Sheng, S. Zhuang, Z. Wu, Y. Zhuang, Z. Lin, Z. Li, D. Li, E. P. Xing, H. Zhang, J. E. Gonzalez, and I. Stoica. Judging llm-as-a-judge with mt-bench and chatbot arena. 2023

work page 2023
[77]

Language models are multilingual chain-of-thought reasoners

Freda Shi, Mirac Suzgun, Markus Freitag, Xuezhi Wang, Suraj Srivats, Soroush Vosoughi, Hyung Won Chung, Yi Tay, Sebastian Ruder, Denny Zhou, Dipanjan Das, and Jason Wei. Language models are multilingual chain-of-thought reasoners. In The Eleventh International Conference on Learning Representations, 2023

work page 2023
[78]

Roformer: Enhanced transformer with rotary position embedding

Roformer: Enhanced transformer with rotary position embedding. Neurocomputing, 2024

work page 2024
[79]

GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints

GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints. arXiv preprint arXiv:2305.13245

work page internal anchor Pith review arXiv
[80]

WizardMath: Empowering Mathematical Reasoning for Large Language Models via Reinforced Evol-Instruct

Wizardmath: Empowering mathematical reasoning for large language models via reinforced evol-instruct. arXiv preprint arXiv:2308.09583

work page internal anchor Pith review arXiv
[81]

ToRA: A Tool-Integrated Reasoning Agent for Mathematical Problem Solving

Zhibin Gou, Zhihong Shao, Yeyun Gong, Yelong Shen, Yujiu Yang, Minlie Huang, Nan Duan, and Weizhu Chen. ToRA: A Tool-Integrated Reasoning Agent for Mathematical Problem Solving. CoRR, 2023. doi:10.48550/ARXIV.2309.17452

work page internal anchor Pith review doi:10.48550/arxiv.2309.17452 2023
[82]

Program of Thoughts Prompting: Disentangling Computation from Reasoning for Numerical Reasoning Tasks

Wenhu Chen, Xueguang Ma, Xinyi Wang, and William W. Cohen. Program of Thoughts Prompting: Disentangling Computation from Reasoning for Numerical Reasoning Tasks. CoRR, 2022. doi:10.48550/ARXIV.2211.12588

work page internal anchor Pith review doi:10.48550/arxiv.2211.12588 2022
[83]

International Conference on Machine Learning

Luyu Gao, Aman Madaan, Shuyan Zhou, Uri Alon, Pengfei Liu, Yiming Yang, Jamie Callan, and Graham Neubig. In International Conference on Machine Learning, 2023

work page 2023
[84]

Chain-of-thought prompting elicits reasoning in large language models

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed H. Chi, Quoc V. Le, and Denny Zhou. Chain-of-thought prompting elicits reasoning in large language models. In NeurIPS

work page
[85]

Swaroop Mishra, Matthew Finlayson, Pan Lu, Leonard Tang, Sean Welleck, Chitta Baral, Tanmay Rajpurohit, Oyvind Tafjord, Ashish Sabharwal, Peter Clark, and Ashwin Kalyan. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, 2022. doi:10.18653/V1/2022.EMNLP-MAIN.392

work page doi:10.18653/v1/2022.emnlp-main.392 2022
[86]

MAmmoTH: Building Math Generalist Models through Hybrid Instruction Tuning

Xiang Yue, Xingwei Qu, Ge Zhang, Yao Fu, Wenhao Huang, Huan Sun, Yu Su, and Wenhu Chen. MAmmoTH: Building Math Generalist Models through Hybrid Instruction Tuning. CoRR, 2023. doi:10.48550/ARXIV.2309.05653

work page internal anchor Pith review doi:10.48550/arxiv.2309.05653 2023
[87]

MetaMath: Bootstrap Your Own Mathematical Questions for Large Language Models

Longhui Yu, Weisen Jiang, Han Shi, Jincheng Yu, Zhengying Liu, Yu Zhang, James T. Kwok, Zhenguo Li, Adrian Weller, and Weiyang Liu. MetaMath: Bootstrap Your Own Mathematical Questions for Large Language Models. CoRR, 2023. doi:10.48550/ARXIV.2309.12284

work page internal anchor Pith review doi:10.48550/arxiv.2309.12284 2023
[88]

TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension

Mandar Joshi, Eunsol Choi, Daniel Weld, and Luke Zettlemoyer. TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2017. doi:10.18653/v1/P17-1147

work page doi:10.18653/v1/p17-1147 2017
[89]

Language Models are Few-Shot Learners

Language Models are Few-Shot Learners, 2020

work page 2020
[90]

Language models are unsupervised multitask learners

Language models are unsupervised multitask learners. OpenAI blog

work page
[91]

Introducing ChatGPT

OpenAI. Introducing ChatGPT

work page
[92]

HAI-LLM: An efficient and lightweight training tool for large models

work page
[93]

Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism

Megatron-lm: Training multi-billion parameter language models using model parallelism. arXiv preprint arXiv:1909.08053

work page internal anchor Pith review Pith/arXiv arXiv 1909
[94]

Efficient large-scale language model training on gpu clusters using megatron-lm

Efficient large-scale language model training on gpu clusters using megatron-lm. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis

work page
[95]

Reducing activation recomputation in large transformer models

Reducing activation recomputation in large transformer models. Proceedings of Machine Learning and Systems

work page
[96]

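FlashAttention: Fast and memory-efficient exact attention with IO-awareness

Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. FlashAttention: Fast and memory-efficient exact attention with IO-awareness. In Advances in Neural Information Processing Systems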

work page
[97]

Tri Dao. FlashAttention-2: Faster attention with better parallelism and work partitioning

work page
[98]

Efficient Memory Management for Large Language Model Serving with PagedAttention

Efficient Memory Management for Large Language Model Serving with PagedAttention. In Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles

work page
[99]

Attention is all you need

Attention is all you need. Advances in neural information processing systems

work page
[100]

Zero: Memory optimizations toward training trillion parameter models

Zero: Memory optimizations toward training trillion parameter models. In SC20: International Conference for High Performance Computing, Networking, Storage and Analysis, 2020

work page 2020
[101]

CCPM: A Chinese Classical Poetry Matching Dataset

CCPM: A Chinese Classical Poetry Matching Dataset, 2021

work page 2021
[102]

Can a Suit of Armor Conduct Electricity? A New Dataset for Open Book Question Answering

Can a Suit of Armor Conduct Electricity? A New Dataset for Open Book Question Answering, 2018

work page 2018
[103]

Introducing Claude

Anthropic. Introducing Claude

work page
[104]

An important next step on our AI journey

Google. An important next step on our AI journey

work page
[105]

Investigating Prior Knowledge for Challenging Chinese Machine Reading Comprehension

Investigating Prior Knowledge for Challenging Chinese Machine Reading Comprehension, 2019

work page 2019
[106]

A Span-Extraction Dataset for Chinese Machine Reading Comprehension

Yiming Cui, Ting Liu, Wanxiang Che, Li Xiao, Zhipeng Chen, Wentao Ma, Shijin Wang, and Guoping Hu. A Span-Extraction Dataset for Chinese Machine Reading Comprehension. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (E...

work page doi:10.18653/v1/d19-1600 2019
[107]

WinoGrande: An Adversarial Winograd Schema Challenge at Scale

WinoGrande: An Adversarial Winograd Schema Challenge at Scale, 2019

work page 2019
[108]

CMATH: Can Your Language Model Pass Chinese Elementary School Math Test?

CMATH: Can Your Language Model Pass Chinese Elementary School Math Test?, 2023

work page 2023
[109]

Measuring Massive Multitask Language Understanding

Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300

work page internal anchor Pith review Pith/arXiv arXiv 2009
[110]

Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them

Challenging big-bench tasks and whether chain-of-thought can solve them. arXiv preprint arXiv:2210.09261

work page internal anchor Pith review arXiv
[111]

Program Synthesis with Large Language Models

Program synthesis with large language models. arXiv preprint arXiv:2108.07732

work page internal anchor Pith review Pith/arXiv arXiv
[112]

Proceedings of the 28th International Conference on Computational Linguistics

Liang Xu, Hai Hu, Xuanwei Zhang, Lu Li, Chenjie Cao, Yudong Li, Yechen Xu, Kai Sun, Dian Yu, Cong Yu, Yin Tian, Qianqian Dong, Weitang Liu, Bo Shi, Yiming Cui, Junyi Li, Jun Zeng, Rongzhao Wang, Weijian Xie, Yanting Li, Yina Patterson, Zuoyu Tian, Yiwen Zhang, He Zhou, Shaoweihua Liu ...

work page doi:10.18653/v1/2020.coling-main.419 2020
[113]

Haonan Li, Yixuan Zhang, Fajri Koto, Yifei Yang, Hai Zhao, Yeyun Gong, Nan Duan, and Timothy Baldwin. CMMLU: Measuring massive multitask language understanding in Chinese

work page
[114]

Chujie Zheng, Minlie Huang, and Aixin Sun. ChID. In Proceedings of the 57th Conference of the Association for Computational Linguistics, 2019. doi:10.18653/V1/P19-1075

work page doi:10.18653/v1/p19-1075 2019
[115]

doi: 10.18653/v1/D17-1082

Guokun Lai, Qizhe Xie, Hanxiao Liu, Yiming Yang, and Eduard H. Hovy. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, 2017. doi:10.18653/V1/D17-1082

work page doi:10.18653/v1/d17-1082 2017
[116]

Dua, D., Wang, Y., Dasigi, P., Stanovsky, G., Singh, S., and Gardner, M

Dheeru Dua, Yizhong Wang, Pradeep Dasigi, Gabriel Stanovsky, Sameer Singh, and Matt Gardner. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2019. doi:10.18653/V1/N19-1246

work page doi:10.18653/v1/n19-1246 2019
[117]

Yuzhen Huang, Yuzhuo Bai, Zhihao Zhu, Junlei Zhang, Jinghan Zhang, Tangjun Su, Junteng Liu, Chuancheng Lv, Yikai Zhang, Jiayi Lei, et al. C-Eval: A multi-level multi-discipline chinese evaluation suite for foundation models

work page
[118]

LLaMA: Open and Efficient Foundation Language Models

Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, et al. LLaMA: Open and Efficient Foundation Language Models. arXiv preprint arXiv:2302.13971

work page internal anchor Pith review Pith/arXiv arXiv
[119]

Llama 2: Open Foundation and Fine-Tuned Chat Models

Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton, et al. Llama 2: Open Foundation and Fine-Tuned Chat Models, 2023. doi:10.48550/arXiv.2307.09288

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2307.09288 2023

Showing first 80 references.

[1] [2]

Introducing Claude , 2023

Anthropic. Introducing Claude , 2023. URL https://www.anthropic.com/index/introducing-claude

work page 2023

[2] [6]

T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. M. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, and D. Amod...

work page 2020

[3] [7]

M. Chen, J. Tworek, H. Jun, Q. Yuan, H. P. de Oliveira Pinto, J. Kaplan, H. Edwards, Y. Burda, N. Joseph, G. Brockman, A. Ray, R. Puri, G. Krueger, M. Petrov, H. Khlaaf, G. Sastry, P. Mishkin, B. Chan, S. Gray, N. Ryder, M. Pavlov, A. Power, L. Kaiser, M. Bavarian, C. Winter, P. Tillet, F. P. Such, D. Cummings, M. Plappert, F. Chantzis, E. Barnes, A. Herb...

work page internal anchor Pith review Pith/arXiv arXiv 2021

[4] [10]

Computer

T. Computer. Redpajama: an open dataset for training large language models, 2023. URL https://github.com/togethercomputer/RedPajama-Data

work page 2023

[5] [12]

T. Dao. Flash A ttention-2: Faster attention with better parallelism and work partitioning. 2023

work page 2023

[6] [13]

T. Dao, D. Y. Fu, S. Ermon, A. Rudra, and C. R \'e . Flash A ttention: Fast and memory-efficient exact attention with IO -awareness. In Advances in Neural Information Processing Systems, 2022

work page 2022

[7] [14]

Z. Du, Y. Qian, X. Liu, M. Ding, J. Qiu, Z. Yang, and J. Tang. Glm: General language model pretraining with autoregressive blank infilling. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 320--335, 2022

work page 2022

[8] [16]

L. Gao, S. Biderman, S. Black, L. Golding, T. Hoppe, C. Foster, J. Phang, H. He, A. Thite, N. Nabeshima, et al. The Pile : An 800GB dataset of diverse text for language modeling. arXiv preprint arXiv:2101.00027, 2020

work page internal anchor Pith review Pith/arXiv arXiv 2020

[9] [17]

An important next step on our AI journey, 2023

Google. An important next step on our AI journey, 2023. URL https://blog.google/technology/ai/bard-google-ai-search-updates/

work page 2023

[10] [24]

Hai-llm: 高效且轻量的大模型训练工具, 2023

High-flyer. Hai-llm: 高效且轻量的大模型训练工具, 2023. URL https://www.high-flyer.cn/en/blog/hai-llm

work page 2023

[11] [26]

C-eval: A multi-level multi-discipline chinese evaluation suite for foundation models

Y. Huang, Y. Bai, Z. Zhu, J. Zhang, J. Zhang, T. Su, J. Liu, C. Lv, Y. Zhang, J. Lei, et al. C-Eval : A multi-level multi-discipline chinese evaluation suite for foundation models. arXiv preprint arXiv:2305.08322, 2023

work page arXiv 2023

[12] [27]

Tokenizers : Fast state-of-the-art tokenizers optimized for research and production, 2019

Huggingface Team . Tokenizers : Fast state-of-the-art tokenizers optimized for research and production, 2019. URL https://github.com/huggingface/tokenizers

work page 2019

[13] [28]

F. i, M. Suzgun, M. Freitag, X. Wang, S. Srivats, S. Vosoughi, H. W. Chung, Y. Tay, S. Ruder, D. Zhou, D. Das, and J. Wei. Language models are multilingual chain-of-thought reasoners. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023 . OpenReview.net, 2023. URL https://openreview.net/pdf?id=fR3wGCk-IXp

work page 2023

[14] [29]

Ivison, Y

H. Ivison, Y. Wang, V. Pyatkin, N. Lambert, M. Peters, P. Dasigi, J. Jang, D. Wadden, N. A. Smith, I. Beltagy, and H. Hajishirzi. Camels in a changing climate: Enhancing lm adaptation with tulu 2. 2023

work page 2023

[15] [33]

V. A. Korthikanti, J. Casper, S. Lym, L. McAfee, M. Andersch, M. Shoeybi, and B. Catanzaro. Reducing activation recomputation in large transformer models. Proceedings of Machine Learning and Systems, 5, 2023

work page 2023

[16] [35]

W. Kwon, Z. Li, S. Zhuang, Y. Sheng, L. Zheng, C. H. Yu, J. E. Gonzalez, H. Zhang, and I. Stoica. Efficient memory management for large language model serving with pagedattention. In Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles, 2023

work page 2023

[17] [37]

H. Li, Y. Zhang, F. Koto, Y. Yang, H. Zhao, Y. Gong, N. Duan, and T. Baldwin. CMMLU : Measuring massive multitask language understanding in Chinese . arXiv preprint arXiv:2306.09212, 2023

work page internal anchor Pith review arXiv 2023

[18] [38]

W. Li, F. Qi, M. Sun, X. Yi, and J. Zhang. Ccpm: A chinese classical poetry matching dataset, 2021

work page 2021

[19] [43]

Mihaylov, P

T. Mihaylov, P. Clark, T. Khot, and A. Sabharwal. Can a suit of armor conduct electricity? a new dataset for open book question answering, 2018

work page 2018

[20] [44]

Narayanan, M

D. Narayanan, M. Shoeybi, J. Casper, P. LeGresley, M. Patwary, V. Korthikanti, D. Vainbrand, P. Kashinkunti, J. Bernauer, B. Catanzaro, et al. Efficient large-scale language model training on gpu clusters using megatron-lm. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pages 1--15, 2021

work page 2021

[21] [45]

Introducing ChatGPT , 2022

OpenAI. Introducing ChatGPT , 2022. URL https://openai.com/blog/chatgpt

work page 2022

[22] [46]

GPT-4 Technical Report

OpenAI. GPT4 technical report. arXiv preprint arXiv:2303.08774, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[23] [47]

Ouyang, J

L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray, et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35: 0 27730--27744, 2022

work page 2022

[24] [49]

Radford, J

A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever, et al. Language models are unsupervised multitask learners. OpenAI blog, 1 0 (8): 0 9, 2019

work page 2019

[25] [50]

Rafailov, A

R. Rafailov, A. Sharma, E. Mitchell, S. Ermon, C. D. Manning, and C. Finn. Direct preference optimization: Your language model is secretly a reward model. 2023

work page 2023

[26] [51]

Rajbhandari, J

S. Rajbhandari, J. Rasley, O. Ruwase, and Y. He. Zero: Memory optimizations toward training trillion parameter models. In SC20: International Conference for High Performance Computing, Networking, Storage and Analysis, pages 1--16. IEEE, 2020

work page 2020

[27] [52]

Sakaguchi, R

K. Sakaguchi, R. L. Bras, C. Bhagavatula, and Y. Choi. Winogrande: An adversarial winograd schema challenge at scale, 2019

work page 2019

[28] [53]

C. J. Shallue, J. Lee, J. Antognini, J. Sohl-Dickstein, R. Frostig, and G. E. Dahl. Measuring the effects of data parallelism on neural network training. Journal of Machine Learning Research, 20 0 (112): 0 1--49, 2019

work page 2019

[29] [58]

J. Su, M. Ahmed, Y. Lu, S. Pan, W. Bo, and Y. Liu. Roformer: Enhanced transformer with rotary position embedding. Neurocomputing, 568: 0 127063, 2024

work page 2024

[30] [59]

K. Sun, D. Yu, D. Yu, and C. Cardie. Investigating prior knowledge for challenging chinese machine reading comprehension, 2019

work page 2019

[31] [63]

Vaswani, N

A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, . Kaiser, and I. Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017

work page 2017

[32] [65]

J. Wei, X. Wang, D. Schuurmans, M. Bosma, B. Ichter, F. Xia, E. H. Chi, Q. V. Le, and D. Zhou. Chain-of-thought prompting elicits reasoning in large language models. In NeurIPS, 2022. URL http://papers.nips.cc/paper\_files/paper/2022/hash/9d5609613524ecf4f15af0f7b31abca4-Abstract-Conference.html

work page 2022

[33] [66]

T. Wei, J. Luan, W. Liu, S. Dong, and B. Wang. Cmath: Can your language model pass chinese elementary school math test?, 2023

work page 2023

[34] [68]

A. Yang, B. Xiao, B. Wang, B. Zhang, C. Yin, C. Lv, D. Pan, D. Wang, D. Yan, F. Yang, F. Deng, F. Wang, F. Liu, G. Ai, G. Dong, H. Zhao, H. Xu, H. Sun, H. Zhang, H. Liu, J. Ji, J. Xie, J. Dai, K. Fang, L. Su, L. Song, L. Liu, L. Ru, L. Ma, M. Wang, M. Liu, M. Lin, N. Nie, P. Guo, R. Sun, T. Zhang, T. Li, T. Li, W. Cheng, W. Chen, X. Zeng, X. Wang, X. Chen...

work page 2023

[35] [71]

Zhang and R

B. Zhang and R. Sennrich. Root mean square layer normalization. Advances in Neural Information Processing Systems, 32, 2019

work page 2019

[36] [72]

Zhang, L

G. Zhang, L. Li, Z. Nado, J. Martens, S. Sachdeva, G. Dahl, C. Shallue, and R. B. Grosse. Which algorithmic choices matter at which batch sizes? insights from a noisy quadratic model. Advances in neural information processing systems, 32, 2019

work page 2019

[37] [74]

Zheng, W.-L

L. Zheng, W.-L. Chiang, Y. Sheng, S. Zhuang, Z. Wu, Y. Zhuang, Z. Lin, Z. Li, D. Li, E. P. Xing, H. Zhang, J. E. Gonzalez, and I. Stoica. Judging llm-as-a-judge with mt-bench and chatbot arena. 2023

work page 2023

[38] [77]

Freda Shi, Mirac Suzgun, Markus Freitag, Xuezhi Wang, Suraj Srivats, Soroush Vosoughi, Hyung Won Chung, Yi Tay, Sebastian Ruder, Denny Zhou, Dipanjan Das, and Jason Wei. Language models are multilingual chain-of-thought reasoners. In The Eleventh International Conference on Learning Representations, 2023

[39] [78]

RoFormer: Enhanced transformer with rotary position embedding. Neurocomputing, 2024

[40] [79]

GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints. arXiv preprint arXiv:2305.13245, 2023

[41] [80]

WizardMath: Empowering Mathematical Reasoning for Large Language Models via Reinforced Evol-Instruct. arXiv preprint arXiv:2308.09583, 2023

[42] [81]

Zhibin Gou, Zhihong Shao, Yeyun Gong, Yelong Shen, Yujiu Yang, Minlie Huang, Nan Duan, and Weizhu Chen. ToRA: A Tool-Integrated Reasoning Agent for Mathematical Problem Solving. CoRR, 2023. doi:10.48550/arXiv.2309.17452

[43] [82]

Wenhu Chen, Xueguang Ma, Xinyi Wang, and William W. Cohen. Program of Thoughts Prompting: Disentangling Computation from Reasoning for Numerical Reasoning Tasks. CoRR, 2022. doi:10.48550/arXiv.2211.12588

[44] [83]

Luyu Gao, Aman Madaan, Shuyan Zhou, Uri Alon, Pengfei Liu, Yiming Yang, Jamie Callan, and Graham Neubig. PAL: Program-aided language models. In International Conference on Machine Learning, 2023

[45] [84]

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed H. Chi, Quoc V. Le, and Denny Zhou. Chain-of-thought prompting elicits reasoning in large language models. In NeurIPS, 2022

[46] [85]

Swaroop Mishra, Matthew Finlayson, Pan Lu, Leonard Tang, Sean Welleck, Chitta Baral, Tanmay Rajpurohit, Oyvind Tafjord, Ashish Sabharwal, Peter Clark, and Ashwin Kalyan. Lila: A unified benchmark for mathematical reasoning. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, 2022. doi:10.18653/v1/2022.emnlp-main.392

[47] [86]

Xiang Yue, Xingwei Qu, Ge Zhang, Yao Fu, Wenhao Huang, Huan Sun, Yu Su, and Wenhu Chen. MAmmoTH: Building Math Generalist Models through Hybrid Instruction Tuning. CoRR, 2023. doi:10.48550/arXiv.2309.05653

[48] [87]

Longhui Yu, Weisen Jiang, Han Shi, Jincheng Yu, Zhengying Liu, Yu Zhang, James T. Kwok, Zhenguo Li, Adrian Weller, and Weiyang Liu. MetaMath: Bootstrap Your Own Mathematical Questions for Large Language Models. CoRR, 2023. doi:10.48550/arXiv.2309.12284

[49] [88]

Mandar Joshi, Eunsol Choi, Daniel Weld, and Luke Zettlemoyer. TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2017. doi:10.18653/v1/P17-1147

[50] [89]

Language Models are Few-Shot Learners. 2020

[51] [90]

Language models are unsupervised multitask learners. OpenAI blog

[52] [91]

OpenAI. Introducing

[53] [92]

HAI-LLM: 高效且轻量的大模型训练工具 [HAI-LLM: An efficient and lightweight training tool for large models]

[54] [93]

Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism. arXiv preprint arXiv:1909.08053, 2019

[55] [94]

Efficient large-scale language model training on GPU clusters using Megatron-LM. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis

[56] [95]

Reducing activation recomputation in large transformer models. Proceedings of Machine Learning and Systems

[57] [96]

Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. FlashAttention: Fast and memory-efficient exact attention with IO-awareness. Advances in Neural Information Processing Systems, 2022

[58] [97]

Tri Dao. FlashAttention-2: Faster attention with better parallelism and work partitioning, 2023

[59] [98]

Efficient Memory Management for Large Language Model Serving with PagedAttention. In Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles

[60] [99]

Attention is all you need. Advances in neural information processing systems

[61] [100]

ZeRO: Memory optimizations toward training trillion parameter models. In SC20: International Conference for High Performance Computing, Networking, Storage and Analysis, 2020

[62] [101]

CCPM: A Chinese Classical Poetry Matching Dataset. 2021

[63] [102]

Can a Suit of Armor Conduct Electricity? A New Dataset for Open Book Question Answering. 2018

[64] [103]

Anthropic. Introducing

[65] [104]

Google. An important next step on our

[66] [105]

Investigating Prior Knowledge for Challenging Chinese Machine Reading Comprehension. 2019

[67] [106]

Yiming Cui, Ting Liu, Wanxiang Che, Li Xiao, Zhipeng Chen, Wentao Ma, Shijin Wang, and Guoping Hu. A Span-Extraction Dataset for Chinese Machine Reading Comprehension. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (E... 2019. doi:10.18653/v1/D19-1600

[68] [107]

WinoGrande: An Adversarial Winograd Schema Challenge at Scale. 2019

[69] [108]

CMATH: Can Your Language Model Pass Chinese Elementary School Math Test? 2023

[70] [109]

Measuring Massive Multitask Language Understanding. arXiv preprint arXiv:2009.03300, 2020

[71] [110]

Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them. arXiv preprint arXiv:2210.09261, 2022

[72] [111]

Program Synthesis with Large Language Models. arXiv preprint arXiv:2108.07732, 2021

[73] [112]

Liang Xu, Hai Hu, Xuanwei Zhang, Lu Li, Chenjie Cao, Yudong Li, Yechen Xu, Kai Sun, Dian Yu, Cong Yu, Yin Tian, Qianqian Dong, Weitang Liu, Bo Shi, Yiming Cui, Junyi Li, Jun Zeng, Rongzhao Wang, Weijian Xie, Yanting Li, Yina Patterson, Zuoyu Tian, Yiwen Zhang, He Zhou, Shaoweihua Liu ... Proceedings of the 28th International Conference on Computational Linguistics, 2020. doi:10.18653/v1/2020.coling-main.419

[74] [113]

Haonan Li, Yixuan Zhang, Fajri Koto, Yifei Yang, Hai Zhao, Yeyun Gong, Nan Duan, and Timothy Baldwin. CMMLU: Measuring massive multitask language understanding in Chinese

[75] [114]

Chujie Zheng, Minlie Huang, and Aixin Sun. ChID: A large-scale Chinese IDiom dataset for cloze test. In Proceedings of the 57th Conference of the Association for Computational Linguistics, 2019. doi:10.18653/v1/P19-1075

[76] [115]

Guokun Lai, Qizhe Xie, Hanxiao Liu, Yiming Yang, and Eduard H. Hovy. RACE: Large-scale reading comprehension dataset from examinations. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, 2017. doi:10.18653/v1/D17-1082

[77] [116]

Dheeru Dua, Yizhong Wang, Pradeep Dasigi, Gabriel Stanovsky, Sameer Singh, and Matt Gardner. DROP: A reading comprehension benchmark requiring discrete reasoning over paragraphs. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2019. doi:10.18653/v1/N19-1246

[78] [117]

Yuzhen Huang, Yuzhuo Bai, Zhihao Zhu, Junlei Zhang, Jinghan Zhang, Tangjun Su, Junteng Liu, Chuancheng Lv, Yikai Zhang, Jiayi Lei, et al. C-Eval: A multi-level multi-discipline Chinese evaluation suite for foundation models

[79] [118]

Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, et al. LLaMA: Open and Efficient Foundation Language Models. arXiv preprint arXiv:2302.13971, 2023

[80] [119]

Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton Ferrer, et al. Llama 2: Open Foundation and Fine-Tuned Chat Models. 2023. doi:10.48550/arXiv.2307.09288