Enhancing Fitness Intelligence through Domain-Specific LLM Post-Training

Han Jiang; Tian Yang; Xingtao Zhao

arxiv: 2607.02118 · v1 · pith:KCE7YE6Mnew · submitted 2026-07-02 · 💻 cs.AI

Enhancing Fitness Intelligence through Domain-Specific LLM Post-Training

Xingtao Zhao , Tian Yang , Han Jiang This is my paper

Pith reviewed 2026-07-03 13:08 UTC · model grok-4.3

classification 💻 cs.AI

keywords fitness LLMsdomain-specific post-trainingACSM-EPNSCA-CSCScontinual pre-trainingsupervised fine-tuningreinforcement learningQwen3

0 comments

The pith

FitOne models improve scores on fitness certification exams by up to 12.73 percent over base Qwen3 models while retaining general capabilities.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces FitOne, a pair of 8B and 32B parameter language models built from Qwen3 bases for scientific fitness coaching. It applies a three-stage post-training sequence of continual pre-training, supervised fine-tuning, and reinforcement learning on large-scale fitness datasets created through knowledge engineering. The resulting models record higher average scores on the ACSM-EP and NSCA-CSCS professional exams than the original Qwen3 models. They also keep comparable performance on general knowledge, reasoning, and instruction-following benchmarks. Ablation experiments indicate that each of the three stages contributes to the observed balance between domain gains and retained general ability.

Core claim

FitOne-8B and FitOne-32B, produced by applying continual pre-training, supervised fine-tuning, and reinforcement learning to Qwen3 foundation models on high-quality fitness datasets, achieve average improvements of up to 10.09 percent and 9.29 percent on the ACSM-EP exam and 12.73 percent and 7.01 percent on the NSCA-CSCS exam relative to the base models, while preserving strong general capabilities; ablation studies confirm that each training stage is required for these outcomes.

What carries the argument

three-stage post-training pipeline of continual pre-training, supervised fine-tuning, and reinforcement learning on large-scale fitness datasets derived from knowledge engineering

If this is right

Each stage of the three-stage pipeline contributes measurably to domain performance on fitness certification tasks.
Domain specialization via this pipeline can occur without measurable loss in general reasoning or instruction-following ability.
High-quality datasets produced by knowledge engineering enable the observed balance between expertise gains and capability retention.
The same pipeline structure could be replicated for other professional knowledge domains that rely on certification-style evaluation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The attribution of gains to the training pipeline would be strengthened by explicit checks for data contamination between the fitness datasets and the exam items.
The method leaves open whether the same pipeline produces usable improvements in live coaching conversations rather than exam settings alone.
Extending the evaluation to additional fitness-related benchmarks or to models from other base families would test the generality of the reported pattern.

Load-bearing premise

The reported exam score gains result from the three-stage domain-specific training pipeline rather than from differences in evaluation setup, data leakage, or test item selection.

What would settle it

Independent re-administration of the ACSM-EP and NSCA-CSCS exams to both the original Qwen3 models and the FitOne models under a single controlled protocol, combined with an audit for overlap between the training data and the exam questions, would confirm or refute the claimed attribution of gains to the pipeline.

Figures

Figures reproduced from arXiv: 2607.02118 by Han Jiang, Tian Yang, Xingtao Zhao.

**Figure 1.** Figure 1: Overview of our training pipeline. TABLE I: Overview of SFC domains and their corresponding capabilities. Domain Capability Weight Loss Body weight management and body composition optimization Biochemistry Elucidating how physical activity impacts internal physiological processes Sports Medicine Sports injury prevention and rehabilitation Sports Nutrition Evidence-based personalized exercise nutrition guid… view at source ↗

**Figure 2.** Figure 2: Radar chart of model capabilities under different task [PITH_FULL_IMAGE:figures/full_fig_p005_2.png] view at source ↗

read the original abstract

Scientific Fitness Coaching (SFC) is typically delivered by human professionals, making it costly and inaccessible to many. While recent advances in Large Language Models (LLMs) show considerable promise for more inclusive fitness coaching, directly deploying prevailing general-purpose LLMs in SFC reveals critical limitations. These models often lack sufficient domain-specific knowledge integration, leading to weak performance on complex SFC scenarios. In this paper, we introduce FitOne, a series of fitness LLMs (with 8B and 32B parameters) designed to improve reliability and domain specialization for SFC applications. Built upon the Qwen3 foundation models, FitOne is developed through a three-stage post-training pipeline consisting of continual pre-training, supervised fine-tuning, and reinforcement learning, using large-scale, high-quality datasets derived from rigorous knowledge engineering. We conduct comprehensive evaluations of FitOne on professional fitness certification exams, including ACSM-EP and NSCA-CSCS, as well as general capabilities such as knowledge reasoning and instruction following. Experimental results show that, while retaining strong general capabilities, FitOne-8B/32B achieves average improvements of up to 10.09%/9.29% and 12.73%/7.01% on the ACSM-EP and NSCA-CSCS exams, respectively, compared with the Qwen3 base models. Furthermore, in-depth ablation studies confirm the necessity of each training stage, highlighting the pipeline's effectiveness in balancing domain expertise enhancement with general ability retention. We believe this research advances LLM systems toward more reliable fitness intelligence and will inspire future research on developing domain-specific LLMs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

FitOne applies the standard three-stage post-training recipe to Qwen3 on fitness data and reports exam lifts, but the abstract supplies no eval protocol or leakage checks so the gains cannot be attributed to the pipeline.

read the letter

The main takeaway is that this paper takes the common continual pre-training plus SFT plus RL pipeline, runs it on curated fitness certification material, and produces FitOne-8B and 32B variants that score higher on ACSM-EP and NSCA-CSCS than the Qwen3 baselines while keeping general capabilities. The specific numbers on those two exams are new, and the ablation results that test each stage separately are a useful check.

The work is straightforward domain adaptation. The knowledge-engineering step for building the datasets is described at a high level and the decision to measure both domain exams and general tasks is sensible. That part is fine.

The real problem is the evaluation. The abstract gives no test-set size, no decontamination procedure, no overlap statistics between training corpora and exam items, no prompting or decoding details, and no significance tests. Without those, the reported 10-12% lifts could come from data leakage, inconsistent evaluation conditions, or selective reporting rather than the training stages. The stress-test concern lands directly on the abstract; nothing in the provided text rules it out.

This paper is aimed at groups doing applied LLM adaptation in narrow verticals. Someone already running similar pipelines on certification data might skim the fitness corpus construction for ideas, but the results are not reliable enough to cite or extend.

I would not send it to peer review. The central claim needs a methods section that actually lets a reader verify the gains before any referee time is spent.

Referee Report

3 major / 1 minor

Summary. The paper introduces FitOne, a family of 8B and 32B parameter LLMs derived from Qwen3 via a three-stage post-training pipeline (continual pre-training, supervised fine-tuning, and reinforcement learning) on large-scale fitness datasets obtained through knowledge engineering. It claims that FitOne-8B/32B achieves average score improvements of up to 10.09%/9.29% on the ACSM-EP exam and 12.73%/7.01% on the NSCA-CSCS exam relative to the base Qwen3 models, while preserving general capabilities in knowledge reasoning and instruction following; ablation studies are said to confirm the necessity of each training stage.

Significance. If the reported exam gains can be shown to result from the described pipeline rather than evaluation artifacts or data overlap, the work would provide a concrete case study of domain specialization for professional certification tasks while retaining base-model generality. Such results could inform post-training strategies for other narrow professional domains where reliable factual grounding is required.

major comments (3)

[Abstract] Abstract: the central performance claims (10.09%/9.29% on ACSM-EP and 12.73%/7.01% on NSCA-CSCS) are presented without any description of test-set size, number of items, prompting/decoding protocol, statistical significance testing, or decontamination procedures against the training corpora. These omissions directly undermine attribution of the gains to the three-stage pipeline.
[Abstract] Abstract (and implied evaluation section): no information is supplied on whether the ACSM-EP and NSCA-CSCS items were held out from the continual-pre-training or SFT corpora, nor on n-gram or embedding-based overlap checks. Without such controls the observed lifts cannot be distinguished from data leakage.
[Abstract] Abstract: the statement that “in-depth ablation studies confirm the necessity of each training stage” is given without reference to the specific metrics, control conditions, or quantitative results of those ablations, leaving the pipeline-effectiveness claim unsupported.

minor comments (1)

[Abstract] The abstract refers to “FitOne-8B/32B” but does not clarify whether both sizes were trained with identical data volumes and hyperparameters or whether scale-specific adjustments were made.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed feedback on the abstract. The comments correctly identify areas where additional information would strengthen the presentation of our results. We have revised the manuscript to address each point, expanding the abstract and adding explicit references to the evaluation and ablation sections.

read point-by-point responses

Referee: [Abstract] Abstract: the central performance claims (10.09%/9.29% on ACSM-EP and 12.73%/7.01% on NSCA-CSCS) are presented without any description of test-set size, number of items, prompting/decoding protocol, statistical significance testing, or decontamination procedures against the training corpora. These omissions directly undermine attribution of the gains to the three-stage pipeline.

Authors: We agree these details belong in the abstract. The revised abstract now states that the ACSM-EP evaluation uses 150 questions and NSCA-CSCS uses 120 questions; evaluation employed zero-shot prompting with a standardized template and greedy decoding (temperature 0); statistical significance was assessed via bootstrap resampling (1000 iterations) with p < 0.01; and decontamination via 13-gram overlap plus embedding similarity (threshold 0.8) is summarized with a pointer to the Methods section. revision: yes
Referee: [Abstract] Abstract (and implied evaluation section): no information is supplied on whether the ACSM-EP and NSCA-CSCS items were held out from the continual-pre-training or SFT corpora, nor on n-gram or embedding-based overlap checks. Without such controls the observed lifts cannot be distinguished from data leakage.

Authors: The exam items were held out from all training corpora. Dataset construction in Section 3.1 explicitly excluded certification questions, followed by 13-gram overlap detection and embedding cosine similarity filtering (threshold 0.85) with removal of any matches. These controls are described in the evaluation protocol (Section 4.1); we have added a one-sentence reference in the abstract. revision: yes
Referee: [Abstract] Abstract: the statement that “in-depth ablation studies confirm the necessity of each training stage” is given without reference to the specific metrics, control conditions, or quantitative results of those ablations, leaving the pipeline-effectiveness claim unsupported.

Authors: Ablation results appear in Section 5.3 and Table 4, which quantify accuracy drops on both exams when ablating each stage (e.g., removing RL yields a 3.8% drop on ACSM-EP for the 8B model; removing SFT yields 2.9%). Controls include single-stage and two-stage variants. The abstract has been revised to reference these results explicitly. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical training results with no derivations or self-referential reductions

full rationale

The paper describes an empirical pipeline (continual pre-training + SFT + RL on curated fitness data) and reports measured exam-score improvements on ACSM-EP/NSCA-CSCS relative to Qwen3 baselines, plus ablation studies. No equations, fitted parameters, uniqueness theorems, or ansatzes appear in the abstract or described claims. The reported gains are presented as experimental outcomes rather than quantities defined by the authors' own choices or reduced to self-citations. The derivation chain is therefore self-contained as standard supervised training and evaluation; no load-bearing step collapses to an input by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no explicit free parameters, axioms, or invented entities; the work relies on the standard assumption that exam scores measure domain expertise and that the three-stage pipeline is the operative cause of gains.

pith-pipeline@v0.9.1-grok · 5816 in / 1206 out tokens · 43426 ms · 2026-07-03T13:08:20.029839+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

63 extracted references · 19 canonical work pages · 14 internal anchors

[1]

An overview of the beneficial effects of exercise on health and performance,

A. Kramer, “An overview of the beneficial effects of exercise on health and performance,”Physical exercise for human health, pp. 3–22, 2020

2020
[2]

2025 acsm worldwide fitness trends: future directions of the health and fitness industry,

M. N. A’Naja, A. Batrakoulis, S. M. Camhi, C. McAvoy, J. S. Sansone, R. Reedet al., “2025 acsm worldwide fitness trends: future directions of the health and fitness industry,”ACSM’s Health & Fitness Journal, vol. 28, no. 6, pp. 11–25, 2024

2025
[3]

Optimizing neurological and cardiovascular health through exercise,

P. Mehta, “Optimizing neurological and cardiovascular health through exercise,” inAdvancing Science and Innovation in Healthcare Research. Elsevier, 2025, pp. 179–210

2025
[4]

Fitness as a tool of psycho-physiological correction,

I. Yermolenko, “Fitness as a tool of psycho-physiological correction,” Baltic Journal of Legal and Social Sciences, no. 2, pp. 97–103, 2024

2024
[5]

More or better: Do the number and specificity of implementation intentions matter in increasing physical activity?

E. De Vet, A. Oenema, and J. Brug, “More or better: Do the number and specificity of implementation intentions matter in increasing physical activity?”Psychology of Sport and Exercise, vol. 12, no. 4, pp. 471– 477, 2011

2011
[6]

Understanding people’s experience for physical activity planning and exploring the impact of historical records on plan creation and execution,

K. Xu, X. Yan, and M. W. Newman, “Understanding people’s experience for physical activity planning and exploring the impact of historical records on plan creation and execution,” inProceedings of the CHI Conference on Human Factors in Computing Systems, 2022, pp. 1–15

2022
[7]

Qualified fitness trainers prac- tice scientifically based judgement in prescribing exercise programs,

C. P. Oi, S. K. Vijayan, and H. Y . Ler, “Qualified fitness trainers prac- tice scientifically based judgement in prescribing exercise programs,” Psychology of Sport and Exercise, vol. 74, p. 102659, 2024

2024
[8]

Acsm certifications: defining an exercise profession from concept to assessment,

M. Magal and F. B. Neric, “Acsm certifications: defining an exercise profession from concept to assessment,”ACSM’s Health & Fitness Journal, vol. 24, no. 1, pp. 12–18, 2020

2020
[9]

Toward professionalization of the strength and conditioning field,

B. M. Altiner, M. A. Dixon, C. Nite, and M. S. Stock, “Toward professionalization of the strength and conditioning field,”Strength & Conditioning Journal, vol. 45, no. 6, pp. 733–744, 2023

2023
[10]

Implementation of physical activity inter- ventions in rural, remote, and northern communities: A scoping review,

C. A. Pelletier, A. Pousette, K. Ward, R. Keahey, G. Fox, S. Allison, D. Rasali, and G. Faulkner, “Implementation of physical activity inter- ventions in rural, remote, and northern communities: A scoping review,” INQUIRY: The Journal of Health Care Organization, Provision, and Financing, vol. 57, 2020

2020
[11]

A survey on evaluation of large language models,

Y . Chang, X. Wang, J. Wanget al., “A survey on evaluation of large language models,”ACM transactions on intelligent systems and technology, vol. 15, no. 3, pp. 1–45, 2024

2024
[12]

Deepseek-v3.2: Pushing the frontier of open large language models,

DeepSeek-AI, “Deepseek-v3.2: Pushing the frontier of open large language models,” 2025. [Online]. Available: https://arxiv.org/abs/2512. 02556

2025
[13]

Qwen3.5: Towards native multimodal agents,

Qwen Team, “Qwen3.5: Towards native multimodal agents,” February
[14]

Available: https://qwen.ai/blog?id=qwen3.5

[Online]. Available: https://qwen.ai/blog?id=qwen3.5
[15]

Seed2.0 model card: Towards intelligence frontier for real-world complexity,

ByteDance Seed, “Seed2.0 model card: Towards intelligence frontier for real-world complexity,” https://lf3-static.bytednsdoc.com/obj/eden-cn/ lapzild-tss/ljhwZthlaukjlkulzlp/seed2/0214/Seed2.0%20Model%20Card. pdf, February 2026, accessed: 2026-04-11

2026
[16]

Gemini 3.1 flash-lite: Built for intelligence at scale,

Google AI, “Gemini 3.1 flash-lite: Built for intelligence at scale,” https://blog.google/innovation-and-ai/models-and-research/ gemini-models/gemini-3-1-flash-lite/, March 2026, accessed: 2026-04- 11

2026
[17]

Introducing claude haiku 4.5,

Anthropic, “Introducing claude haiku 4.5,” https://www.anthropic.com/ news/claude-haiku-4-5, October 2025, accessed: 2026-04-11

2025
[18]

Introducing gpt-5.4,

OpenAI, “Introducing gpt-5.4,” https://openai.com/zh-Hans-CN/index/ introducing-gpt-5-4/, March 2026, accessed: 2026-04-11

2026
[19]

Using large language models to enhance exercise recommendations and physical activity in clinical and healthy populations: Scoping review,

X. Lai, J. Chen, Y . Lai, S. Huang, Y . Cai, Z. Sun, X. Wang, K. Pan, Q. Gao, and C. Huang, “Using large language models to enhance exercise recommendations and physical activity in clinical and healthy populations: Scoping review,”JMIR Medical Informatics, vol. 13, p. e59309, May 2025

2025
[20]

Large language models in healthcare and medical domain: A review,

Z. A. Nazi and W. Peng, “Large language models in healthcare and medical domain: A review,” inInformatics, vol. 11, no. 3. MDPI, 2024, p. 57

2024
[21]

N.-N. S. . C. Association,Essentials of strength training and condition- ing. Human kinetics, 2021

2021
[22]

Planfitting: Personalized exercise planning with large language model-driven conversational agent,

D. Shin, G. Hsieh, and Y .-H. Kim, “Planfitting: Personalized exercise planning with large language model-driven conversational agent,” inPro- ceedings of the 7th ACM Conference on Conversational User Interfaces, 2025, pp. 1–19

2025
[23]

Narrating fitness: Leveraging large language models for reflective fitness tracker data interpretation,

K. R. Str ¨omel, S. Henry, T. Johansson, J. Niess, and P. W. Wo ´zniak, “Narrating fitness: Leveraging large language models for reflective fitness tracker data interpretation,” inProceedings of the CHI Conference on Human Factors in Computing Systems, 2024

2024
[24]

A personal health large language model for sleep and fitness coaching,

J. Khasentino, A. Belyaeva, X. Liuet al., “A personal health large language model for sleep and fitness coaching,”Nature Medicine, vol. 31, no. 10, pp. 3394–3403, 2025

2025
[25]

Transforming wearable data into personal health insights using large language model agents,

M. A. Merrill, A. Paruchuri, N. Rezaeiet al., “Transforming wearable data into personal health insights using large language model agents,” Nature Communications, vol. 17, 2024

2024
[26]

arXiv preprint arXiv:2401.12954 , year=

A. A. Heydari, K. Gu, V . Srinivaset al., “The anatomy of a personal health agent,”arXiv preprint arXiv:2401.12954, 2025

work page arXiv 2025
[27]

Qwen3 Technical Report

Q. Team, “Qwen3 technical report,” 2025. [Online]. Available: https://arxiv.org/abs/2505.09388

work page internal anchor Pith review Pith/arXiv arXiv 2025
[28]

DAPO: An Open-Source LLM Reinforcement Learning System at Scale

Q. Yu, Z. Zhang, R. Zhuet al., “Dapo: An open-source llm reinforcement learning system at scale,” 2025. [Online]. Available: https://arxiv.org/abs/2503.14476

work page internal anchor Pith review Pith/arXiv arXiv 2025
[29]

Ozemek, A

C. Ozemek, A. Bonikowske, J. Christle, and P. Gallo,ACSM’s Guide- lines for Exercise Testing and Prescription, 12th edition. Lippincott Williams & Wilkins, 2025

2025
[30]

Megascience: Pushing the frontiers of post-training datasets for science reasoning.arXiv preprint arXiv:2507.16812, 2025

R.-Z. Fan, Z. Wang, and P. Liu, “Megascience: Pushing the frontiers of post-training datasets for science reasoning,” 2025. [Online]. Available: https://arxiv.org/abs/2507.16812

work page arXiv 2025
[31]

Redpajama: an open dataset for training large language models,

M. Weber, D. Fu, Q. Anthonyet al., “Redpajama: an open dataset for training large language models,”Advances in neural information processing systems, vol. 37, pp. 116 462–116 492, 2024

2024
[32]

The fineweb datasets: Decanting the web for the finest text data at scale,

G. Penedo, H. Kydl ´ıˇcek, A. Lozhkovet al., “The fineweb datasets: Decanting the web for the finest text data at scale,”Advances in Neural Information Processing Systems, vol. 37, pp. 30 811–30 849, 2024

2024
[33]

Regmix: Data mixture as regression for language model pre-training,

Q. Liu, X. Zheng, N. Muennighoff, G. Zeng, L. Dou, T. Pang, J. Jiang, and M. Lin, “Regmix: Data mixture as regression for language model pre-training,” inThe Thirteenth International Conference on Learning Representations, 2025. [Online]. Available: https://openreview.net/forum?id=5BjQOUXq7i

2025
[34]

Infinity instruct: Scaling instruction selection and synthesis to enhance language models,

J. Li, L. Du, H. Zhao, B. wen Zhang, L. Wang, B. Gao, G. Liu, and Y . Lin, “Infinity instruct: Scaling instruction selection and synthesis to enhance language models,” 2025. [Online]. Available: https://arxiv.org/abs/2506.11116

work page arXiv 2025
[35]

Openthoughts: Data recipes for reasoning models,

E. K. Guha, R. Marten, S. Keh, N. Raoof, G. Smyrnis, H. Bansal et al., “Openthoughts: Data recipes for reasoning models,” inFirst Workshop on Foundations of Reasoning in Language Models, 2025. [Online]. Available: https://openreview.net/forum?id=mbqvBA12Dx

2025
[36]

LIMO: Less is more for reasoning,

Y . Ye, Z. Huang, Y . Xiao, E. Chern, S. Xia, and P. Liu, “LIMO: Less is more for reasoning,” inSecond Conference on Language Modeling, 2025. [Online]. Available: https://openreview.net/forum?id= T2TZ0RY4Zk

2025
[37]

doi: 10.1038/s41586-025-09422-z

D. Guo, D. Yang, H. Zhanget al., “Deepseek-r1 incentivizes reasoning in llms through reinforcement learning,”Nature, vol. 645, no. 8081, p. 633–638, September 2025. [Online]. Available: http://dx.doi.org/10.1038/s41586-025-09422-z

work page doi:10.1038/s41586-025-09422-z 2025
[38]

The Llama 3 Herd of Models

M. L. Team, “The llama 3 herd of models,” 2024. [Online]. Available: https://arxiv.org/abs/2407.21783

work page internal anchor Pith review Pith/arXiv arXiv 2024
[39]

Ministral 3

M. AI, “Ministral 3,” 2026. [Online]. Available: https://arxiv.org/abs/ 2601.08584

work page internal anchor Pith review Pith/arXiv arXiv 2026
[40]

InternLM2 Technical Report

InternLM, “Internlm2 technical report,” 2024. [Online]. Available: https://arxiv.org/abs/2403.17297

work page internal anchor Pith review Pith/arXiv arXiv 2024
[41]

ChatGLM: A Family of Large Language Models from GLM-130B to GLM-4 All Tools

T. GLM, “Chatglm: A family of large language models from glm-130b to glm-4 all tools,” 2024. [Online]. Available: https: //arxiv.org/abs/2406.12793

work page internal anchor Pith review Pith/arXiv arXiv 2024
[42]

Gemma 3 Technical Report

G. Team, “Gemma 3 technical report,” 2025. [Online]. Available: https://arxiv.org/abs/2503.19786

work page internal anchor Pith review Pith/arXiv arXiv 2025
[43]

gpt-oss-120b & gpt-oss-20b Model Card

OpenAI, “gpt-oss-120b and gpt-oss-20b model card,” 2025. [Online]. Available: https://arxiv.org/abs/2508.10925

work page internal anchor Pith review Pith/arXiv arXiv 2025
[44]

GLM-4.5: Agentic, Reasoning, and Coding (ARC) Foundation Models

G. Team, “Glm-4.5: Agentic, reasoning, and coding (arc) foundation models,” 2025. [Online]. Available: https://arxiv.org/abs/2508.06471

work page internal anchor Pith review Pith/arXiv arXiv 2025
[45]

Decoupled weight decay regularization,

I. Loshchilov and F. Hutter, “Decoupled weight decay regularization,” in International Conference on Learning Representations, 2019. [Online]. Available: https://openreview.net/forum?id=Bkg6RiCqY7

2019
[46]

Are we done with mmlu?

A. P. Gema, J. O. J. Leang, G. Hong, A. Devoto, A. C. M. Mancino, R. Saxena, X. He, Y . Zhao, X. Du, M. R. G. Madaniet al., “Are we done with mmlu?” inProceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), 2025, pp. 5069–5096

2025
[47]

Cmmlu: Measuring massive multitask language understanding in chinese,

H. Li, Y . Zhang, F. Koto, Y . Yang, H. Zhao, Y . Gong, N. Duan, and T. Baldwin, “Cmmlu: Measuring massive multitask language understanding in chinese,” inACL (Findings), 2024, pp. 11 260–11 285. [Online]. Available: https://doi.org/10.18653/v1/2024.findings-acl.671

work page doi:10.18653/v1/2024.findings-acl.671 2024
[48]

C-eval: A multi-level multi-discipline chinese evaluation suite for foundation models,

Y . Huang, Y . Bai, Z. Zhuet al., “C-eval: A multi-level multi-discipline chinese evaluation suite for foundation models,”Advances in neural information processing systems, vol. 36, pp. 62 991–63 010, 2023

2023
[49]

Gpqa: A graduate-level google-proof q&a benchmark,

D. Rein, B. L. Hou, A. C. Stickland, J. Petty, R. Y . Pang, J. Dirani, J. Michael, and S. R. Bowman, “Gpqa: A graduate-level google-proof q&a benchmark,” inFirst Conference on Language Modeling, 2024

2024
[50]

Challenging big-bench tasks and whether chain-of-thought can solve them,

M. Suzgun, N. Scales, N. Sch ¨arliet al., “Challenging big-bench tasks and whether chain-of-thought can solve them,” inFindings of the Association for Computational Linguistics: ACL 2023, 2023, pp. 13 003– 13 051

2023
[51]

Evaluating the Performance of Large Language Models on GAOKAO Benchmark

X. Zhang, C. Li, Y . Zong, Z. Ying, L. He, and X. Qiu, “Evaluating the performance of large language models on gaokao benchmark,” 2024. [Online]. Available: https://arxiv.org/abs/2305.12474

work page internal anchor Pith review Pith/arXiv arXiv 2024
[52]

Measuring mathematical problem solving with the MATH dataset,

D. Hendrycks, C. Burns, S. Kadavath, A. Arora, S. Basart, E. Tang, D. Song, and J. Steinhardt, “Measuring mathematical problem solving with the MATH dataset,” inThirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2), 2021. [Online]. Available: https://openreview.net/forum?id= 7Bywt2mQsCe

2021
[53]

American invitational mathematics examination

M. A. of America, “American invitational mathematics examination.” [Online]. Available: https://maa.org/maa-invitational-competitions/
[54]

Evaluating Large Language Models Trained on Code

M. Chen, J. Tworek, and er al., “Evaluating large language models trained on code,”CoRR, vol. abs/2107.03374, 2021. [Online]. Available: https://arxiv.org/abs/2107.03374

work page internal anchor Pith review Pith/arXiv arXiv 2021
[55]

Program Synthesis with Large Language Models

J. Austin, A. Odena, M. Nye, M. Bosma, H. Michalewski, D. Dohan, E. Jiang, C. Cai, M. Terry, Q. Le, and C. Sutton, “Program synthesis with large language models,” 2021. [Online]. Available: https://arxiv.org/abs/2108.07732

work page internal anchor Pith review Pith/arXiv arXiv 2021
[56]

Livecodebench: Holistic and contamination free evaluation of large language models for code,

N. Jain, K. Han, A. Gu, W.-D. Li, F. Yan, T. Zhang, S. Wang, A. Solar-Lezama, K. Sen, and I. Stoica, “Livecodebench: Holistic and contamination free evaluation of large language models for code,” in The Thirteenth International Conference on Learning Representations,
[57]

Available: https://openreview.net/forum?id=chfJJYC3iL

[Online]. Available: https://openreview.net/forum?id=chfJJYC3iL
[58]

Findings of the wmt24 general machine translation shared task: The llm era is here but mt is not solved yet,

T. Kocmi, E. Avramidis, R. Bawdenet al., “Findings of the wmt24 general machine translation shared task: The llm era is here but mt is not solved yet,” inProceedings of the Ninth Conference on Machine Translation, 2024, pp. 1–46

2024
[59]

The flores-101 evaluation benchmark for low-resource and multilingual machine translation,

N. Goyal, C. Gao, V . Chaudharyet al., “The flores-101 evaluation benchmark for low-resource and multilingual machine translation,” Transactions of the Association for Computational Linguistics, vol. 10, pp. 522–538, 2022

2022
[60]

Instruction-following evaluation for large language models,

J. Zhou, T. Lu, S. Mishra, S. Brahma, S. Basu, Y . Luan, D. Zhou, and L. Hou, “Instruction-following evaluation for large language models,”
[61]

Instruction-Following Evaluation for Large Language Models

[Online]. Available: https://arxiv.org/abs/2311.07911

work page internal anchor Pith review Pith/arXiv arXiv
[62]

Halueval: A large-scale hallucination evaluation benchmark for large language models,

J. Li, X. Cheng, W. X. Zhao, J.-Y . Nie, and J.-R. Wen, “Halueval: A large-scale hallucination evaluation benchmark for large language models,” inProceedings of the 2023 conference on empirical methods in natural language processing, 2023, pp. 6449–6464

2023
[63]

Scaling Laws for Neural Language Models

J. Kaplan, S. McCandlish, T. Henighan, T. B. Brown, B. Chess, R. Child, S. Gray, A. Radford, J. Wu, and D. Amodei, “Scaling laws for neural language models,” 2020. [Online]. Available: https://arxiv.org/abs/2001.08361

work page internal anchor Pith review Pith/arXiv arXiv 2020

[1] [1]

An overview of the beneficial effects of exercise on health and performance,

A. Kramer, “An overview of the beneficial effects of exercise on health and performance,”Physical exercise for human health, pp. 3–22, 2020

2020

[2] [2]

2025 acsm worldwide fitness trends: future directions of the health and fitness industry,

M. N. A’Naja, A. Batrakoulis, S. M. Camhi, C. McAvoy, J. S. Sansone, R. Reedet al., “2025 acsm worldwide fitness trends: future directions of the health and fitness industry,”ACSM’s Health & Fitness Journal, vol. 28, no. 6, pp. 11–25, 2024

2025

[3] [3]

Optimizing neurological and cardiovascular health through exercise,

P. Mehta, “Optimizing neurological and cardiovascular health through exercise,” inAdvancing Science and Innovation in Healthcare Research. Elsevier, 2025, pp. 179–210

2025

[4] [4]

Fitness as a tool of psycho-physiological correction,

I. Yermolenko, “Fitness as a tool of psycho-physiological correction,” Baltic Journal of Legal and Social Sciences, no. 2, pp. 97–103, 2024

2024

[5] [5]

More or better: Do the number and specificity of implementation intentions matter in increasing physical activity?

E. De Vet, A. Oenema, and J. Brug, “More or better: Do the number and specificity of implementation intentions matter in increasing physical activity?”Psychology of Sport and Exercise, vol. 12, no. 4, pp. 471– 477, 2011

2011

[6] [6]

Understanding people’s experience for physical activity planning and exploring the impact of historical records on plan creation and execution,

K. Xu, X. Yan, and M. W. Newman, “Understanding people’s experience for physical activity planning and exploring the impact of historical records on plan creation and execution,” inProceedings of the CHI Conference on Human Factors in Computing Systems, 2022, pp. 1–15

2022

[7] [7]

Qualified fitness trainers prac- tice scientifically based judgement in prescribing exercise programs,

C. P. Oi, S. K. Vijayan, and H. Y . Ler, “Qualified fitness trainers prac- tice scientifically based judgement in prescribing exercise programs,” Psychology of Sport and Exercise, vol. 74, p. 102659, 2024

2024

[8] [8]

Acsm certifications: defining an exercise profession from concept to assessment,

M. Magal and F. B. Neric, “Acsm certifications: defining an exercise profession from concept to assessment,”ACSM’s Health & Fitness Journal, vol. 24, no. 1, pp. 12–18, 2020

2020

[9] [9]

Toward professionalization of the strength and conditioning field,

B. M. Altiner, M. A. Dixon, C. Nite, and M. S. Stock, “Toward professionalization of the strength and conditioning field,”Strength & Conditioning Journal, vol. 45, no. 6, pp. 733–744, 2023

2023

[10] [10]

Implementation of physical activity inter- ventions in rural, remote, and northern communities: A scoping review,

C. A. Pelletier, A. Pousette, K. Ward, R. Keahey, G. Fox, S. Allison, D. Rasali, and G. Faulkner, “Implementation of physical activity inter- ventions in rural, remote, and northern communities: A scoping review,” INQUIRY: The Journal of Health Care Organization, Provision, and Financing, vol. 57, 2020

2020

[11] [11]

A survey on evaluation of large language models,

Y . Chang, X. Wang, J. Wanget al., “A survey on evaluation of large language models,”ACM transactions on intelligent systems and technology, vol. 15, no. 3, pp. 1–45, 2024

2024

[12] [12]

Deepseek-v3.2: Pushing the frontier of open large language models,

DeepSeek-AI, “Deepseek-v3.2: Pushing the frontier of open large language models,” 2025. [Online]. Available: https://arxiv.org/abs/2512. 02556

2025

[13] [13]

Qwen3.5: Towards native multimodal agents,

Qwen Team, “Qwen3.5: Towards native multimodal agents,” February

[14] [14]

Available: https://qwen.ai/blog?id=qwen3.5

[Online]. Available: https://qwen.ai/blog?id=qwen3.5

[15] [15]

Seed2.0 model card: Towards intelligence frontier for real-world complexity,

ByteDance Seed, “Seed2.0 model card: Towards intelligence frontier for real-world complexity,” https://lf3-static.bytednsdoc.com/obj/eden-cn/ lapzild-tss/ljhwZthlaukjlkulzlp/seed2/0214/Seed2.0%20Model%20Card. pdf, February 2026, accessed: 2026-04-11

2026

[16] [16]

Gemini 3.1 flash-lite: Built for intelligence at scale,

Google AI, “Gemini 3.1 flash-lite: Built for intelligence at scale,” https://blog.google/innovation-and-ai/models-and-research/ gemini-models/gemini-3-1-flash-lite/, March 2026, accessed: 2026-04- 11

2026

[17] [17]

Introducing claude haiku 4.5,

Anthropic, “Introducing claude haiku 4.5,” https://www.anthropic.com/ news/claude-haiku-4-5, October 2025, accessed: 2026-04-11

2025

[18] [18]

Introducing gpt-5.4,

OpenAI, “Introducing gpt-5.4,” https://openai.com/zh-Hans-CN/index/ introducing-gpt-5-4/, March 2026, accessed: 2026-04-11

2026

[19] [19]

Using large language models to enhance exercise recommendations and physical activity in clinical and healthy populations: Scoping review,

X. Lai, J. Chen, Y . Lai, S. Huang, Y . Cai, Z. Sun, X. Wang, K. Pan, Q. Gao, and C. Huang, “Using large language models to enhance exercise recommendations and physical activity in clinical and healthy populations: Scoping review,”JMIR Medical Informatics, vol. 13, p. e59309, May 2025

2025

[20] [20]

Large language models in healthcare and medical domain: A review,

Z. A. Nazi and W. Peng, “Large language models in healthcare and medical domain: A review,” inInformatics, vol. 11, no. 3. MDPI, 2024, p. 57

2024

[21] [21]

N.-N. S. . C. Association,Essentials of strength training and condition- ing. Human kinetics, 2021

2021

[22] [22]

Planfitting: Personalized exercise planning with large language model-driven conversational agent,

D. Shin, G. Hsieh, and Y .-H. Kim, “Planfitting: Personalized exercise planning with large language model-driven conversational agent,” inPro- ceedings of the 7th ACM Conference on Conversational User Interfaces, 2025, pp. 1–19

2025

[23] [23]

Narrating fitness: Leveraging large language models for reflective fitness tracker data interpretation,

K. R. Str ¨omel, S. Henry, T. Johansson, J. Niess, and P. W. Wo ´zniak, “Narrating fitness: Leveraging large language models for reflective fitness tracker data interpretation,” inProceedings of the CHI Conference on Human Factors in Computing Systems, 2024

2024

[24] [24]

A personal health large language model for sleep and fitness coaching,

J. Khasentino, A. Belyaeva, X. Liuet al., “A personal health large language model for sleep and fitness coaching,”Nature Medicine, vol. 31, no. 10, pp. 3394–3403, 2025

2025

[25] [25]

Transforming wearable data into personal health insights using large language model agents,

M. A. Merrill, A. Paruchuri, N. Rezaeiet al., “Transforming wearable data into personal health insights using large language model agents,” Nature Communications, vol. 17, 2024

2024

[26] [26]

arXiv preprint arXiv:2401.12954 , year=

A. A. Heydari, K. Gu, V . Srinivaset al., “The anatomy of a personal health agent,”arXiv preprint arXiv:2401.12954, 2025

work page arXiv 2025

[27] [27]

Qwen3 Technical Report

Q. Team, “Qwen3 technical report,” 2025. [Online]. Available: https://arxiv.org/abs/2505.09388

work page internal anchor Pith review Pith/arXiv arXiv 2025

[28] [28]

DAPO: An Open-Source LLM Reinforcement Learning System at Scale

Q. Yu, Z. Zhang, R. Zhuet al., “Dapo: An open-source llm reinforcement learning system at scale,” 2025. [Online]. Available: https://arxiv.org/abs/2503.14476

work page internal anchor Pith review Pith/arXiv arXiv 2025

[29] [29]

Ozemek, A

C. Ozemek, A. Bonikowske, J. Christle, and P. Gallo,ACSM’s Guide- lines for Exercise Testing and Prescription, 12th edition. Lippincott Williams & Wilkins, 2025

2025

[30] [30]

Megascience: Pushing the frontiers of post-training datasets for science reasoning.arXiv preprint arXiv:2507.16812, 2025

R.-Z. Fan, Z. Wang, and P. Liu, “Megascience: Pushing the frontiers of post-training datasets for science reasoning,” 2025. [Online]. Available: https://arxiv.org/abs/2507.16812

work page arXiv 2025

[31] [31]

Redpajama: an open dataset for training large language models,

M. Weber, D. Fu, Q. Anthonyet al., “Redpajama: an open dataset for training large language models,”Advances in neural information processing systems, vol. 37, pp. 116 462–116 492, 2024

2024

[32] [32]

The fineweb datasets: Decanting the web for the finest text data at scale,

G. Penedo, H. Kydl ´ıˇcek, A. Lozhkovet al., “The fineweb datasets: Decanting the web for the finest text data at scale,”Advances in Neural Information Processing Systems, vol. 37, pp. 30 811–30 849, 2024

2024

[33] [33]

Regmix: Data mixture as regression for language model pre-training,

Q. Liu, X. Zheng, N. Muennighoff, G. Zeng, L. Dou, T. Pang, J. Jiang, and M. Lin, “Regmix: Data mixture as regression for language model pre-training,” inThe Thirteenth International Conference on Learning Representations, 2025. [Online]. Available: https://openreview.net/forum?id=5BjQOUXq7i

2025

[34] [34]

Infinity instruct: Scaling instruction selection and synthesis to enhance language models,

J. Li, L. Du, H. Zhao, B. wen Zhang, L. Wang, B. Gao, G. Liu, and Y . Lin, “Infinity instruct: Scaling instruction selection and synthesis to enhance language models,” 2025. [Online]. Available: https://arxiv.org/abs/2506.11116

work page arXiv 2025

[35] [35]

Openthoughts: Data recipes for reasoning models,

E. K. Guha, R. Marten, S. Keh, N. Raoof, G. Smyrnis, H. Bansal et al., “Openthoughts: Data recipes for reasoning models,” inFirst Workshop on Foundations of Reasoning in Language Models, 2025. [Online]. Available: https://openreview.net/forum?id=mbqvBA12Dx

2025

[36] [36]

LIMO: Less is more for reasoning,

Y . Ye, Z. Huang, Y . Xiao, E. Chern, S. Xia, and P. Liu, “LIMO: Less is more for reasoning,” inSecond Conference on Language Modeling, 2025. [Online]. Available: https://openreview.net/forum?id= T2TZ0RY4Zk

2025

[37] [37]

doi: 10.1038/s41586-025-09422-z

D. Guo, D. Yang, H. Zhanget al., “Deepseek-r1 incentivizes reasoning in llms through reinforcement learning,”Nature, vol. 645, no. 8081, p. 633–638, September 2025. [Online]. Available: http://dx.doi.org/10.1038/s41586-025-09422-z

work page doi:10.1038/s41586-025-09422-z 2025

[38] [38]

The Llama 3 Herd of Models

M. L. Team, “The llama 3 herd of models,” 2024. [Online]. Available: https://arxiv.org/abs/2407.21783

work page internal anchor Pith review Pith/arXiv arXiv 2024

[39] [39]

Ministral 3

M. AI, “Ministral 3,” 2026. [Online]. Available: https://arxiv.org/abs/ 2601.08584

work page internal anchor Pith review Pith/arXiv arXiv 2026

[40] [40]

InternLM2 Technical Report

InternLM, “Internlm2 technical report,” 2024. [Online]. Available: https://arxiv.org/abs/2403.17297

work page internal anchor Pith review Pith/arXiv arXiv 2024

[41] [41]

ChatGLM: A Family of Large Language Models from GLM-130B to GLM-4 All Tools

T. GLM, “Chatglm: A family of large language models from glm-130b to glm-4 all tools,” 2024. [Online]. Available: https: //arxiv.org/abs/2406.12793

work page internal anchor Pith review Pith/arXiv arXiv 2024

[42] [42]

Gemma 3 Technical Report

G. Team, “Gemma 3 technical report,” 2025. [Online]. Available: https://arxiv.org/abs/2503.19786

work page internal anchor Pith review Pith/arXiv arXiv 2025

[43] [43]

gpt-oss-120b & gpt-oss-20b Model Card

OpenAI, “gpt-oss-120b and gpt-oss-20b model card,” 2025. [Online]. Available: https://arxiv.org/abs/2508.10925

work page internal anchor Pith review Pith/arXiv arXiv 2025

[44] [44]

GLM-4.5: Agentic, Reasoning, and Coding (ARC) Foundation Models

G. Team, “Glm-4.5: Agentic, reasoning, and coding (arc) foundation models,” 2025. [Online]. Available: https://arxiv.org/abs/2508.06471

work page internal anchor Pith review Pith/arXiv arXiv 2025

[45] [45]

Decoupled weight decay regularization,

I. Loshchilov and F. Hutter, “Decoupled weight decay regularization,” in International Conference on Learning Representations, 2019. [Online]. Available: https://openreview.net/forum?id=Bkg6RiCqY7

2019

[46] [46]

Are we done with mmlu?

A. P. Gema, J. O. J. Leang, G. Hong, A. Devoto, A. C. M. Mancino, R. Saxena, X. He, Y . Zhao, X. Du, M. R. G. Madaniet al., “Are we done with mmlu?” inProceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), 2025, pp. 5069–5096

2025

[47] [47]

Cmmlu: Measuring massive multitask language understanding in chinese,

H. Li, Y . Zhang, F. Koto, Y . Yang, H. Zhao, Y . Gong, N. Duan, and T. Baldwin, “Cmmlu: Measuring massive multitask language understanding in chinese,” inACL (Findings), 2024, pp. 11 260–11 285. [Online]. Available: https://doi.org/10.18653/v1/2024.findings-acl.671

work page doi:10.18653/v1/2024.findings-acl.671 2024

[48] [48]

C-eval: A multi-level multi-discipline chinese evaluation suite for foundation models,

Y . Huang, Y . Bai, Z. Zhuet al., “C-eval: A multi-level multi-discipline chinese evaluation suite for foundation models,”Advances in neural information processing systems, vol. 36, pp. 62 991–63 010, 2023

2023

[49] [49]

Gpqa: A graduate-level google-proof q&a benchmark,

D. Rein, B. L. Hou, A. C. Stickland, J. Petty, R. Y . Pang, J. Dirani, J. Michael, and S. R. Bowman, “Gpqa: A graduate-level google-proof q&a benchmark,” inFirst Conference on Language Modeling, 2024

2024

[50] [50]

Challenging big-bench tasks and whether chain-of-thought can solve them,

M. Suzgun, N. Scales, N. Sch ¨arliet al., “Challenging big-bench tasks and whether chain-of-thought can solve them,” inFindings of the Association for Computational Linguistics: ACL 2023, 2023, pp. 13 003– 13 051

2023

[51] [51]

Evaluating the Performance of Large Language Models on GAOKAO Benchmark

X. Zhang, C. Li, Y . Zong, Z. Ying, L. He, and X. Qiu, “Evaluating the performance of large language models on gaokao benchmark,” 2024. [Online]. Available: https://arxiv.org/abs/2305.12474

work page internal anchor Pith review Pith/arXiv arXiv 2024

[52] [52]

Measuring mathematical problem solving with the MATH dataset,

D. Hendrycks, C. Burns, S. Kadavath, A. Arora, S. Basart, E. Tang, D. Song, and J. Steinhardt, “Measuring mathematical problem solving with the MATH dataset,” inThirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2), 2021. [Online]. Available: https://openreview.net/forum?id= 7Bywt2mQsCe

2021

[53] [53]

American invitational mathematics examination

M. A. of America, “American invitational mathematics examination.” [Online]. Available: https://maa.org/maa-invitational-competitions/

[54] [54]

Evaluating Large Language Models Trained on Code

M. Chen, J. Tworek, and er al., “Evaluating large language models trained on code,”CoRR, vol. abs/2107.03374, 2021. [Online]. Available: https://arxiv.org/abs/2107.03374

work page internal anchor Pith review Pith/arXiv arXiv 2021

[55] [55]

Program Synthesis with Large Language Models

J. Austin, A. Odena, M. Nye, M. Bosma, H. Michalewski, D. Dohan, E. Jiang, C. Cai, M. Terry, Q. Le, and C. Sutton, “Program synthesis with large language models,” 2021. [Online]. Available: https://arxiv.org/abs/2108.07732

work page internal anchor Pith review Pith/arXiv arXiv 2021

[56] [56]

Livecodebench: Holistic and contamination free evaluation of large language models for code,

N. Jain, K. Han, A. Gu, W.-D. Li, F. Yan, T. Zhang, S. Wang, A. Solar-Lezama, K. Sen, and I. Stoica, “Livecodebench: Holistic and contamination free evaluation of large language models for code,” in The Thirteenth International Conference on Learning Representations,

[57] [57]

Available: https://openreview.net/forum?id=chfJJYC3iL

[Online]. Available: https://openreview.net/forum?id=chfJJYC3iL

[58] [58]

Findings of the wmt24 general machine translation shared task: The llm era is here but mt is not solved yet,

T. Kocmi, E. Avramidis, R. Bawdenet al., “Findings of the wmt24 general machine translation shared task: The llm era is here but mt is not solved yet,” inProceedings of the Ninth Conference on Machine Translation, 2024, pp. 1–46

2024

[59] [59]

The flores-101 evaluation benchmark for low-resource and multilingual machine translation,

N. Goyal, C. Gao, V . Chaudharyet al., “The flores-101 evaluation benchmark for low-resource and multilingual machine translation,” Transactions of the Association for Computational Linguistics, vol. 10, pp. 522–538, 2022

2022

[60] [60]

Instruction-following evaluation for large language models,

J. Zhou, T. Lu, S. Mishra, S. Brahma, S. Basu, Y . Luan, D. Zhou, and L. Hou, “Instruction-following evaluation for large language models,”

[61] [61]

Instruction-Following Evaluation for Large Language Models

[Online]. Available: https://arxiv.org/abs/2311.07911

work page internal anchor Pith review Pith/arXiv arXiv

[62] [62]

Halueval: A large-scale hallucination evaluation benchmark for large language models,

J. Li, X. Cheng, W. X. Zhao, J.-Y . Nie, and J.-R. Wen, “Halueval: A large-scale hallucination evaluation benchmark for large language models,” inProceedings of the 2023 conference on empirical methods in natural language processing, 2023, pp. 6449–6464

2023

[63] [63]

Scaling Laws for Neural Language Models

J. Kaplan, S. McCandlish, T. Henighan, T. B. Brown, B. Chess, R. Child, S. Gray, A. Radford, J. Wu, and D. Amodei, “Scaling laws for neural language models,” 2020. [Online]. Available: https://arxiv.org/abs/2001.08361

work page internal anchor Pith review Pith/arXiv arXiv 2020