arxiv: 2604.07190 · v1 · submitted 2026-04-08 · 💻 cs.CY · cs.AI · cs.LG

Recognition: unknown

The ATOM Report: Measuring the Open Language Model Ecosystem

Florian Brand, Nathan Lambert

Pith reviewed 2026-05-10 17:29 UTC · model grok-4.3

keywords: open language models, adoption trends, Chinese models, Hugging Face downloads, inference market share, LLM ecosystem, model derivatives, performance benchmarks

The pith

Chinese open language models overtook U.S. models in summer 2025 and have since widened the gap.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The report assembles a broad snapshot of the roughly 1,500 leading open language models, identifying which organizations are building them and how widely they are used. It tracks a reversal in which models from Chinese groups such as Alibaba's Qwen and DeepSeek moved ahead of U.S.-built models like Meta's Llama series on multiple measures of adoption. The shift is documented through download volume on Hugging Face, the spread of model derivatives, shares of inference traffic, and benchmark results. A reader would care because these open models now serve as the shared foundation for most current research, commercial applications, and policy debates around artificial intelligence.

Core claim

Combining Hugging Face downloads, model derivatives, inference market share, and performance metrics shows that Chinese open language models overtook their U.S. counterparts in the summer of 2025 and have subsequently increased their lead over Western models.

What carries the argument

Multi-indicator adoption snapshot that aggregates download counts, derivative creation, inference usage, and benchmark scores across the set of approximately 1,500 mainline open models.
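
A minimal sketch of how one such indicator could be turned into a regional share series and scanned for a crossover month. The data, region labels, and function names are invented for illustration; the paper's actual aggregation rules are not described on this page.

```python
from typing import Dict, List, Optional


def monthly_share(values: Dict[str, List[float]]) -> Dict[str, List[float]]:
    """Convert per-region monthly totals into per-month shares that sum to 1."""
    regions = list(values)
    n_months = len(values[regions[0]])
    shares: Dict[str, List[float]] = {r: [] for r in regions}
    for m in range(n_months):
        total = sum(values[r][m] for r in regions)
        for r in regions:
            shares[r].append(values[r][m] / total if total else 0.0)
    return shares


def first_crossover(leader: List[float], rival: List[float]) -> Optional[int]:
    """Index of the first month in which `rival` moves above `leader`."""
    for m in range(1, len(leader)):
        if rival[m] > leader[m] and rival[m - 1] <= leader[m - 1]:
            return m
    return None


# Hypothetical monthly totals for a single indicator (not taken from the paper).
toy = {
    "US": [50.0, 48.0, 45.0, 40.0],
    "China": [30.0, 40.0, 50.0, 60.0],
    "Other": [20.0, 12.0, 5.0, 0.0],
}
shares = monthly_share(toy)
crossover_month = first_crossover(shares["US"], shares["China"])
```

Running the same two functions on each indicator separately would give one crossover month per indicator, which is the kind of per-indicator timing the report compares.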

If this is right

Leadership in open model development has shifted toward Chinese organizations.
Researchers and startups will increasingly build on Chinese-origin foundations.
Policy discussions must incorporate this change when addressing technology competition and access.
The open ecosystem is no longer centered on a single national source of models.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Continued growth in Chinese model usage could make future AI research and applications more dependent on non-Western infrastructure.
The same multi-metric tracking approach could be applied to other model categories such as vision or multimodal systems to detect similar geographic shifts.
These adoption figures may influence decisions on international standards or collaboration rules for open AI tools.

Load-bearing premise

The chosen combination of download counts, derivative activity, inference share, and performance scores accurately reflects real-world adoption and influence without systematic bias.

What would settle it

Independent usage data from sources outside Hugging Face and the listed metrics showing that U.S. models retained majority adoption or influence after summer 2025 would undermine the overtake claim.

Figures

Figures reproduced from arXiv: 2604.07190 by Florian Brand, Nathan Lambert.

**Figure 2.** Distribution of our tracked open model downloads by parameter count. The 7–9B range ...

**Figure 3.** Share of new model derivatives by region per month. EU share fell from 58% in January ...

**Figure 4.** Open model inference token share by region (OpenRouter). China and the US completely ...

**Figure 5.** Arena Elo ratings for top open models by region. China surpassed the US in December ...

**Figure 6.** Artificial Analysis Overall Intelligence Index by region. China’s best open model score ...

**Figure 7.** Cumulative downloads for leading open model families. Qwen surpassed Llama in ...

**Figure 8.** Monthly share of new model derivatives by organization. Qwen’s derivative share reached ...

**Figure 9.** Open model inference token share by organization (OpenRouter). Meta fell from a 37.4% ...

**Figure 10.** Monthly Qwen downloads compared to combined downloads from all other major ...

**Figure 11.** Monthly downloads for selected Qwen 3 models, including their quantized variants ...

**Figure 12.** Download share by model size for DeepSeek vs. Qwen. DeepSeek dominates the 250B+ ...

**Figure 13.** Cumulative downloads from new American open model entrants since August 2025. By ...

**Figure 14.** Cumulative downloads since July 2025 for organizations competing with DeepSeek. ...

**Figure 15.** RAM trajectories for notable recent model releases. Each line shows the RAM multiplier ...

**Figure 16.** RAM trajectories for small and medium models (1–5B and 10–50B buckets). DeepSeek ...

**Figure 17.** RAM reference curves showing median cumulative downloads (with IQR) for the top ...

Original abstract

We present a comprehensive adoption snapshot of the leading open language models and who is building them, focusing on the ~1.5K mainline open models from the likes of Alibaba's Qwen, DeepSeek, Meta's Llama, that are the foundation of an ecosystem crucial to researchers, entrepreneurs, and policy advisors. We document a clear trend where Chinese models overtook their counterparts built in the U.S. in the summer of 2025 and subsequently widened the gap over their western counterparts. We study a mix of Hugging Face downloads and model derivatives, inference market share, performance metrics and more to make a comprehensive picture of the ecosystem.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The report flags a 2025 crossover where Chinese open models pulled ahead on HF downloads and related signals, but the HF-centric sampling leaves the timing claim vulnerable to regional platform differences.

The key point for you is that this measurement report documents Chinese open models overtaking US ones around summer 2025 across downloads, derivatives, inference share, and benchmarks, with the gap then widening. That specific timing and multi-signal snapshot is the new empirical content here, even if the general idea of tracking open-model activity on public hubs is not novel. The authors compile a useful overview of the ~1.5k mainline models from players like Qwen, DeepSeek, and Llama, which gives a quick picture of who is building what in the open ecosystem. That compilation is the part that actually adds value for someone scanning trends. The work is straightforward in its approach and sticks to observable public data rather than trying to derive new theory.

The soft spots are mostly around verification and coverage. The central overtake claim rests on a composite that includes heavy Hugging Face usage, yet the abstract and available details give no indication of adjustments for Chinese domestic platforms like ModelScope or local clouds that might not route through HF. Without data-cleaning rules, selection criteria for the 1.5k sample, or cross-validation against non-HF sources, it is hard to know how much the summer 2025 date reflects real adoption versus platform preference. Inference market share measurement is also left opaque. These are not fatal gaps for a snapshot report, but they are load-bearing for the temporal claim.

This is the sort of piece that could fit a reading group focused on AI ecosystem tracking or policy inputs, especially if the full methods section fills in the blanks. I would not cite the numbers yet without seeing those details, but the paper is coherent enough on its own terms to deserve a serious referee who can press on the sampling and measurement choices.

Referee Report

3 major / 2 minor

Summary. The manuscript presents a comprehensive adoption snapshot of the leading open language model ecosystem, focusing on the ~1.5K mainline open models from developers including Alibaba's Qwen, DeepSeek, and Meta's Llama. It documents a clear trend in which Chinese models overtook their U.S. counterparts in summer 2025 and subsequently widened the gap, derived from a composite of Hugging Face downloads, model derivatives, inference market share, performance metrics, and related indicators.

Significance. If the reported temporal overtake and gap-widening hold after methodological clarification, the work supplies timely empirical data on shifting global open-model influence with direct relevance to AI policy, research prioritization, and ecosystem monitoring. The multi-indicator approach is a positive feature for a measurement study, though its value hinges on transparent handling of sampling and regional biases.

major comments (3)

[Abstract] The central claim of Chinese models overtaking U.S. models in summer 2025 is presented without any description of data-cleaning rules, inclusion criteria for the ~1.5K model sample, or adjustments for selection bias; these omissions make the trend unverifiable from the provided information.
[Abstract, inference market share component] No details are given on how inference market share was measured or whether the metric incorporates usage outside Hugging Face (e.g., ModelScope, Chinese cloud platforms, or local deployments); without such cross-validation the overtake timing and subsequent widening are sensitive to platform-specific sampling bias.
[Abstract, composite metrics] The paper states that a 'mix' of downloads, derivatives, inference share, and performance metrics yields the comprehensive picture, yet provides no information on weighting, aggregation rules, or robustness checks; this is load-bearing for interpreting the claimed gap-widening.

minor comments (2)

[Abstract] The qualifier 'and more' is imprecise; enumerate the full set of indicators used to construct the ecosystem picture.
[Throughout] Terminology: 'Western counterparts' and 'U.S. models' appear to be used interchangeably; adopt consistent geographic labeling throughout.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their thoughtful and constructive comments. We agree that greater transparency is needed in the abstract regarding methodology and have revised the manuscript accordingly to strengthen verifiability while preserving the high-level summary. We respond to each major comment below.

Referee: [Abstract] The central claim of Chinese models overtaking U.S. models in summer 2025 is presented without any description of data-cleaning rules, inclusion criteria for the ~1.5K model sample, or adjustments for selection bias; these omissions make the trend unverifiable from the provided information.

Authors: The full manuscript details the sample construction in Section 2 (Data Sources and Scope), including explicit inclusion criteria focused on mainline models from leading developers (Qwen, DeepSeek, Llama and similar), exclusion of minor fine-tunes and duplicates, and basic data-cleaning steps such as removing inactive repositories. Potential selection biases (e.g., toward English-language metadata on Hugging Face) are discussed in the limitations subsection. To address the abstract-level concern, we have added a concise clause referencing the ~1.5K mainline model sample and directing readers to the methods for full criteria and cleaning rules.
revision: yes

Referee: [Abstract, inference market share component] No details are given on how inference market share was measured or whether the metric incorporates usage outside Hugging Face (e.g., ModelScope, Chinese cloud platforms, or local deployments); without such cross-validation the overtake timing and subsequent widening are sensitive to platform-specific sampling bias.

Authors: Inference share is derived from OpenRouter token-share data (Figures 4 and 9); Hugging Face download volume is reported as a separate indicator of open-model adoption. We acknowledge that the inference metric does not incorporate usage on ModelScope, Chinese cloud providers, or fully local deployments, which could understate Chinese model activity. We have expanded the limitations section to explicitly note this platform-specific bias and have added a robustness note that the overtake trend appears consistently across the other non-inference indicators (downloads, derivatives, benchmarks). Full cross-validation with non-public platform data is not possible with publicly available sources.
revision: partial

Referee: [Abstract, composite metrics] The paper states that a 'mix' of downloads, derivatives, inference share, and performance metrics yields the comprehensive picture, yet provides no information on weighting, aggregation rules, or robustness checks; this is load-bearing for interpreting the claimed gap-widening.

Authors: The composite view is a qualitative convergence of independent indicators rather than a single weighted index; each metric is reported separately in the results section, and the gap-widening claim rests on the alignment of trends across them. We have revised the abstract and added an explicit paragraph in Section 4 describing this narrative-synthesis approach, along with per-indicator robustness plots that confirm the overtake timing holds when any single metric is removed. No formal weighting scheme is used, which we now state clearly to avoid implying quantitative precision.
revision: yes
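
The leave-one-out claim in the last response can be read as a check on per-indicator crossover months. The sketch below is an assumption about what such a check might look like, not the authors' procedure: the indicator names, the month indices, and the `window` tolerance are all hypothetical.

```python
from statistics import median
from typing import Dict, Tuple


def leave_one_out_medians(crossovers: Dict[str, int]) -> Dict[str, float]:
    """For each indicator, the median crossover month of the remaining indicators."""
    return {
        dropped: median(v for k, v in crossovers.items() if k != dropped)
        for dropped in crossovers
    }


def timing_is_stable(crossovers: Dict[str, int], window: Tuple[int, int]) -> bool:
    """True if every leave-one-out median falls inside the inclusive month window."""
    lo, hi = window
    return all(lo <= m <= hi for m in leave_one_out_medians(crossovers).values())


# Made-up crossover months as indices on an arbitrary monthly axis (not paper data).
toy_crossovers = {
    "downloads": 6,
    "derivatives": 5,
    "inference_share": 7,
    "benchmarks": 11,
}
loo = leave_one_out_medians(toy_crossovers)
stable = timing_is_stable(toy_crossovers, window=(5, 7))
```

The median is used here only because it tolerates one late indicator; with no formal weighting in the paper, any summary across indicators is a design choice of this sketch.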

Circularity Check

0 steps flagged

No circularity: empirical measurement report with no derivation chain

The paper is a data-driven measurement report that aggregates Hugging Face downloads, model derivatives, inference market share, and performance metrics to document adoption trends. It contains no equations, fitted parameters presented as predictions, self-definitional constructs, or load-bearing self-citations that reduce claims to inputs by construction. The central observation (Chinese models overtaking US models in summer 2025) is an empirical snapshot, not a derived result that loops back to its own definitions or prior author work. No steps match the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No mathematical axioms, free parameters, or invented entities are invoked; the report rests on empirical data aggregation whose collection rules are not detailed in the abstract.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

DeGenTWeb: A First Look at LLM-dominant Websites
cs.NI · 2026-04 · unverdicted · novelty 5.0

DeGenTWeb shows LLM-dominant websites are common and increasing in Common Crawl and Bing search results, but accurate detection is getting harder with newer models.

Reference graph

Works this paper leans on

36 extracted references · 1 canonical work page · cited by 1 Pith paper

[1] LMSYS Org blog post, published August 29, 2024.
Shayne Longpre et al. The data provenance initiative: A large scale audit of dataset licensing & attribution in AI. arXiv preprint arXiv:2310.16787, 2023. URL https://arxiv.org/abs/2310.16787.
Shayne Longpre et al. Consent in crisis: The rapid decline of the AI data commons. arXiv preprint arXiv:2407.14933, 2...

[2]–[36] Ranked model lists by parameter-size bucket, with the source's value in parentheses.

[bucket label missing]
1. Qwen3-0.6B (72.8M)
2. Qwen2.5-0.5B-Instruct (32.3M)
3. Florence-2-large (19.4M)
4. t5gemma-b-b-prefixlm (17.5M)
5. Qwen2.5-0.5B (17.1M)
6. Qwen2.5-Coder-0.5B-Instruct (13.5M)
7. Qwen2-0.5B (10.7M)
8. SmolLM2-135M (10.5M)
9. Florence-2-base (8.8M)
10. llava-onevision-qwen2-0.5b-ov-hf (8.4M)

1-5B
1. Qwen2.5-1.5B-Instruct (150.6M)
2. Qwen2.5-VL-3B-Instruct (70.7M)
3. Qwen2.5-3B-Instruct (70.1M)
4. Llama-3.2-1B-Instruct (55.7M)
5. Llama-3.2-1B (49.2M)
6. Llama-3.2-3B-Instruct (37M)
7. gemma-3-1b-it (34.9M)
8. Qwen2-VL-2B-Instruct (30M)
9. Qwen3-4B (29.7M)
10. Qwen3-1.7B (29.2M)

7-9B
1. Llama-3.1-8B-Instruct (133M)
2. Qwen2.5-7B-Instruct (109M)
3. Mistral-7B-Instruct-v0.2 (53.5M)
4. Qwen2.5-VL-7B-Instruct (51.2M)
5. Qwen3-8B (42.5M)
6. Meta-Llama-3-8B-Instruct (38.9M)
7. Meta-Llama-3-8B (36.6M)
8. Llama-2-7b-chat-hf (29.2M)
9. Llama-2-7b-hf (28.4M)
10. falcon-7b-instruct (26.7M)

Footnote 4: Also see related, recurring work in this direction from Hugging Face directly (Ghosh et al., 2026).

The ATOM Report, April 2026 (atomproject.ai/report)

10-50B
1. gpt-oss-20b (54M)
2. Qwen2.5-14B-Instruct (33.3M)
3. Qwen3-32B (24.6M)
4. DeepSeek-R1-Distill-Qwen-32B (23M)
5. Mixtral-8x7B-Instruct-v0.1 (20M)
6. Qwen2.5-32B-Instruct (18.5M)
7. Llama-3.2-11B-Vision-Instruct (17.6M)
8. Llama-2-13b-chat-hf (15.2M)
9. Qwen3-VL-30B-A3B-Instruct (13.2M)
10. gemma-3-27b-it (12.3M)

50-100B
1. Llama-3.1-70B-Instruct (20.2M)
2. Qwen3-Next-80B-A3B-Instruct (14.6M)
3. Llama-3.3-70B-Instruct (10.3M)
4. InternVL3-78B (6.2M)
5. Meta-Llama-3-70B-Instruct (5.9M)
6. Qwen2.5-VL-72B-Instruct (5.7M)
7. Qwen2.5-72B-Instruct (5.4M)
8. Llama-2-70b-chat-hf (4.6M)
9. DeepSeek-R1-Distill-Llama-70B (4.3M)
10. Meta-Llama-3-70B (3.2M)

100-250B
1. gpt-oss-120b (29.2M)
2. Mixtral-8x22B-Instruct-v0.1 (6M)
3. Mistral-Large-Instruct-2407 (5M)
4. Mistral-Large-Instruct-2411 (4.9M)
5. Mixtral-8x22B-v0.1 (4.8M)
6. InternVL3_5-241B-A28B-Instruct (4.1M)
7. Qwen3-235B-A22B (3.4M)
8. Qwen3-VL-235B-A22B-Thinking (3.3M)
9. Qwen3-235B-A22B-Instruct-2507-FP8 (2.7M)
10. MiniMax-M2 (1.9M)

250B+
1. Llama-3.1-405B (20.3M)
2. DeepSeek-R1 (16.7M)
3. DeepSeek-V3 (14.3M)
4. DeepSeek-R1-0528 (5.9M)
5. Kimi-K2.5 (5.4M)
6. GLM-5-FP8 (4.9M)
7. DeepSeek-V3-0324 (4M)
8. Llama-3.1-405B-Instruct (3.4M)
9. Qwen3.5-397B-A17B (2.1M)
10. Kimi-K2-Instruct (1.8M)

C Additional RAM Details

The top-10 model download counts over time, used to compute the RAM scores, is shown in Figure 17. This shows that among the top few models in each size category, the median of top-10 downloads over the first 180 days is remarkably similar across buckets. The smallest models have larger...
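
The reference curve described for Figure 17 (median cumulative downloads with an IQR band over the days since release) could be computed along the following lines. This is a sketch under stated assumptions: the daily counts are invented, the function names are hypothetical, and it covers only the median and IQR curve, not the RAM score itself, whose definition does not appear on this page.

```python
from itertools import accumulate
from statistics import median, quantiles
from typing import Dict, List, Tuple


def cumulative(daily: List[float]) -> List[float]:
    """Running total of daily downloads since release."""
    return list(accumulate(daily))


def reference_curve(
    daily_by_model: Dict[str, List[float]],
) -> List[Tuple[float, float, float]]:
    """Per day since release: (median, Q1, Q3) of cumulative downloads across models."""
    curves = [cumulative(d) for d in daily_by_model.values()]
    horizon = min(len(c) for c in curves)  # compare models over a shared window
    out = []
    for day in range(horizon):
        vals = [c[day] for c in curves]
        q1, _, q3 = quantiles(vals, n=4, method="inclusive")
        out.append((median(vals), q1, q3))
    return out


# Invented daily download counts for three hypothetical models.
toy_daily = {
    "model_a": [1, 1, 1],
    "model_b": [2, 2, 2],
    "model_c": [3, 3, 3],
}
curve = reference_curve(toy_daily)
```

Truncating to the shortest history keeps every day's statistics based on the same set of models; a real analysis would need to decide how to treat models released fewer than 180 days ago.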