Token Arena: A Continuous Benchmark Unifying Energy and Cognition in AI Inference
Pith reviewed 2026-05-09 20:10 UTC · model grok-4.3
The pith
The same AI model can differ by up to 12.5 accuracy points and a factor of 6.2 in modeled energy per correct answer depending on which endpoint serves it.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Across 78 endpoints serving 12 model families, the same model on different endpoints differs in mean accuracy by up to 12.5 points on math and code, in fingerprint similarity to first party by up to 12 points, in tail latency by an order of magnitude, and in modeled joules per correct answer by a factor of 6.2. Workload-aware blended pricing reorders the leaderboard substantially: 7 of 10 top-ranked endpoints under the chat preset (3:1 input:output) fall out of the top 10 under the retrieval-augmented preset (20:1), and the reasoning preset (1:5) elevates frontier closed models that the chat preset penalizes on price.
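The leaderboard reordering follows mechanically from how a blended price weights input and output tokens under each preset. A minimal sketch (endpoint names and per-token prices are invented for illustration; the paper's exact weighting formula is not reproduced here):

```python
# Blended price per million tokens for a workload with a given
# input:output token ratio. Prices below are illustrative, not real quotes.

def blended_price(input_price, output_price, ratio_in, ratio_out):
    """Workload-blended $/Mtok: token-weighted mean of input and output prices."""
    total = ratio_in + ratio_out
    return (ratio_in * input_price + ratio_out * output_price) / total

# Two hypothetical endpoints: one cheap on input tokens, one cheap on output.
endpoints = {
    "ep_cheap_input":  {"in": 0.5, "out": 8.0},   # $/Mtok
    "ep_cheap_output": {"in": 2.0, "out": 3.0},
}

# The three presets from the paper: chat 3:1, retrieval 20:1, reasoning 1:5.
presets = {"chat": (3, 1), "rag": (20, 1), "reasoning": (1, 5)}

for preset, (r_in, r_out) in presets.items():
    ranked = sorted(
        endpoints,
        key=lambda e: blended_price(endpoints[e]["in"], endpoints[e]["out"],
                                    r_in, r_out),
    )
    print(preset, ranked)
```

Under these invented prices the input-cheap endpoint wins the 20:1 retrieval preset but loses both the chat and reasoning presets, mirroring the kind of reordering the paper reports.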
What carries the argument
TokenArena, a continuous benchmark that measures live endpoints on five axes (output speed, time to first token, workload-blended price, effective context, quality) plus modeled energy and collapses them into composite scores of joules per correct answer, dollars per correct answer, and endpoint fidelity.
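The two cost composites named above are simple ratios over a run's totals; a minimal sketch with invented numbers (the paper's exact aggregation and field names are not specified in this summary):

```python
# Composite scores as ratios over one evaluation run. All values invented.

def joules_per_correct(total_joules, n_correct):
    """Modeled energy spent per correct answer."""
    return total_joules / n_correct

def dollars_per_correct(total_dollars, n_correct):
    """Dollars spent per correct answer."""
    return total_dollars / n_correct

# Hypothetical measurement record for one endpoint over an eval run.
run = {"modeled_joules": 5.4e5, "spend_usd": 1.8, "correct": 120}

print(joules_per_correct(run["modeled_joules"], run["correct"]))  # 4500.0
print(dollars_per_correct(run["spend_usd"], run["correct"]))
```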
If this is right
- Endpoints serving the same model are not interchangeable in quality, speed, or efficiency.
- Leaderboard position depends on the input-to-output ratio of the target workload.
- Continuous re-evaluation is required because endpoint behavior can shift over time.
- Open release of the probe, harness, and schema allows independent replication and extension.
Where Pith is reading between the lines
- If the energy models hold, selection by joules per correct answer could guide lower-carbon deployments.
- The fidelity metric may expose differences in quantization or decoding that affect downstream reliability.
- Workload-specific rankings suggest providers could offer tiered endpoints optimized for chat versus retrieval use cases.
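On the fidelity point, the summary does not define the similarity measure, so this is only a guess at its general shape: comparing next-token distributions from a third-party endpoint against the first-party reference, here via one minus total variation distance (all names and probabilities are invented):

```python
def distribution_overlap(p, q):
    """Overlap of two token distributions: 1 minus total variation distance."""
    vocab = set(p) | set(q)
    tv = 0.5 * sum(abs(p.get(t, 0.0) - q.get(t, 0.0)) for t in vocab)
    return 1.0 - tv

# Hypothetical next-token distributions at one probe position.
first_party = {"yes": 0.70, "no": 0.20, "maybe": 0.10}
third_party = {"yes": 0.55, "no": 0.30, "maybe": 0.15}

print(round(distribution_overlap(first_party, third_party), 3))  # 0.85
```

A quantized or differently decoded serving stack would show up here as a lower overlap with the first-party reference.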
Load-bearing premise
Modeled energy numbers match real power draw and the quality scores on live endpoints give a fair, comparable measure of correct answers across providers.
What would settle it
Direct wattage measurements taken while running the benchmark prompts on a subset of the 78 endpoints to test whether actual energy per correct answer matches the modeled values within the observed factor-of-6.2 spread.
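The check described above reduces to integrating a power trace over the benchmark run and dividing by the number of correct answers. A minimal sketch (sampling rate, trace values, and counts are invented):

```python
def measured_joules(power_watts, dt_seconds):
    """Integrate a power trace (W) sampled every dt seconds into joules."""
    return sum(power_watts) * dt_seconds

# Hypothetical 1 Hz wattage trace during a run that produced 4 correct answers.
trace = [310.0, 420.0, 415.0, 390.0]   # watts
modeled_j_per_correct = 400.0          # invented modeled value for comparison

measured = measured_joules(trace, 1.0) / 4   # 383.75 J per correct answer
print(measured, measured / modeled_j_per_correct)
```

Repeating this on a metered subset of endpoints and comparing the measured/modeled ratio against the reported 6.2x spread is exactly the settling experiment proposed above.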
Original abstract
Public inference benchmarks compare AI systems at the model and provider level, but the unit at which deployment decisions are actually made is the endpoint: the (provider, model, stock-keeping-unit) tuple at which a specific quantization, decoding strategy, region, and serving stack is exposed. We introduce TokenArena, a continuous benchmark that measures inference at endpoint granularity along five core axes (output speed, time to first token, workload-blended price, effective context, and quality on the live endpoint) and synthesizes them, together with a modeled energy estimate, into three headline composites: joules per correct answer, dollars per correct answer, and endpoint fidelity (output-distribution similarity to a first-party reference). The framework's novelty is empirical and methodological. Across 78 endpoints serving 12 model families, the same model on different endpoints differs in mean accuracy by up to 12.5 points on math and code, in fingerprint similarity to first party by up to 12 points, in tail latency by an order of magnitude, and in modeled joules per correct answer by a factor of 6.2. We further show that workload-aware blended pricing reorders the leaderboard substantially: 7 of 10 top-ranked endpoints under the chat preset (3:1 input:output) fall out of the top 10 under the retrieval-augmented preset (20:1), and the reasoning preset (1:5) elevates frontier closed models that the chat preset penalizes on price. We release the framework, schema, probe and eval harness, and a v1.0 leaderboard snapshot under CC BY 4.0. TokenArena is a methodology, not a single ranking; we publish full provenance and limitations and welcome external replication.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces TokenArena, a continuous benchmark for AI inference at endpoint granularity. It evaluates 78 endpoints across 12 model families on output speed, time to first token, workload-blended price, effective context, live-endpoint quality, and a modeled energy estimate, synthesizing them into three headline composites: joules per correct answer, dollars per correct answer, and endpoint fidelity (output-distribution similarity to a first-party reference). Key empirical findings include variations of up to 12.5 points in mean accuracy on math/code tasks, 12 points in fingerprint similarity, an order of magnitude in tail latency, and a factor of 6.2 in modeled joules per correct answer; workload-aware blended pricing is shown to reorder leaderboards substantially across chat, retrieval-augmented, and reasoning presets. The framework, schema, probe, eval harness, and v1.0 snapshot are released under CC BY 4.0.
Significance. If the central measurements hold, the work provides a deployment-relevant unification of energy and cognitive metrics at the actual unit of choice (endpoints rather than models), with clear practical implications for how pricing and workload presets affect rankings. The empirical scope across 78 endpoints and the public release of artifacts, full provenance, and limitations are notable strengths that enable replication and extension. The demonstration that 7 of 10 top endpoints under one preset fall out under another underscores the value of workload-aware evaluation.
Major comments (2)
- [Abstract and §3] Abstract and §3 (Energy Modeling): The headline claim of up to 6.2× variation in modeled joules per correct answer (one of the three core composites driving energy-based conclusions) rests on an energy estimate described only as 'modeled' with no reported calibration, hardware-specific power curves, direct metering on the evaluated endpoints, or error analysis. This is load-bearing for the joules metric and any energy-related reordering or efficiency claims.
- [§4] §4 (Empirical Results): The reported quality metrics (mean accuracy differences up to 12.5 points and fingerprint similarity up to 12 points) on live endpoints are presented without error bars, statistical significance tests, or explicit discussion of cross-provider comparability of 'correct answers,' which directly affects the reliability of the variation claims and the fidelity composite.
Minor comments (2)
- [§2] §2 (Related Work): A brief comparison table with prior inference benchmarks (e.g., on latency or cost) would clarify the incremental contribution of endpoint granularity and the energy axis.
- [§5] Figure captions and §5 (Leaderboard Analysis): Ensure all workload presets (chat 3:1, retrieval 20:1, reasoning 1:5) are defined with explicit token ratios in the main text for readers to replicate the reordering results.
Simulated Author's Rebuttal
We thank the referee for the constructive comments, which help clarify the presentation of our energy modeling and empirical results. We address each major point below and will incorporate revisions to strengthen the manuscript.
Point-by-point responses
Referee: [Abstract and §3] The headline claim of up to 6.2× variation in modeled joules per correct answer rests on an energy estimate described only as 'modeled' with no reported calibration, hardware-specific power curves, direct metering on the evaluated endpoints, or error analysis. This is load-bearing for the joules metric and any energy-related reordering or efficiency claims.
Authors: We acknowledge the need for greater transparency in the energy modeling. The estimates rely on a model derived from published hardware TDP values, typical inference power curves from vendor documentation, and workload-specific utilization factors rather than direct metering, as live cloud endpoints do not expose hardware-level power data. We will expand §3 with a full description of the model equations, data sources for power curves, and an explicit error analysis (including sensitivity to utilization assumptions and bounds on the reported 6.2× factor). We will also add caveats in the abstract and §5 noting that these are modeled rather than metered values and discuss implications for the joules-per-correct-answer composite.
Revision: yes
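The rebuttal describes the energy model as built from TDP values, vendor power curves, and utilization factors. A sketch of one plausible shape for such a model (all parameter values are placeholders, not the paper's fitted coefficients):

```python
def modeled_joules(tokens_out, tokens_per_second, gpu_count, tdp_watts,
                   utilization):
    """Energy ~ generation time x aggregate board power x utilization factor.

    A plausible TDP-based form, not the paper's actual equations.
    """
    seconds = tokens_out / tokens_per_second
    return seconds * gpu_count * tdp_watts * utilization

# Hypothetical endpoint: 8 GPUs at 700 W TDP, 60% utilization, 100 tok/s.
e = modeled_joules(tokens_out=500, tokens_per_second=100.0,
                   gpu_count=8, tdp_watts=700.0, utilization=0.6)
print(e)  # 16800.0
```

The sensitivity analysis the authors promise would amount to sweeping `utilization` (and the power-curve inputs) and reporting how the 6.2x spread moves.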
Referee: [§4] The reported quality metrics (mean accuracy differences up to 12.5 points and fingerprint similarity up to 12 points) on live endpoints are presented without error bars, statistical significance tests, or explicit discussion of cross-provider comparability of 'correct answers,' which directly affects the reliability of the variation claims and the fidelity composite.
Authors: We agree that statistical support and comparability details will improve interpretability. In the revision we will add error bars (standard error across prompt sets) to the accuracy and similarity figures in §4, report statistical significance tests (e.g., paired t-tests on per-endpoint scores), and include a new paragraph discussing cross-provider comparability of correct answers. This will cover standardization via our open eval harness, tokenization differences, and any provider-specific output constraints, while noting how these affect the fidelity metric.
Revision: yes
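The statistics promised here are standard; a self-contained sketch of the paired per-endpoint comparison the authors mention, using only the standard library (the accuracy scores are invented):

```python
import math

def paired_t(xs, ys):
    """Paired t statistic and standard error of the mean difference."""
    diffs = [x - y for x, y in zip(xs, ys)]
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)  # sample variance
    sem = math.sqrt(var / n)                             # standard error
    return mean / sem, sem

# Hypothetical per-prompt-set accuracies for the same model on two endpoints.
endpoint_a = [0.82, 0.79, 0.85, 0.80, 0.83]
endpoint_b = [0.74, 0.71, 0.78, 0.70, 0.72]

t, sem = paired_t(endpoint_a, endpoint_b)
print(t, sem)
```

With n prompt sets, the t statistic has n-1 degrees of freedom; error bars in the figures would be the `sem` term per endpoint pair.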
Circularity Check
Empirical benchmark with no self-referential derivation or fitted predictions
Full rationale
The paper introduces TokenArena as a new continuous benchmark that collects direct empirical measurements of latency, accuracy, price, fidelity, and a modeled energy estimate across 78 live endpoints. Headline results (e.g., up to 12.5-point accuracy differences, 6.2× modeled joules variation, and leaderboard reordering under different workloads) are presented as observations from these measurements rather than derived quantities. No equations, self-citations, or ansatzes are shown that reduce any composite metric to its inputs by construction; the energy component is explicitly labeled as modeled without any claim that it is fitted to the same data it evaluates. The work is self-contained as a methodology with released artifacts, satisfying the default expectation of no significant circularity.
Axiom & Free-Parameter Ledger
Free parameters (1)
- energy model coefficients
Axioms (1)
- Domain assumption: quality metrics on live endpoints provide a fair proxy for 'correct answers' across heterogeneous providers and stacks.
Reference graph
Works this paper leans on
[1] Artificial Analysis. AA-LCR: Artificial Analysis Long-Context Reasoning Benchmark. https://artificialanalysis.ai/methodology/long-context, 2025.
[2] M. Chen et al. Evaluating Large Language Models Trained on Code. arXiv preprint arXiv:2107.03374, 2021.
[3] W.-L. Chiang et al. Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference. ICML, 2024.
[4] Electricity Maps. Real-time electricity carbon-intensity API. https://www.electricitymaps.com, 2024.
[5] Helicone. Open-Source LLM Observability Dashboard. https://helicone.ai, 2024.
[6] D. Hendrycks, C. Burns, S. Kadavath, A. Arora, S. Basart, E. Tang, D. Song, and J. Steinhardt. Measuring Mathematical Problem Solving with the MATH Dataset. NeurIPS Datasets and Benchmarks, 2021.
[7] D. Hendrycks, C. Burns, S. Basart, A. Zou, M. Mazeika, D. Song, and J. Steinhardt. Measuring Massive Multitask Language Understanding. ICLR, 2021.
[8] C.-P. Hsieh, S. Sun, S. Kriman, S. Acharya, D. Rekesh, F. Jia, Y. Zhang, and B. Ginsburg. RULER: What's the Real Context Size of Your Long-Context Language Models? arXiv preprint arXiv:2404.06654, 2024.
[9] N. Jain, K. Han, A. Gu, W.-D. Li, F. Yan, T. Zhang, S. Wang, A. Solar-Lezama, K. Sen, and I. Stoica. LiveCodeBench: Holistic and Contamination-Free Evaluation of Large Language Models for Code. arXiv preprint arXiv:2403.07974, 2024.
[10] C. E. Jimenez, J. Yang, A. Wettig, S. Yao, K. Pei, O. Press, and K. Narasimhan. SWE-bench: Can Language Models Resolve Real-World GitHub Issues? ICLR, 2024.
[11] P. Liang et al. Holistic Evaluation of Language Models. Annals of the New York Academy of Sciences, 2023.
[12] X. Liu et al. AgentBench: Evaluating LLMs as Agents. ICLR, 2024.
[13] A. S. Luccioni, S. Viguier, and A.-L. Ligozat. Estimating the Carbon Footprint of BLOOM, a 176B Parameter Language Model. Journal of Machine Learning Research, 24(253):1–15, 2023.
[14] G. Mialon et al. GAIA: A Benchmark for General AI Assistants. arXiv preprint arXiv:2311.12983, 2023.
[15] D. Patterson, J. Gonzalez, Q. Le, C. Liang, L.-M. Munguia, D. Rothchild, D. So, M. Texier, and J. Dean. Carbon Emissions and Large Neural Network Training. arXiv preprint arXiv:2104.10350, 2021.
[16] I. D. Raji, E. M. Bender, A. Paullada, E. Denton, and A. Hanna. AI and the Everything in the Whole Wide World Benchmark. NeurIPS, 2021.
[17] V. J. Reddi et al. MLPerf Inference Benchmark. ISCA, 2020.
[18] D. Rein, B. L. Hou, A. C. Stickland, J. Petty, R. Y. Pang, J. Dirani, J. Michael, and S. R. Bowman. GPQA: A Graduate-Level Google-Proof Q&A Benchmark. arXiv preprint arXiv:2311.12022, 2023.
[19] C. Snell, J. Lee, K. Xu, and A. Kumar. Scaling LLM Test-Time Compute Optimally Can Be More Effective than Scaling Model Parameters. arXiv preprint arXiv:2408.03314, 2024.
[20] J. Wei, X. Wang, D. Schuurmans, M. Bosma, B. Ichter, F. Xia, E. Chi, Q. Le, and D. Zhou. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. NeurIPS, 2022.
[21] S. Yao et al. τ-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains. arXiv preprint arXiv:2406.12045, 2024.
[22] L. Zheng et al. Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. NeurIPS, 2023.
[23] J. Zhou, T. Lu, S. Mishra, S. Brahma, S. Basu, Y. Luan, D. Zhou, and L. Hou. Instruction-Following Evaluation for Large Language Models. arXiv preprint arXiv:2311.07911, 2024.
Appendix A: Endpoint Registry (excerpt)
This appendix documents the v1.0 endpoint registry, the inclusion criteria, and the per-provider category breakdown.
A.1 Inclusion criteria. Endpoints were included if they:
- exposed a publicly-accessible inference API (free or paid; sign-up gates are acceptable, invite gates are not)
- served at least one of the 12 model families covered in v1.0 (Claude Opus 4.7, Claude Sonnet 4.6, GPT-5.5, Gemini 3.1 Pro, Grok 4, DeepSeek V3.2, gpt-oss-120B, Llama 3.3 70B, Mistral Large 2, Qwen 3.5, Kimi K2.6, GLM 5)
- published or made discoverable a stable identifier for their SKU (Reference, Turbo, Fast, FP8, etc.) so that endpoint identity is verifiable.
Endpoints were excluded if they (a) served only superseded model versions; (b) were exclusively private (enterprise-only with no public access tier); (c) did not respond to probes for ≥14 consecutive days during the ...