pith. machine review for the scientific record.

arxiv: 2605.00300 · v1 · submitted 2026-05-01 · 💻 cs.AI · cs.DC · cs.LG · cs.PF

Token Arena: A Continuous Benchmark Unifying Energy and Cognition in AI Inference

Pith reviewed 2026-05-09 20:10 UTC · model grok-4.3

classification 💻 cs.AI · cs.DC · cs.LG · cs.PF
keywords AI inference benchmarking · endpoint evaluation · energy efficiency · model accuracy variation · latency · workload pricing · continuous benchmarking

The pith

The same AI model varies by up to 12.5 accuracy points and by a factor of 6.2 in modeled energy use across different endpoints.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

TokenArena evaluates AI inference at the endpoint level, the specific combination of model, provider, quantization, and serving setup where real decisions occur. It records output speed, time to first token, blended price under different workloads, effective context, live quality, and modeled energy, then folds these into three summary scores: joules per correct answer, dollars per correct answer, and similarity to a first-party reference output distribution. Measurements on 78 endpoints across 12 model families show the same model can swing 12.5 points in accuracy on math and code tasks, 12 points in output similarity, an order of magnitude in tail latency, and more than six times in energy per correct answer. Switching the workload ratio used for pricing also moves seven of the top-ten endpoints out of the top ten.

Core claim

Across 78 endpoints serving 12 model families, the same model on different endpoints differs in mean accuracy by up to 12.5 points on math and code, in fingerprint similarity to first party by up to 12 points, in tail latency by an order of magnitude, and in modeled joules per correct answer by a factor of 6.2. Workload-aware blended pricing reorders the leaderboard substantially: 7 of 10 top-ranked endpoints under the chat preset (3:1 input:output) fall out of the top 10 under the retrieval-augmented preset (20:1), and the reasoning preset (1:5) elevates frontier closed models that the chat preset penalizes on price.
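
To make the presets concrete, here is a minimal sketch of workload-blended pricing, assuming the blend is a token-weighted average of per-token input and output prices. The preset ratios are the paper's; the blending formula and the endpoint prices below are illustrative assumptions, not taken from the paper.

```python
# Sketch of workload-blended pricing. Assumption: the blended price is
# a token-weighted average of input and output prices. Preset ratios
# (chat 3:1, retrieval 20:1, reasoning 1:5) come from the paper; the
# per-million-token prices are hypothetical.

def blended_price(price_in, price_out, ratio_in, ratio_out):
    """Blended $/Mtok for a workload with the given input:output ratio."""
    return (ratio_in * price_in + ratio_out * price_out) / (ratio_in + ratio_out)

PRESETS = {"chat": (3, 1), "retrieval": (20, 1), "reasoning": (1, 5)}

# Two hypothetical endpoints: cheap input tokens vs. flat pricing.
ENDPOINTS = {"endpoint_a": (0.50, 4.00), "endpoint_b": (1.50, 1.50)}

for name, (r_in, r_out) in PRESETS.items():
    prices = {e: round(blended_price(p_in, p_out, r_in, r_out), 3)
              for e, (p_in, p_out) in ENDPOINTS.items()}
    print(name, prices)
# endpoint_a is cheaper under chat and retrieval; endpoint_b is cheaper
# under reasoning (1:5), illustrating how presets can reorder a leaderboard.
```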

What carries the argument

TokenArena, a continuous benchmark that measures live endpoints on five axes (output speed, time to first token, workload-blended price, effective context, quality) plus modeled energy and collapses them into composite scores of joules per correct answer, dollars per correct answer, and endpoint fidelity.
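
Read literally, the two cost composites are ratios over the same quality run. A minimal sketch under that reading; the field names are illustrative and the released schema may differ.

```python
from dataclasses import dataclass

@dataclass
class EndpointRun:
    """One endpoint's eval run; all field names are illustrative."""
    modeled_energy_j: float  # modeled energy spent on the quality suite
    cost_usd: float          # workload-blended cost of the same run
    n_correct: int           # correct answers on the quality suite

def joules_per_correct(run: EndpointRun) -> float:
    return run.modeled_energy_j / max(run.n_correct, 1)

def dollars_per_correct(run: EndpointRun) -> float:
    return run.cost_usd / max(run.n_correct, 1)
```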

If this is right

  • Endpoints serving the same model are not interchangeable in quality, speed, or efficiency.
  • Leaderboard position depends on the input-to-output ratio of the target workload.
  • Continuous re-evaluation is required because endpoint behavior can shift over time.
  • Open release of the probe, harness, and schema allows independent replication and extension.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the energy models hold, selection by joules per correct answer could guide lower-carbon deployments.
  • The fidelity metric may expose differences in quantization or decoding that affect downstream reliability; a toy stand-in for such a metric is sketched after this list.
  • Workload-specific rankings suggest providers could offer tiered endpoints optimized for chat versus retrieval use cases.
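
The paper defines fidelity only as output-distribution similarity to a first-party reference; the actual fingerprint construction is not described in this summary. As a toy stand-in, one could compare unigram token distributions with cosine similarity:

```python
from collections import Counter
import math

def token_distribution(outputs):
    """Unigram token frequencies over a set of sampled outputs.
    Whitespace tokenization is a stand-in for whatever the paper uses."""
    counts = Counter()
    for text in outputs:
        counts.update(text.split())
    return counts

def fidelity(reference_outputs, candidate_outputs):
    """Cosine similarity between unigram distributions, in [0, 1].
    A toy proxy for the paper's unspecified fingerprint metric."""
    p = token_distribution(reference_outputs)
    q = token_distribution(candidate_outputs)
    dot = sum(p[token] * q[token] for token in p)
    norm_p = math.sqrt(sum(v * v for v in p.values()))
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    return dot / (norm_p * norm_q) if norm_p and norm_q else 0.0
```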

Load-bearing premise

Modeled energy numbers match real power draw and the quality scores on live endpoints give a fair, comparable measure of correct answers across providers.

What would settle it

Direct wattage measurements taken while running the benchmark prompts on a subset of the 78 endpoints to test whether actual energy per correct answer matches the modeled values within the observed factor-of-6.2 spread.
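
Where the serving hardware is under the experimenter's control, that check could look like the sketch below. `run_workload` and `read_power_watts` are hypothetical hooks; live cloud endpoints expose neither, which is presumably why the paper models energy rather than metering it.

```python
import threading
import time

def measured_joules(run_workload, read_power_watts, dt=0.1):
    """Integrate wall-power samples over the workload's runtime.
    Both callables are hypothetical hooks: run_workload drives the
    benchmark prompts, read_power_watts polls an external meter."""
    samples = []
    done = threading.Event()

    def sampler():
        while not done.is_set():
            samples.append(read_power_watts())
            time.sleep(dt)

    thread = threading.Thread(target=sampler)
    start = time.monotonic()
    thread.start()
    run_workload()
    done.set()
    thread.join()
    elapsed = time.monotonic() - start
    mean_power_w = sum(samples) / max(len(samples), 1)
    return mean_power_w * elapsed  # joules = mean watts x seconds

def agrees_with_model(measured_j, modeled_j, tolerance=0.25):
    """Flag endpoints whose metered energy strays from the modeled value."""
    return abs(measured_j - modeled_j) / modeled_j <= tolerance
```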

Figures

Figures reproduced from arXiv: 2605.00300 by Megan Wang, Yi Ling Yu, Yuxuan Gao.

Figure 1: Token Arena pipeline. Three measurement loops — probe (continuous), eval (daily and …

Original abstract

Public inference benchmarks compare AI systems at the model and provider level, but the unit at which deployment decisions are actually made is the endpoint: the (provider, model, stock-keeping-unit) tuple at which a specific quantization, decoding strategy, region, and serving stack is exposed. We introduce TokenArena, a continuous benchmark that measures inference at endpoint granularity along five core axes (output speed, time to first token, workload-blended price, effective context, and quality on the live endpoint) and synthesizes them, together with a modeled energy estimate, into three headline composites: joules per correct answer, dollars per correct answer, and endpoint fidelity (output-distribution similarity to a first-party reference). The framework's novelty is empirical and methodological. Across 78 endpoints serving 12 model families, the same model on different endpoints differs in mean accuracy by up to 12.5 points on math and code, in fingerprint similarity to first party by up to 12 points, in tail latency by an order of magnitude, and in modeled joules per correct answer by a factor of 6.2. We further show that workload-aware blended pricing reorders the leaderboard substantially: 7 of 10 top-ranked endpoints under the chat preset (3:1 input:output) fall out of the top 10 under the retrieval-augmented preset (20:1), and the reasoning preset (1:5) elevates frontier closed models that the chat preset penalizes on price. We release the framework, schema, probe and eval harness, and a v1.0 leaderboard snapshot under CC BY 4.0. TokenArena is a methodology, not a single ranking; we publish full provenance and limitations and welcome external replication.

Editorial analysis

A structured set of objections, weighed in public.

Referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces TokenArena, a continuous benchmark for AI inference at endpoint granularity. It evaluates 78 endpoints across 12 model families on output speed, time to first token, workload-blended price, effective context, live-endpoint quality, and a modeled energy estimate, synthesizing them into three headline composites: joules per correct answer, dollars per correct answer, and endpoint fidelity (output-distribution similarity to a first-party reference). Key empirical findings include variations of up to 12.5 points in mean accuracy on math/code tasks, 12 points in fingerprint similarity, an order of magnitude in tail latency, and a factor of 6.2 in modeled joules per correct answer; workload-aware blended pricing is shown to reorder leaderboards substantially across chat, retrieval-augmented, and reasoning presets. The framework, schema, probe, eval harness, and v1.0 snapshot are released under CC BY 4.0.

Significance. If the central measurements hold, the work provides a deployment-relevant unification of energy and cognitive metrics at the actual unit of choice (endpoints rather than models), with clear practical implications for how pricing and workload presets affect rankings. The empirical scope across 78 endpoints and the public release of artifacts, full provenance, and limitations are notable strengths that enable replication and extension. The demonstration that 7 of 10 top endpoints under one preset fall out under another underscores the value of workload-aware evaluation.

major comments (2)
  1. [Abstract and §3] Abstract and §3 (Energy Modeling): The headline claim of up to 6.2× variation in modeled joules per correct answer (one of the three core composites driving energy-based conclusions) rests on an energy estimate described only as 'modeled' with no reported calibration, hardware-specific power curves, direct metering on the evaluated endpoints, or error analysis. This is load-bearing for the joules metric and any energy-related reordering or efficiency claims.
  2. [§4] §4 (Empirical Results): The reported quality metrics (mean accuracy differences up to 12.5 points and fingerprint similarity up to 12 points) on live endpoints are presented without error bars, statistical significance tests, or explicit discussion of cross-provider comparability of 'correct answers,' which directly affects the reliability of the variation claims and the fidelity composite.
minor comments (2)
  1. [§2] §2 (Related Work): A brief comparison table with prior inference benchmarks (e.g., on latency or cost) would clarify the incremental contribution of endpoint granularity and the energy axis.
  2. [§5] Figure captions and §5 (Leaderboard Analysis): Ensure all workload presets (chat 3:1, retrieval 20:1, reasoning 1:5) are defined with explicit token ratios in the main text for readers to replicate the reordering results.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which help clarify the presentation of our energy modeling and empirical results. We address each major point below and will incorporate revisions to strengthen the manuscript.

point-by-point responses
  1. Referee: [Abstract and §3] The headline claim of up to 6.2× variation in modeled joules per correct answer rests on an energy estimate described only as 'modeled' with no reported calibration, hardware-specific power curves, direct metering on the evaluated endpoints, or error analysis. This is load-bearing for the joules metric and any energy-related reordering or efficiency claims.

    Authors: We acknowledge the need for greater transparency in the energy modeling. The estimates rely on a model derived from published hardware TDP values, typical inference power curves from vendor documentation, and workload-specific utilization factors rather than direct metering, as live cloud endpoints do not expose hardware-level power data. We will expand §3 with a full description of the model equations, data sources for power curves, and an explicit error analysis (including sensitivity to utilization assumptions and bounds on the reported 6.2× factor). We will also add caveats in the abstract and §5 noting that these are modeled rather than metered values and discuss implications for the joules-per-correct-answer composite. revision: yes

  2. Referee: [§4] The reported quality metrics (mean accuracy differences up to 12.5 points and fingerprint similarity up to 12 points) on live endpoints are presented without error bars, statistical significance tests, or explicit discussion of cross-provider comparability of 'correct answers,' which directly affects the reliability of the variation claims and the fidelity composite.

    Authors: We agree that statistical support and comparability details will improve interpretability. In the revision we will add error bars (standard error across prompt sets) to the accuracy and similarity figures in §4, report statistical significance tests (e.g., paired t-tests on per-endpoint scores), and include a new paragraph discussing cross-provider comparability of correct answers. This will cover standardization via our open eval harness, tokenization differences, and any provider-specific output constraints, while noting how these affect the fidelity metric. revision: yes
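
Both commitments can be made concrete. Response 1 describes an energy model built from TDP values and utilization factors; the sketch below is an assumption about its rough shape, not the authors' equations, and every number is illustrative.

```python
def modeled_energy_j(tdp_w, utilization, n_accelerators, gen_time_s, batch_size):
    """Toy TDP-and-utilization energy model in the spirit of the
    rebuttal's description. Real power curves are nonlinear in load,
    and host CPU, DRAM, networking, and datacenter PUE are ignored."""
    device_power_w = tdp_w * utilization * n_accelerators
    return device_power_w * gen_time_s / max(batch_size, 1)  # per-request share

# Hypothetical numbers: 8 accelerators at 700 W TDP, 60% utilization,
# a 2 s generation amortized across a batch of 16 concurrent requests.
print(modeled_energy_j(700, 0.6, 8, 2.0, 16))  # 420.0 J per request
```

For response 2, the proposed statistics are standard. A minimal sketch of the paired test and standard error the authors commit to, taking per-prompt-set accuracies as illustrative inputs:

```python
import numpy as np
from scipy import stats

def compare_endpoints(acc_a, acc_b):
    """Paired comparison of two endpoints serving the same model.
    acc_a, acc_b: mean accuracy per prompt set (same sets, same order)."""
    acc_a, acc_b = np.asarray(acc_a), np.asarray(acc_b)
    diff = acc_a - acc_b
    sem = diff.std(ddof=1) / np.sqrt(diff.size)  # SE of the mean difference
    t_stat, p_value = stats.ttest_rel(acc_a, acc_b)  # paired t-test
    return diff.mean(), sem, t_stat, p_value
```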

Circularity Check

0 steps flagged

Empirical benchmark with no self-referential derivation or fitted predictions

full rationale

The paper introduces TokenArena as a new continuous benchmark that collects direct empirical measurements of latency, accuracy, price, fidelity, and a modeled energy estimate across 78 live endpoints. Headline results (e.g., up to 12.5-point accuracy differences, 6.2× modeled joules variation, and leaderboard reordering under different workloads) are presented as observations from these measurements rather than derived quantities. No equations, self-citations, or ansatzes are shown that reduce any composite metric to its inputs by construction; the energy component is explicitly labeled as modeled without any claim that it is fitted to the same data it evaluates. The work is self-contained as a methodology with released artifacts, satisfying the default expectation of no significant circularity.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central composites rest on a modeled energy estimate whose parameters are not detailed and on the assumption that live-endpoint quality scores are comparable across providers.

free parameters (1)
  • energy model coefficients
    Used to derive the modeled joules per correct answer from inference metrics.
axioms (1)
  • domain assumption: Quality metrics on live endpoints provide a fair proxy for 'correct answers' across heterogeneous providers and stacks.
    Invoked to define the denominator in joules-per-correct-answer and dollars-per-correct-answer composites.

pith-pipeline@v0.9.0 · 5627 in / 1434 out tokens · 58735 ms · 2026-05-09T20:10:13.717603+00:00 · methodology
