pith. machine review for the scientific record.

arxiv: 2604.03279 · v2 · submitted 2026-03-24 · 📡 eess.AS · cs.DC · cs.SD

Recognition: 2 Lean theorem links

Rewriting TTS Inference Economics: Lightning V2 on Tenstorrent Achieves 4x Lower Cost Than NVIDIA L40S

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 00:27 UTC · model grok-4.3

classification 📡 eess.AS · cs.DC · cs.SD
keywords text-to-speech · low-precision inference · BlockFloat8 · Tenstorrent · inference cost optimization · hardware-software co-design · audio quality preservation

The pith

Lightning V2 delivers 4x lower TTS inference cost on Tenstorrent than NVIDIA L40S at full production quality.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that a text-to-speech model can be redesigned and optimized specifically for Tenstorrent accelerators to use low-precision computation without hurting audio quality, yielding roughly four times lower hardware cost at the same output speed than an NVIDIA L40S setup. The key is combining precision-reduction techniques such as BlockFloat8 with hardware features like the network-on-chip and distributed memory to cut data movement. If the claims hold, this changes how cheaply real-time voice generation services can be run on-premise.
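
To make the block floating point idea concrete, here is a minimal sketch of BFP-style quantization: values are grouped into blocks that share one exponent (taken from the block's largest magnitude) while each value keeps only a short signed mantissa. The block size and 7-bit mantissa width here are illustrative assumptions, not the paper's actual BFP8 kernel layout.

```python
import math

def bfp8_quantize(block, mantissa_bits=7):
    """Quantize a block of floats to block floating point:
    one shared exponent derived from the block's max magnitude,
    plus one short signed integer mantissa per value."""
    max_mag = max(abs(x) for x in block)
    if max_mag == 0.0:
        return 0, [0] * len(block)
    # frexp gives max_mag = m * 2**e with 0.5 <= m < 1; e becomes the shared exponent
    shared_exp = math.frexp(max_mag)[1]
    scale = 2 ** (mantissa_bits - shared_exp)
    # Round each value into the mantissa range and clip to the signed width
    mantissas = [max(-2**mantissa_bits, min(2**mantissa_bits - 1, round(x * scale)))
                 for x in block]
    return shared_exp, mantissas

def bfp8_dequantize(shared_exp, mantissas, mantissa_bits=7):
    """Reconstruct approximate floats from the shared exponent and mantissas."""
    scale = 2 ** (mantissa_bits - shared_exp)
    return [m / scale for m in mantissas]
```

The characteristic failure mode for audio follows directly from this sketch: a small value sharing a block with a large one loses precision (here, 0.001 next to 1.0 quantizes to zero), which is exactly the kind of perturbation the abstract says waveform models are perceptually sensitive to.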

Core claim

By co-optimizing the TTS architecture for Tenstorrent's NoC, SRAM, and deterministic execution, Lightning V2 reaches over 95 percent LoFi fidelity and over 80 percent BlockFloat8 usage while keeping audio indistinguishable from full-precision baselines. This produces approximately 4x lower on-prem accelerator cost at equivalent throughput.

What carries the argument

Lightning V2, a precision-aware TTS model co-designed with Tenstorrent hardware features to minimize memory movement and enable aggressive low-precision inference.

If this is right

  • Real-time TTS services can be deployed at significantly lower hardware expense.
  • Production audio systems become viable on alternative accelerator platforms beyond NVIDIA.
  • Precision co-design becomes a standard approach for numerically sensitive generative models.
  • Overall inference economics for speech synthesis shift toward specialized hardware-software pairs.
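
The economics claim in these bullets reduces to dollars per unit of sustained throughput. A toy cost model makes the ratio arithmetic explicit; every price and throughput number below is a placeholder for illustration, not a figure from the paper.

```python
import math

def cards_needed(target_rps, per_card_rps):
    """Accelerators required to sustain a target request rate (requests/sec)."""
    return math.ceil(target_rps / per_card_rps)

def fleet_cost_usd(target_rps, per_card_rps, card_price_usd):
    """Capital cost of a fleet sized for the target rate."""
    return cards_needed(target_rps, per_card_rps) * card_price_usd

# Placeholder numbers purely to illustrate how a "4x" factor arises:
baseline = fleet_cost_usd(target_rps=100, per_card_rps=10, card_price_usd=8000)
alternative = fleet_cost_usd(target_rps=100, per_card_rps=20, card_price_usd=4000)
print(baseline / alternative)  # prints 4.0 with these placeholder inputs
```

The point of the sketch is that the headline factor is a ratio of (card price / card throughput) terms, so it is only meaningful once both quantities are pinned down symmetrically on each platform.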

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This approach may extend to other waveform-generating models like music or video synthesis where small errors are perceptible.
  • Future hardware designs could prioritize deterministic execution and on-chip networks to support low-precision workloads.
  • Independent verification of audio quality would require standardized perceptual tests across multiple listening conditions.
  • Cost savings could compound when scaling to larger batch sizes or multi-speaker setups not tested here.

Load-bearing premise

That the measured audio quality holds up under varied real-world production conditions and that the cost figures fully include all platform overheads on both Tenstorrent and NVIDIA sides.

What would settle it

A controlled A/B listening test using production-grade audio samples from both systems where listeners cannot distinguish them at above-chance levels, or a full system-level cost audit showing the claimed 4x factor.
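
"Cannot distinguish at above-chance levels" has a standard statistical form: an exact one-sided binomial test against the chance rate. A minimal sketch (the trial counts in the test are illustrative, and a real ABX study would also pre-register the threshold and correct for multiple listeners):

```python
from math import comb

def abx_p_value(correct, trials, chance=0.5):
    """One-sided exact binomial p-value: probability of observing at least
    `correct` successes in `trials` attempts under the chance-level null.
    A small p-value means listeners discriminate above chance."""
    return sum(comb(trials, k) * chance**k * (1 - chance)**(trials - k)
               for k in range(correct, trials + 1))
```

If listeners' p-value stays large across conditions, the "indistinguishable from full precision" claim survives; a small p-value in any production-relevant condition would falsify it.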

Figures

Figures reproduced from arXiv: 2604.03279 by Akshat Mandloi, Ranjith M. S., Sudarshan Kamath.

Figure 1. Spatial layout of Tensix cores and NoC connectivity.
Figure 2. Accelerator cost to sustain 550 5-second TTS requests, showing a 3–4…
read the original abstract

Text-to-Speech (TTS) models are significantly more numerically fragile than Large Language Models (LLMs) due to their continuous waveform generation and perceptual sensitivity to small numerical perturbations. While aggressive precision reduction techniques such as BlockFloat8 (BFP8) and low-fidelity (LoFi) compute have been widely adopted in language models, applying similar strategies to TTS systems often results in audible artifacts, phase instability, and spectral distortion. In this work, we present Lightning V2, a production-grade TTS model co-optimized for Tenstorrent hardware. Through precision-aware architectural design and hardware-software co-optimization, we achieve over 95% LoFi computational fidelity and more than 80% BlockFloat8 deployment without measurable degradation in audio quality. Leveraging Tenstorrent's Network-on-Chip (NoC), distributed SRAM, and deterministic execution model, we reduce memory movement and redundant weight fetches, enabling efficient low-precision inference. Compared to an NVIDIA L40S baseline, Lightning V2 achieves approximately 4x lower on-prem accelerator cost at equivalent throughput, while maintaining production audio fidelity. Our results demonstrate that precision co-design, combined with hardware-aware optimization, can fundamentally reshape the economics of real-time speech inference.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper presents Lightning V2, a production-grade TTS model co-optimized for Tenstorrent hardware via precision-aware design and hardware-software co-optimization. It claims over 95% LoFi computational fidelity and more than 80% BlockFloat8 deployment without measurable audio quality degradation, leveraging NoC, distributed SRAM, and deterministic execution to achieve approximately 4x lower on-prem accelerator cost than an NVIDIA L40S baseline at equivalent throughput while maintaining production fidelity.

Significance. If the empirical claims hold with full supporting data, the work could meaningfully advance hardware-aware low-precision inference for numerically fragile TTS models, potentially lowering real-time speech synthesis costs on specialized accelerators. The absence of any quantitative benchmarks, however, prevents assessment of whether the result would actually reshape inference economics.

major comments (3)
  1. [Abstract] The headline claim of ~4x lower accelerator cost at equivalent throughput is unsupported by any throughput definition (e.g., real-time factor or utterances/sec), hardware pricing breakdown, utilization rates, or cost equation; no table or section supplies these quantities, so the factor cannot be verified or reproduced.
  2. [Abstract] Fidelity claims (>95% LoFi, >80% BFP8, no measurable degradation) are stated without listening-test protocols, objective metrics (e.g., PESQ, STOI, or spectral distortion), error bars, baseline configurations, or data-exclusion rules, rendering the 'production audio fidelity' assertion unverifiable.
  3. The manuscript supplies no experimental setup section, results table, or appendix detailing batch sizes, latency bounds, power measurements, or system-level overheads (host CPU, software stack) for either platform, so it is impossible to confirm symmetric accounting in the cost comparison.
minor comments (1)
  1. [Abstract] The abstract uses 'approximately 4x' without defining the exact ratio or confidence interval; a precise definition would improve clarity.
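
The throughput quantities the referee asks for are standard and easy to pin down. A sketch of the definitions (these are textbook TTS serving metrics, not measurements from the paper; the idealized streams-per-card figure ignores batching and scheduling overhead):

```python
def real_time_factor(wall_clock_s, audio_s):
    """RTF = synthesis time / audio duration; RTF < 1 is faster than real time."""
    return wall_clock_s / audio_s

def concurrent_streams_per_card(rtf):
    """Real-time streams one card can sustain at a given RTF (idealized upper bound)."""
    return 1.0 / rtf

def utterances_per_second(batch_size, batch_latency_s):
    """Batched serving throughput in utterances/sec."""
    return batch_size / batch_latency_s
```

For example, synthesizing a 5-second utterance in 0.25 s of wall-clock time gives RTF 0.05, so one card could in principle carry 20 real-time streams; sizing a fleet for the 550-request workload of Figure 2 would then be arithmetic over these quantities, once each platform's measured RTF is reported.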

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback. We agree that the current manuscript lacks the detailed experimental data, metrics, and setup information needed to substantiate the abstract claims, and we will revise accordingly by adding the required sections, tables, and protocols. Point-by-point responses follow.

read point-by-point responses
  1. Referee: [Abstract] The headline claim of ~4x lower accelerator cost at equivalent throughput is unsupported by any throughput definition (e.g., real-time factor or utterances/sec), hardware pricing breakdown, utilization rates, or cost equation; no table or section supplies these quantities, so the factor cannot be verified or reproduced.

    Authors: We accept this point. The revised manuscript will add a dedicated cost-analysis section that explicitly defines throughput using real-time factor and utterances per second, provides hardware pricing breakdowns from public sources, reports utilization rates, and shows the full cost equation. A comparison table for the Tenstorrent and NVIDIA L40S platforms will be included to enable verification and reproduction. revision: yes

  2. Referee: [Abstract] Fidelity claims (>95% LoFi, >80% BFP8, no measurable degradation) are stated without listening-test protocols, objective metrics (e.g., PESQ, STOI, or spectral distortion), error bars, baseline configurations, or data-exclusion rules, rendering the 'production audio fidelity' assertion unverifiable.

    Authors: We agree that the fidelity claims require supporting evidence. The revision will include a new fidelity-evaluation section that details the listening-test protocols, reports objective metrics (PESQ, STOI, spectral distortion) with error bars, specifies baseline configurations, and states data-exclusion rules. This will substantiate the >95% LoFi computational fidelity and >80% BlockFloat8 deployment without measurable degradation. revision: yes

  3. Referee: [—] The manuscript supplies no experimental setup section, results table, or appendix detailing batch sizes, latency bounds, power measurements, or system-level overheads (host CPU, software stack) for either platform, so it is impossible to confirm symmetric accounting in the cost comparison.

    Authors: We acknowledge the absence of an experimental setup section. The revised version will add a comprehensive experimental setup section (with an accompanying results table and appendix) that specifies batch sizes, latency bounds, power measurements, and system-level overheads (including host CPU and software stack) for both platforms. This will ensure symmetric accounting and allow independent verification of the cost comparison. revision: yes
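
Of the objective metrics the rebuttal promises, log-spectral distortion is the simplest to state precisely. A sketch over precomputed, time-aligned magnitude-spectrum frames (the STFT framing parameters are assumed; PESQ and STOI would require their reference implementations and are not reproduced here):

```python
import math

def log_spectral_distortion_db(ref_frames, test_frames, eps=1e-10):
    """Mean log-spectral distortion in dB between two sequences of
    magnitude-spectrum frames (equal-length lists of bin magnitudes),
    assumed to come from time-aligned STFTs of the two signals."""
    per_frame = []
    for ref, test in zip(ref_frames, test_frames):
        # Per-bin squared difference of log-magnitudes, in dB
        sq = [(20 * math.log10(max(r, eps)) - 20 * math.log10(max(t, eps))) ** 2
              for r, t in zip(ref, test)]
        per_frame.append(math.sqrt(sum(sq) / len(sq)))
    return sum(per_frame) / len(per_frame)
```

Reporting this metric for the full-precision baseline versus the BFP8 deployment, with error bars over a held-out utterance set, is the kind of evidence that would let the ">95% LoFi, no measurable degradation" claim be checked independently.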

Circularity Check

0 steps flagged

No circularity: empirical hardware benchmark with no derivation chain

full rationale

The manuscript presents measured throughput and cost results from running Lightning V2 on Tenstorrent hardware versus an L40S baseline. No equations, fitted parameters, predictions, or self-citations are invoked to derive the 4x cost claim; the figure is stated as an observed outcome of the co-optimized model and platform. The abstract and provided text contain only architectural descriptions and empirical fidelity numbers (95% LoFi, 80% BFP8) without any self-referential reduction or ansatz smuggling. This is a standard empirical comparison paper whose central claim does not reduce to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no free parameters, axioms, or invented entities; the ledger is therefore empty.

pith-pipeline@v0.9.0 · 5545 in / 1106 out tokens · 31806 ms · 2026-05-15T00:27:59.208455+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · tag: unclear

    Relation between the paper passage and the cited Recognition theorem.

    Through precision-aware architectural design and hardware-software co-optimization, we achieve over 95% LoFi computational fidelity and more than 80% BlockFloat8 deployment without measurable degradation in audio quality. ... Compared to an NVIDIA L40S baseline, Lightning V2 achieves approximately 4× lower on-prem accelerator cost at equivalent concurrency

  • IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · embed_strictMono_of_one_lt · tag: unclear

    Relation between the paper passage and the cited Recognition theorem.

    Leveraging Tenstorrent’s Network-on-Chip (NoC), distributed SRAM, and deterministic execution model, we reduce memory movement and redundant weight fetches

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

17 extracted references · 17 canonical work pages · 1 internal anchor

  1. [1]

Matthew B. Hoy. Alexa, siri, cortana, and more: An introduction to voice assistants. Medical Reference Services Quarterly, 37(1):81–88, 2018

  2. [2]

    Fly-tts: Fast, lightweight and high-quality end-to-end text-to-speech synthesis, 2024

    Yinlin Guo, Yening Lv, Jinqiao Dou, Yan Zhang, and Yuehai Wang. Fly-tts: Fast, lightweight and high-quality end-to-end text-to-speech synthesis, 2024

  3. [3]

    Clear: Continuous latent autoregressive modeling for high-quality and low-latency speech synthesis, 2025

    Chun Yat Wu, Jiajun Deng, Guinan Li, Qiuqiang Kong, and Simon Lui. Clear: Continuous latent autoregressive modeling for high-quality and low-latency speech synthesis, 2025

  4. [4]

    Cm-tts: Enhancing real time text-to-speech synthesis efficiency through weighted samplers and consistency models, 2024

    Xiang Li, Fan Bu, Ambuj Mehrish, Yingting Li, Jiale Han, Bo Cheng, and Soujanya Poria. Cm-tts: Enhancing real time text-to-speech synthesis efficiency through weighted samplers and consistency models, 2024

  5. [5]

    From words to watts: Benchmarking the energy costs of large language model inference

Siddharth Samsi, Dan Zhao, Joseph McDonald, Baolin Li, Adam Michaleas, Michael Jones, William Bergeron, Jeremy Kepner, Devesh Tiwari, and Vijay Gadepally. From words to watts: Benchmarking the energy costs of large language model inference. In 2023 IEEE High Performance Extreme Computing Conference (HPEC), pages 1–9, 2023

  6. [6]

Power hungry processing: Watts driving the cost of ai deployment? In The 2024 ACM Conference on Fairness, Accountability, and Transparency, FAccT ’24, pages 85–99

Sasha Luccioni, Yacine Jernite, and Emma Strubell. Power hungry processing: Watts driving the cost of ai deployment? In The 2024 ACM Conference on Fairness, Accountability, and Transparency, FAccT ’24, pages 85–99. ACM, June 2024

  7. [7]

    FP8 Formats for Deep Learning

Paulius Micikevicius et al. FP8 formats for deep learning. arXiv preprint arXiv:2209.05433, 2022

  8. [8]

Block floating point. https://en.wikipedia.org/wiki/Block_floating_point, 2026

  9. [9]

    Matrix engine technical report (math fidelity) — tt-metal

Tenstorrent. Matrix engine technical report (math fidelity) — tt-metal. https://github.com/tenstorrent/tt-metal/blob/main/tech_reports/matrix_engine/matrix_engine.md, 2024. Accessed 2026-03-01

  10. [10]

    Grad-tts: A diffusion probabilistic model for text-to-speech, 2021

Vadim Popov, Ivan Vovk, Vladimir Gogoryan, Tasnima Sadekova, and Mikhail Kudinov. Grad-tts: A diffusion probabilistic model for text-to-speech, 2021

  11. [11]

    Naturalspeech 3: Zero-shot speech synthesis with factorized codec and diffusion models, 2024

Zeqian Ju, Yuancheng Wang, Kai Shen, Xu Tan, Detai Xin, Dongchao Yang, Yanqing Liu, Yichong Leng, Kaitao Song, Siliang Tang, Zhizheng Wu, Tao Qin, Xiang-Yang Li, Wei Ye, Shikun Zhang, Jiang Bian, Lei He, Jinyu Li, and Sheng Zhao. Naturalspeech 3: Zero-shot speech synthesis with factorized codec and diffusion models, 2024

  12. [12]

    Ai-synthesized voice detection using neural vocoder artifacts, 2023

    Chengzhe Sun, Shan Jia, Shuwei Hou, and Siwei Lyu. Ai-synthesized voice detection using neural vocoder artifacts, 2023

  13. [13]

Simple and efficient quantization techniques for neural audio models. arXiv preprint arXiv:2405.08417, 2024

Minje Kim et al. Simple and efficient quantization techniques for neural audio models. arXiv preprint arXiv:2405.08417, 2024

  14. [14]

IEEE standard for binary floating-point arithmetic. ANSI/IEEE Std 754-1985, pages 1–20, 1985

  15. [15]

    Mixed precision training, 2018

    Paulius Micikevicius, Sharan Narang, Jonah Alben, Gregory Diamos, Erich Elsen, David Garcia, Boris Ginsburg, Michael Houston, Oleksii Kuchaiev, Ganesh Venkatesh, and Hao Wu. Mixed precision training, 2018

  16. [16]

    A study of bfloat16 for deep learning training, 2019

Dhiraj Kalamkar, Dheevatsa Mudigere, Naveen Mellempudi, Dipankar Das, Kunal Banerjee, Sasikanth Avancha, Dharma Teja Vooturi, Nataraj Jammalamadaka, Jianyu Huang, Hector Yuen, Jiyan Yang, Jongsoo Park, Alexander Heinecke, Evangelos Georganas, Sudarshan Srinivasan, Abhisek Kundu, Misha Smelyanskiy, Bharat Kaul, and Pradeep Dubey. A study of bfloat16 for d...

  17. [17]

    Blackhole & tt-metalium: The standalone ai computer and its programming model

Jasmina Vasiljevic and Davor Capalija. Blackhole & tt-metalium: The standalone ai computer and its programming model. In Hot Chips 36 Symposium (HC36), August 2024. Presentation at Hot Chips 2024.