Recognition: 2 theorem links · Lean Theorem
A Calculus-Based Framework for Determining Vocabulary Size in End-to-End ASR
Pith reviewed 2026-05-15 02:04 UTC · model grok-4.3
The pith
Calculus locates the optimal vocabulary size for end-to-end ASR by fitting a cost curve and applying derivative tests.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper claims that modeling tokenization cost as a function of vocabulary size, fitting a curve to observed values, and locating the minimum via first and second derivative tests yields a vocabulary size that measurably improves end-to-end ASR performance on the Librispeech corpus.
What carries the argument
A smooth curve fitted to cost-function values computed at multiple vocabulary sizes, whose minimum is found by first and second derivative tests.
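A minimal sketch of that machinery in Python, assuming hypothetical (vocabulary size, cost) samples and a quadratic fit, the form the paper's quoted Δ(n) and Θ(n) reduce to; none of the values below come from the paper, and a real run would sample the cost function of [1] at each candidate size.

```python
import numpy as np

# Hypothetical (vocabulary size, tokenization cost) samples; the paper's
# actual Librispeech cost values are not reproduced here.
sizes = np.array([100.0, 200.0, 300.0, 400.0, 500.0, 600.0, 700.0])
costs = np.array([3.80, 3.10, 2.80, 2.75, 2.90, 3.20, 3.70])

# Fit a smooth curve to the sampled costs. With quadratic Delta(n) and
# Theta(n), the paper's C(n) collapses to a quadratic in n.
a, b, c = np.polyfit(sizes, costs, deg=2)

# First-derivative test: dC/dn = 2*a*n + b = 0  =>  n* = -b / (2*a).
n_star = -b / (2 * a)

# Second-derivative test: d^2C/dn^2 = 2*a > 0 confirms a minimum.
if 2 * a > 0:
    print(f"estimated optimal vocabulary size: {n_star:.0f}")
else:
    print("stationary point is not a minimum; refit or resample")
```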
If this is right
- Vocabulary size can be chosen systematically rather than by repeated trial runs.
- ASR word error rates decrease when models are trained at the size identified by the derivative tests.
- The same curve-fitting procedure applies to any tokenization algorithm viewed as a black box.
- Standard training recipes can replace fixed vocabulary sizes with these calculated optima.
Where Pith is reading between the lines
- The same derivative-based approach could be tested on other hyper-parameters whose effect on a cost metric can be sampled.
- If the cost surface proves non-convex, numerical optimization routines would become a natural next extension.
- The framework may reduce the total compute spent on hyper-parameter sweeps during ASR development.
Load-bearing premise
The cost values for different vocabulary sizes can be fitted by a differentiable curve whose minimum is both identifiable by derivatives and predictive of better ASR accuracy.
What would settle it
Train separate end-to-end ASR models at several vocabulary sizes around the predicted optimum and verify whether the measured word error rate reaches its lowest point exactly at the calculus-derived size.
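A sketch of that settling experiment, where train_and_eval_wer is a hypothetical stand-in for a full ESPnet-style training and decoding run; the synthetic WER curve inside it exists only so the loop executes.

```python
def train_and_eval_wer(vocab_size: int) -> float:
    """Hypothetical stand-in: a real run would train an end-to-end ASR
    model at this vocabulary size and decode a held-out Librispeech
    test set. The synthetic curve below is for illustration only."""
    return 8.0 + 1e-5 * (vocab_size - 300) ** 2

predicted_optimum = 300  # size suggested by the derivative tests (example value)
candidates = [100, 200, 300, 500, 1000]  # sizes bracketing the prediction

wers = {n: train_and_eval_wer(n) for n in candidates}
best = min(wers, key=wers.get)
print(f"lowest measured WER at vocabulary size {best}; "
      f"calculus-derived prediction {'confirmed' if best == predicted_optimum else 'not confirmed'}")
```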
Original abstract
In hybrid automatic speech recognition (ASR) systems, the vocabulary size is unambiguous, typically determined by the number of phones, bi-phones, or tri-phones present in the language. In contrast, end-to-end ASR systems derive their vocabulary, often referred to as tokens, from the text corpus used for training. The choice, and more importantly the size, of this vocabulary is a critical hyper-parameter in training end-to-end ASR systems. Tokenization algorithms such as Byte Pair Encoding (BPE), WordPiece, and Unigram Language Model (ULM) use the vocabulary size as an input hyper-parameter to generate the sub-words employed during ASR training. Popular toolkits like ESPnet provide a fixed vocabulary size in their training recipes, but there is little documentation or discussion in the literature regarding how these values are determined. Recent work [1] has formalized an approach to identify the vocabulary size best suited for end-to-end ASR, introducing a cost-function framework that treats the tokenization process as a black box. In this paper, we build upon that foundation by curve fitting the training data and using the principle of first and second derivative tests in calculus to formally estimate the vocabulary size hyper-parameter. We demonstrate the utility and usefulness of our approach by applying it to the standard Librispeech corpus and show that the optimal choice of the vocabulary size hyper-parameter improves the performance of the ASR. The main contribution of this paper lies in formalizing an approach to identify the vocabulary size best suited for training an end-to-end ASR system.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes a calculus-based framework for determining the optimal vocabulary size hyper-parameter in end-to-end ASR systems. It extends prior black-box cost-function work on tokenization by fitting a curve to training-derived cost values and applying first- and second-derivative tests to locate the minimum; the authors claim that the resulting vocabulary size improves ASR performance when applied to the Librispeech corpus.
Significance. If the empirical demonstration holds with quantitative validation, the approach would supply a systematic, derivative-based procedure for tuning a critical hyper-parameter that is currently chosen largely by ad hoc means in toolkits such as ESPnet, potentially reducing trial-and-error and improving model efficiency across datasets.
major comments (2)
- [Abstract] The central claim that the derived vocabulary size 'improves the performance of the ASR' on Librispeech is asserted without any reported WER values, baselines, fitting coefficients, or goodness-of-fit statistics, leaving the empirical utility unsupported by visible evidence.
- [Method] Curve-fitting step: the optimal size is located by fitting parameters to training-derived cost values and finding the extremum of that fit; by construction, the result is determined by the chosen functional form and the training data rather than by an independent test on held-out data, undermining the claim of a general, non-circular optimum.
minor comments (2)
- Provide the explicit functional form fitted to the cost data and report quantitative fit metrics (R², residuals) so readers can assess whether the first- and second-derivative tests are applied to a well-behaved curve; a sketch of such diagnostics follows this list.
- Include the full bibliographic entry for the referenced prior work [1] and clarify how the new calculus step differs from the black-box cost framework it builds upon.
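A minimal sketch of the fit diagnostics the first minor comment asks for, assuming a polynomial cost curve as in the earlier example; the paper's actual functional form and data may differ.

```python
import numpy as np

def fit_diagnostics(sizes, costs, deg=2):
    """Fit a polynomial cost curve and report its coefficients, R^2,
    and residuals, the goodness-of-fit evidence requested above."""
    coeffs = np.polyfit(sizes, costs, deg=deg)
    predicted = np.polyval(coeffs, sizes)
    residuals = costs - predicted
    ss_res = float(np.sum(residuals ** 2))
    ss_tot = float(np.sum((costs - np.mean(costs)) ** 2))
    return coeffs, 1.0 - ss_res / ss_tot, residuals

# Example with the hypothetical samples from the earlier sketch:
sizes = np.array([100.0, 200.0, 300.0, 400.0, 500.0, 600.0, 700.0])
costs = np.array([3.80, 3.10, 2.80, 2.75, 2.90, 3.20, 3.70])
coeffs, r2, res = fit_diagnostics(sizes, costs)
print(f"R^2 = {r2:.4f}, max |residual| = {np.max(np.abs(res)):.3f}")
```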
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address each major comment below and indicate the revisions we will make to strengthen the presentation of our results and methodology.
Point-by-point responses
- Referee: [Abstract] The central claim that the derived vocabulary size 'improves the performance of the ASR' on Librispeech is asserted without any reported WER values, baselines, fitting coefficients, or goodness-of-fit statistics, leaving the empirical utility unsupported by visible evidence.
  Authors: We agree that the abstract should provide quantitative support for the central claim. In the revised manuscript we will add the key WER values achieved with the derived vocabulary size, the corresponding baselines, the fitted coefficients, and the goodness-of-fit statistics so that the empirical utility is evident directly from the abstract. Revision: yes.
- Referee: [Method] Curve-fitting step: the optimal size is located by fitting parameters to training-derived cost values and finding the extremum of that fit; by construction, the result is determined by the chosen functional form and the training data rather than by an independent test on held-out data, undermining the claim of a general, non-circular optimum.
  Authors: The cost values are computed from the tokenization of the training corpus because vocabulary size directly governs the tokenization step used in ASR training. The curve fit and derivative tests then locate the minimum of this cost function in a systematic way. We will revise the method section to clarify that the vocabulary size identified by this procedure is subsequently used to train an ASR model whose performance is measured on the standard held-out Librispeech test sets, thereby providing an independent evaluation of the resulting WER improvement. Revision: partial.
Circularity Check
Fitted cost curve extremum presented as independently optimal vocabulary size
specific steps
- fitted input called prediction · [Abstract]
"we build upon that foundation by curve fitting the training data and using the principle of first and second derivative tests in calculus to formally estimate the vocabulary size hyper-parameter. We demonstrate the utility and usefulness of our approach by applying it on a standard Librispeech corpus and show that the optimal choice of vocabulary size hyper-parameter improves the performance of the ASR."
The vocabulary size is estimated by fitting a curve to cost-function data derived from the tokenization process and locating its extremum via derivatives; the reported optimum is therefore the direct mathematical consequence of the fitted model rather than an externally validated choice independent of the fit.
full rationale
The derivation fits a curve to cost values computed directly from the tokenization process on training data, then applies first- and second-derivative tests to locate the minimum; the resulting vocabulary size is therefore the mathematical extremum of that fitted function by construction. The paper then reports that this size improves ASR performance on Librispeech, but the identification step itself reduces to properties of the fit rather than an independent test. This constitutes a fitted-input-called-prediction pattern with partial circularity; the central claim retains some external validation via the held-out ASR evaluation, preventing a higher score.
Axiom & Free-Parameter Ledger
free parameters (1)
- curve-fitting coefficients
axioms (1)
- domain assumption: the tokenization cost function is a smooth, twice-differentiable function of vocabulary size
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear (relation between the paper passage and the cited Recognition theorem)
  Paper passage: C(n) = α₁n + α₂Δ(n) + α₃Θ(n) … dC/dn = α₁ + α₂Δ′(n) + α₃Θ′(n) = 0 … second derivative test d²C/dn² > 0
- IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean · J_uniquely_calibrated_via_higher_derivative · unclear (relation between the paper passage and the cited Recognition theorem)
  Paper passage: Δ(n) ≜ d₂n² + d₁n + d₀, Θ(n) ≜ f₂n² + f₁n + f₀ … n = −(α₁ + α₂d₁ + α₃f₁) / (2(α₂d₂ + α₃f₂))
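Taken together, the two quoted passages determine the optimum in closed form. A minimal Python rendering of that arithmetic, with all coefficient values hypothetical:

```python
def closed_form_optimum(a1, a2, a3, d, f):
    """Stationary point of C(n) = a1*n + a2*Delta(n) + a3*Theta(n),
    where Delta(n) = d[2]*n**2 + d[1]*n + d[0] and
          Theta(n) = f[2]*n**2 + f[1]*n + f[0].
    Setting dC/dn = a1 + a2*(2*d[2]*n + d[1]) + a3*(2*f[2]*n + f[1]) = 0
    gives n = -(a1 + a2*d[1] + a3*f[1]) / (2*(a2*d[2] + a3*f[2])),
    a minimum iff the second derivative 2*(a2*d[2] + a3*f[2]) > 0."""
    curvature = 2 * (a2 * d[2] + a3 * f[2])
    n_star = -(a1 + a2 * d[1] + a3 * f[1]) / curvature
    return n_star, curvature > 0

# Hypothetical coefficients, chosen only to exercise the formula:
n_star, is_min = closed_form_optimum(1.0, 2.0, 3.0,
                                     d=(0.0, -1.0, 0.002),
                                     f=(0.0, -0.5, 0.001))
print(f"n* = {n_star:.1f}, minimum: {is_min}")
```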
What do these tags mean?
- matches: the paper's claim is directly supported by a theorem in the formal canon.
- supports: the theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: the paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: the paper appears to rely on the theorem as machinery.
- contradicts: the paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1] S. K. Kopparapu and A. Panda, "A cost minimization approach to fix the vocabulary size in a tokenizer for an end-to-end ASR system," in Proceedings of the 2024 International Conference on Pattern Recognition (ICPR), Kolkata, India, December 1–5, 2024.
- [2] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, "Librispeech ASR corpus: train-clean-100," https://www.openslr.org/resources/12/train-clean-100.tar.gz, 2015, accessed 2024-06-26.
- [3] D. C. Montgomery, E. A. Peck, and G. G. Vining, Introduction to Linear Regression Analysis, 5th ed. Hoboken, NJ: John Wiley & Sons.
- [4] [Online]. Available: https://www.wiley.com/en-us/Introduction+to+Linear+Regression+Analysis,+5th+Edition-p-9781119578727
- [5] S. Watanabe, T. Hori, S. Karita, T. Hayashi, J. Nishitoba, Y. Unno, N. Enrique Yalta Soplin, J. Heymann, M. Wiesner, N. Chen, A. Renduchintala, and T. Ochiai, "ESPnet: End-to-End Speech Processing Toolkit," in Proc. Interspeech 2018, 2018, pp. 2207–2211.
- [6] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, "Librispeech ASR corpus: test-clean-100," https://www.openslr.org/resources/12/test-clean.tar.gz, 2015, accessed 2024-06-26.
- [7] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, "Librispeech ASR corpus: test-other-100," https://www.openslr.org/resources/12/test-other.tar.gz, 2015, accessed 2024-06-26.
- [8] A. Gulati, J. Qin, C.-C. Chiu, N. Parmar, Y. Zhang, J. Yu, W. Han, S. Wang, Z. Zhang, Y. Wu, and R. Pang, "Conformer: Convolution-augmented transformer for speech recognition," in Interspeech 2020, 2020, pp. 5036–5040.
- [9] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," in 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7–9, 2015, Conference Track Proceedings, Y. Bengio and Y. LeCun, Eds., 2015.
- [10] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, "Attention is all you need," in Advances in Neural Information Processing Systems, I. Guyon, U. von Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, Eds., vol. 30. Curran Associates, Inc., 2017.
- [11] T. Ko, V. Peddinti, D. Povey, and S. Khudanpur, "Audio augmentation for speech recognition," in Proc. Interspeech 2015, 2015, pp. 3586–3589.
- [12] D. S. Park, W. Chan, Y. Zhang, C.-C. Chiu, B. Zoph, E. D. Cubuk, and Q. V. Le, "SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition," in Proc. Interspeech 2019, 2019, pp. 2613–2617.
- [13] Appendix A, Table IV: a list of (α₁, α₂, α₃) triples that satisfy n ≈ 300 (Line 3, Algorithm 3). While different triples are valid (for example (23.01, −31.56, 0.02) and (263571.00, −361185.85, 246.75)), the ratios α₁/α₃ and α₂/α₃ (with α₃/α₃ ≡ 1) are quite close to each other across triples. Table IV columns: initial (α₁, α₂, α₃) and the ratios α₁/α₃, α₂/α₃, α₃/α₃.