Pith · machine review for the scientific record

arxiv: 2605.14427 · v1 · submitted 2026-05-14 · 💻 cs.CL · cs.SD

Recognition: 2 theorem links

· Lean Theorem

A Calculus-Based Framework for Determining Vocabulary Size in End-to-End ASR

Authors on Pith · no claims yet

Pith reviewed 2026-05-15 02:04 UTC · model grok-4.3

classification 💻 cs.CL cs.SD
keywords end-to-end ASR · vocabulary size · tokenization · BPE · calculus optimization · derivative test · Librispeech · hyper-parameter

The pith

Calculus locates the optimal vocabulary size for end-to-end ASR by fitting a cost curve and applying derivative tests.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

End-to-end ASR systems treat vocabulary size as a critical hyper-parameter that tokenization algorithms like BPE receive before generating sub-word units. Prior work introduced a cost function to score different sizes without full model training. This paper fits a curve to those cost values and uses first and second derivative tests to identify the size at the minimum. Application to Librispeech shows that the resulting size produces lower word error rates than the fixed values in standard training recipes. The method converts vocabulary selection from an empirical search into an explicit calculus optimization step.
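The pipeline the summary describes can be sketched in a few lines: sample the cost function at candidate vocabulary sizes, fit a second-order polynomial (the form the paper's figure captions reference), and apply the two derivative tests. The cost values below are synthetic, generated from a known convex quadratic so the recovered optimum is checkable; the paper's real values come from the black-box cost function of [1].

```python
import numpy as np

# Synthetic cost values at candidate vocabulary sizes, generated from a
# known convex quadratic with minimum at n = 1500 (illustrative only; the
# paper's values come from the black-box cost function of [1]).
sizes = np.array([100.0, 200.0, 500.0, 1000.0, 2000.0, 4000.0])
costs = 2e-7 * (sizes - 1500.0) ** 2 + 0.3

# Fit c(n) = d2*n^2 + d1*n + d0 to the sampled values.
d2, d1, d0 = np.polyfit(sizes, costs, deg=2)

# First-derivative test: c'(n) = 2*d2*n + d1 = 0  =>  n* = -d1 / (2*d2).
n_star = -d1 / (2.0 * d2)

# Second-derivative test: c''(n) = 2*d2 > 0 confirms n* is a minimum.
assert 2.0 * d2 > 0

print(round(n_star))  # recovers the planted optimum, 1500
```

Because the fit is quadratic, the whole "optimization" collapses to the vertex formula; richer fitted forms would need a numerical root-finder on the first derivative instead.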

Core claim

The paper claims that modeling tokenization cost as a function of vocabulary size, fitting a curve to observed values, and locating the minimum via first and second derivative tests yields a vocabulary size that measurably improves end-to-end ASR performance on the Librispeech corpus.

What carries the argument

A smooth curve fitted to cost-function values computed at multiple vocabulary sizes, whose minimum is found by first and second derivative tests.
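For the second-order polynomial fit the figure captions reference, both tests reduce to a closed form (the symbols here are generic fit coefficients, not values from the paper):

```latex
c(n) = d_2 n^2 + d_1 n + d_0, \qquad
c'(n) = 2 d_2 n + d_1 = 0 \;\Longrightarrow\; n^{*} = -\frac{d_1}{2 d_2},
\qquad c''(n) = 2 d_2 > 0 \ \text{(a minimum iff } d_2 > 0\text{)}.
```

Note that the second-derivative test only certifies a minimum when the fitted curve is convex; a fit with $d_2 < 0$ would place $n^{*}$ at a maximum.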

If this is right

  • Vocabulary size can be chosen systematically rather than by repeated trial runs.
  • ASR word error rates decrease when models are trained at the size identified by the derivative tests.
  • The same curve-fitting procedure applies to any tokenization algorithm viewed as a black box.
  • Standard training recipes can replace fixed vocabulary sizes with these calculated optima.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same derivative-based approach could be tested on other hyper-parameters whose effect on a cost metric can be sampled.
  • If the cost surface proves non-convex, numerical optimization routines would become a natural next extension.
  • The framework may reduce the total compute spent on hyper-parameter sweeps during ASR development.

Load-bearing premise

The cost values for different vocabulary sizes can be fitted by a differentiable curve whose minimum is both identifiable by derivatives and predictive of better ASR accuracy.

What would settle it

Train separate end-to-end ASR models at several vocabulary sizes around the predicted optimum and verify whether the measured word error rate reaches its lowest point exactly at the calculus-derived size.
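That settling experiment amounts to a small comparison: train at sizes bracketing the predicted optimum and check whether the measured WER minimum lands on it. A minimal sketch with invented WER numbers (nothing below is from the paper):

```python
# Calculus-derived optimum (hypothetical) and WER measurements from models
# trained at bracketing vocabulary sizes (all values invented for illustration).
n_star = 300
wer_by_size = {100: 9.8, 200: 9.1, 300: 8.7, 500: 8.9, 1000: 9.4}  # size -> test WER %

# The claim is settled in the affirmative iff the empirical WER minimum
# coincides with the size predicted by the derivative tests.
measured_best = min(wer_by_size, key=wer_by_size.get)
claim_holds = (measured_best == n_star)
print(claim_holds)  # prints True for these invented numbers
```

A near-miss (e.g. the empirical minimum at an adjacent sampled size) would still be informative: it would bound how far the cost-curve proxy drifts from the WER surface.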

Figures

Figures reproduced from arXiv: 2605.14427 by Sunil Kumar Kopparapu.

Figure 1
Figure 1: Second-order polynomial fit (Eq. (16)) of (a) … view at source ↗
Figure 2
Figure 2: ∆norm(n) (right) and Θnorm(n) (left), shown in red, are the result of the second-order polynomial fit (Eq. (22)), while ∆norm(n) and Θnorm(n) derived from the LibriSpeech-100 corpus are shown as the blue curves. The y-axis lies between 0 and 1 because of the normalization in (22). view at source ↗
Figure 3
Figure 3: Exponential and second-order polynomial to represent … view at source ↗
read the original abstract

In hybrid automatic speech recognition (ASR) systems, the vocabulary size is unambiguous, typically determined by the number of phones, bi-phones, or tri-phones present in the language. In contrast, end-to-end ASR systems derive their vocabulary, often referred to as tokens, from the text corpus used for training. The choice and, more importantly, the size of this vocabulary is a critical hyper-parameter in training end-to-end ASR systems. Tokenization algorithms such as Byte Pair Encoding (BPE), WordPiece, and Unigram Language Model (ULM) use the vocabulary size as an input hyper-parameter to generate the sub-words employed during ASR training. Popular toolkits like ESPNet provide a fixed vocabulary size in their training recipes, but there is little documentation or discussion in the literature regarding how these values are determined. Recent work [1] has formalized an approach to identify the vocabulary size best suited for end-to-end ASR, introducing a cost function framework that treats the tokenization process as a black box. In this paper, we build upon that foundation by curve fitting the training data and using the principle of first and second derivative tests in calculus to formally estimate the vocabulary size hyper-parameter. We demonstrate the utility and usefulness of our approach by applying it on a standard Librispeech corpus and show that the optimal choice of vocabulary size hyper-parameter improves the performance of the ASR. The main contribution of this paper is in formalizing an approach to identify the vocabulary size best suited for training an end-to-end ASR system.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes a calculus-based framework for determining the optimal vocabulary size hyper-parameter in end-to-end ASR systems. It extends prior black-box cost-function work on tokenization by fitting a curve to training-derived cost values and applying first- and second-derivative tests to locate the minimum; the authors claim that the resulting vocabulary size improves ASR performance when applied to the Librispeech corpus.

Significance. If the empirical demonstration holds with quantitative validation, the approach would supply a systematic, derivative-based procedure for tuning a critical hyper-parameter that is currently chosen largely by ad-hoc means in toolkits such as ESPNet, potentially reducing trial-and-error and improving model efficiency across datasets.

major comments (2)
  1. [Abstract] The central claim that the derived vocabulary size 'improves the performance of the ASR' on Librispeech is asserted without any reported WER values, baselines, fitting coefficients, or goodness-of-fit statistics, leaving the empirical utility unsupported by visible evidence.
  2. [Method] Curve-fitting step: the optimal size is located by fitting parameters to training-derived cost values and finding the extremum of that fit; by construction the result is determined by the chosen functional form and the training data rather than by an independent test on held-out data, undermining the claim of a general, non-circular optimum.
minor comments (2)
  1. Provide the explicit functional form fitted to the cost data and report quantitative fit metrics (R², residuals) so readers can assess whether the first- and second-derivative tests are applied to a well-behaved curve.
  2. Include the full bibliographic entry for the referenced prior work [1] and clarify how the new calculus step differs from the black-box cost framework it builds upon.
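The R² check the first minor comment asks for is straightforward once the fit is in hand; a minimal sketch with synthetic cost samples (the data and coefficients below are illustrative, not the paper's):

```python
import numpy as np

def r_squared(y, y_hat):
    """Coefficient of determination of a fitted curve."""
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - np.mean(y)) ** 2)
    return 1.0 - ss_res / ss_tot

# Synthetic cost samples: a convex quadratic plus small perturbations
# (illustrative only; the paper does not report its cost values).
sizes = np.array([100.0, 200.0, 500.0, 1000.0, 2000.0])
costs = 1e-7 * (sizes - 800.0) ** 2 + 0.30
costs = costs + np.array([0.003, -0.002, 0.004, -0.003, 0.002])

coeffs = np.polyfit(sizes, costs, deg=2)
fit = np.polyval(coeffs, sizes)
r2 = r_squared(costs, fit)
print(r2 > 0.9)  # a well-behaved fit should explain most of the variance
```

Reporting R² alongside the residuals would let readers judge whether the derivative tests are being applied to a curve that actually tracks the sampled costs.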

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below and indicate the revisions we will make to strengthen the presentation of our results and methodology.

read point-by-point responses
  1. Referee: [Abstract] The central claim that the derived vocabulary size 'improves the performance of the ASR' on Librispeech is asserted without any reported WER values, baselines, fitting coefficients, or goodness-of-fit statistics, leaving the empirical utility unsupported by visible evidence.

    Authors: We agree that the abstract should provide quantitative support for the central claim. In the revised manuscript we will add the key WER values achieved with the derived vocabulary size, the corresponding baselines, the fitted coefficients, and the goodness-of-fit statistics so that the empirical utility is evident directly from the abstract. revision: yes

  2. Referee: [Method] Curve-fitting step: the optimal size is located by fitting parameters to training-derived cost values and finding the extremum of that fit; by construction the result is determined by the chosen functional form and the training data rather than by an independent test on held-out data, undermining the claim of a general, non-circular optimum.

    Authors: The cost values are computed from the tokenization of the training corpus because vocabulary size directly governs the tokenization step used in ASR training. The curve fit and derivative tests then locate the minimum of this cost function in a systematic way. We will revise the method section to clarify that the vocabulary size identified by this procedure is subsequently used to train an ASR model whose performance is measured on the standard held-out Librispeech test sets, thereby providing an independent evaluation of the resulting WER improvement. revision: partial

Circularity Check

1 step flagged

Fitted cost curve extremum presented as independently optimal vocabulary size

specific steps
  1. fitted input called prediction [Abstract]
    "we build upon that foundation by curve fitting the training data and using the principle of first and second derivative tests in calculus to formally estimate the vocabulary size hyper-parameter. We demonstrate the utility and usefulness of our approach by applying it on a standard Librispeech corpus and show that the optimal choice of vocabulary size hyper-parameter improves the performance of the ASR."

    The vocabulary size is estimated by fitting a curve to cost-function data derived from the tokenization process and locating its extremum via derivatives; the reported optimum is therefore the direct mathematical consequence of the fitted model rather than an externally validated choice independent of the fit.

full rationale

The derivation fits a curve to cost values computed directly from the tokenization process on training data, then applies first- and second-derivative tests to locate the minimum; the resulting vocabulary size is therefore the mathematical extremum of that fitted function by construction. The paper then reports that this size improves ASR performance on Librispeech, but the identification step itself reduces to properties of the fit rather than an independent test. This constitutes a fitted-input-called-prediction pattern with partial circularity; the central claim retains some external validation via the held-out ASR evaluation, preventing a higher score.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The framework rests on the prior cost function being evaluable across vocabulary sizes, the existence of a smooth differentiable curve that can be fitted, and the assumption that the calculus-derived minimum aligns with empirical ASR gains.

free parameters (1)
  • curve fitting coefficients
    Parameters of the curve fitted to cost values computed from the training corpus at different vocabulary sizes.
axioms (1)
  • domain assumption: The tokenization cost function is a smooth, twice-differentiable function of vocabulary size.
    Required to apply first and second derivative tests to locate a minimum.

pith-pipeline@v0.9.0 · 5569 in / 1344 out tokens · 52655 ms · 2026-05-15T02:04:22.903529+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

13 extracted references · 13 canonical work pages

  1. [1]

    A cost minimization approach to fix the vocabulary size in a tokenizer for an end-to-end ASR system,

    S. K. Kopparapu and A. Panda, "A cost minimization approach to fix the vocabulary size in a tokenizer for an end-to-end ASR system," in Proceedings of the 2024 International Conference on Pattern Recognition (ICPR), Kolkata, India, December 1–5, 2024.

  2. [2]

    Librispeech ASR corpus: train-clean-100,

    V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, "Librispeech ASR corpus: train-clean-100," https://www.openslr.org/resources/12/train-clean-100.tar.gz, 2015, accessed: 2024-06-26.

  3. [3]

    D. C. Montgomery, E. A. Peck, and G. G. Vining, Introduction to Linear Regression Analysis, 5th ed. Hoboken, NJ: John Wiley & Sons.

  4. [4]

    [Online]. Available: https://www.wiley.com/en-us/Introduction+to+Linear+Regression+Analysis,+5th+Edition-p-9781119578727

  5. [5]

    ESPnet: End-to-End Speech Processing Toolkit,

    S. Watanabe, T. Hori, S. Karita, T. Hayashi, J. Nishitoba, Y. Unno, N. Enrique Yalta Soplin, J. Heymann, M. Wiesner, N. Chen, A. Renduchintala, and T. Ochiai, "ESPnet: End-to-End Speech Processing Toolkit," in Proc. Interspeech 2018, 2018, pp. 2207–2211.

  6. [6]

    Librispeech ASR corpus: test-clean-100,

    V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, "Librispeech ASR corpus: test-clean-100," https://www.openslr.org/resources/12/test-clean.tar.gz, 2015, accessed: 2024-06-26.

  7. [7]

    Librispeech ASR corpus: test-other-100,

    ——, "Librispeech ASR corpus: test-other-100," https://www.openslr.org/resources/12/test-other.tar.gz, 2015, accessed: 2024-06-26.

  8. [8]

    Conformer: Convolution-augmented transformer for speech recognition,

    A. Gulati, J. Qin, C.-C. Chiu, N. Parmar, Y. Zhang, J. Yu, W. Han, S. Wang, Z. Zhang, Y. Wu, and R. Pang, "Conformer: Convolution-augmented transformer for speech recognition," in Interspeech 2020, 2020, pp. 5036–5040.

  9. [9]

    Adam: A method for stochastic optimization,

    D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," in 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7–9, 2015, Conference Track Proceedings, Y. Bengio and Y. LeCun, Eds., 2015.

  10. [10]

    Attention is all you need,

    A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, "Attention is all you need," in Advances in Neural Information Processing Systems, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, Eds., vol. 30. Curran Associates, Inc., 2017.

  11. [11]

    Audio augmentation for speech recognition,

    T. Ko, V. Peddinti, D. Povey, and S. Khudanpur, "Audio augmentation for speech recognition," in Proc. Interspeech 2015, 2015, pp. 3586–3589.

  12. [12]

    SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition,

    D. S. Park, W. Chan, Y. Zhang, C.-C. Chiu, B. Zoph, E. D. Cubuk, and Q. V. Le, "SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition," in Proc. Interspeech 2019, 2019, pp. 2613–2617.
