Multilingual Word-Level Forced Alignment with Self-Supervised Representations and Learned Dynamic Programming

Joseph Keshet; Meidan Zehavi; Rotem Rousso; Roy Weber

arxiv: 2606.10675 · v1 · pith:MHEASA2Mnew · submitted 2026-06-09 · 💻 cs.CL · eess.AS

Roy Weber, Meidan Zehavi, Rotem Rousso, Joseph Keshet

Pith reviewed 2026-06-27 13:09 UTC · model grok-4.3

keywords multilingual forced alignment · word-level alignment · self-supervised speech representations · dynamic programming decoder · MMS model · UnSupSeg · speech boundary detection

The pith

The proposed encoder-decoder fuses MMS and UnSupSeg representations to produce word alignments that generalize to unseen languages without retraining.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents a method for word-level forced alignment that combines two self-supervised speech representations in a learned encoder and decodes boundaries via dynamic programming. The encoder estimates word-boundary probabilities over long contexts while the decoder integrates segmental features to produce final alignments. Trained only on English corpora, the system is evaluated on three additional languages to test cross-lingual transfer. A sympathetic reader would care because the results suggest that pre-trained multilingual speech models can support alignment at scale across hundreds of languages.

Core claim

The central claim is that an alignment encoder fusing Massively Multilingual Speech model outputs with UnSupSeg phoneme-boundary features, paired with a learned dynamic-programming decoder, yields word boundaries that outperform the Montreal Forced Aligner and MMS-based baselines on TIMIT and Buckeye and remain competitive on Dutch, German, and Hebrew, indicating that the approach can extend to the full set of languages covered by MMS without language-specific retraining.

What carries the argument

The alignment encoder that learns to fuse MMS and UnSupSeg representations into word-boundary probability sequences, together with the learned dynamic programming decoder that combines those probabilities with segmental features to recover the final boundary sequence.

If this is right

The model outperforms Montreal Forced Aligner and MMS-based alignment on the TIMIT and Buckeye English datasets.
On the unseen languages Dutch, German, and Hebrew the model achieves results at or above the level of existing aligners.
The same trained system can be applied directly to any of the 1100+ languages covered by MMS without additional supervised training.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same architecture could be used to bootstrap alignment for low-resource languages where no labeled word boundaries exist.
Replacing the current decoder with a differentiable approximation might allow end-to-end gradient training of the entire pipeline.
Evaluating the method on languages whose phoneme inventories differ sharply from the training set would test the limits of the representation fusion.

Load-bearing premise

The pre-trained MMS and UnSupSeg representations, once fused, already contain enough word-boundary signal to generalize to languages absent from the iterative training on TIMIT and Buckeye.

What would settle it

If the model falls below the performance of MFA or MMS alignment on a fourth language outside the MMS training distribution, the generalization claim would be falsified.

Original abstract

We present a method for accurate multilingual word-level forced alignment, consisting of an alignment encoder and a learned alignment decoder. The encoder integrates two representations: one from the Massively Multilingual Speech (MMS) model and another from a self-supervised phoneme boundary detector (UnSupSeg). It learns to fuse them and to estimate word-boundary probabilities over long temporal contexts. The alignment decoder is a learned dynamic programming that combines encoder outputs with segmental features over the MMS and UnSupSeg representations to infer final word boundaries. Trained iteratively on TIMIT and Buckeye, the proposed approach outperforms Montreal Forced Aligner (MFA) and MMS-based alignment on both datasets. On unseen languages (Dutch, German, and Hebrew), the proposed model achieves performance consistently better than or on par with existing alignment approaches, indicating its potential to scale to 1100+ languages supported by MMS without further training.
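To make the decoding step concrete, the following is a minimal Python sketch of boundary selection by dynamic programming. It is not the authors' decoder: the abstract says the learned decoder also combines segmental features over the MMS and UnSupSeg representations, which are omitted here. The function name `decode_boundaries`, the `min_gap` constraint, and the use of summed log-probabilities as the objective are assumptions made for illustration only.

```python
import math

def decode_boundaries(boundary_prob, num_boundaries, min_gap=1):
    """Pick `num_boundaries` increasing frame indices that maximize the sum
    of log boundary probabilities, with consecutive picks at least
    `min_gap` frames apart. Illustrative only: no segmental features."""
    if num_boundaries < 1 or min_gap < 1:
        raise ValueError("num_boundaries and min_gap must be >= 1")
    n = len(boundary_prob)
    neg_inf = float("-inf")
    log_p = [math.log(max(p, 1e-12)) for p in boundary_prob]
    # best[k][l]: best score with the k-th boundary (0-based) at frame l
    best = [[neg_inf] * n for _ in range(num_boundaries)]
    back = [[-1] * n for _ in range(num_boundaries)]
    best[0] = log_p[:]
    for k in range(1, num_boundaries):
        run_best, run_arg = neg_inf, -1  # max of best[k-1][0 .. l-min_gap]
        for l in range(n):
            j = l - min_gap
            if j >= 0 and best[k - 1][j] > run_best:
                run_best, run_arg = best[k - 1][j], j
            if run_arg >= 0:
                best[k][l] = run_best + log_p[l]
                back[k][l] = run_arg
    last = max(range(n), key=lambda l: best[num_boundaries - 1][l])
    if best[num_boundaries - 1][last] == neg_inf:
        raise ValueError("not enough frames for the requested boundaries")
    # Follow back-pointers from the last boundary to the first.
    path = [last]
    for k in range(num_boundaries - 1, 0, -1):
        path.append(back[k][path[-1]])
    return path[::-1]
```

Because the running maximum is carried along each row, the sketch runs in time proportional to the number of boundaries times the number of frames. In forced alignment the number of boundaries is fixed by the transcript, which is why it is passed in rather than inferred.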

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper fuses MMS and UnSupSeg in a learned encoder plus learned DP decoder, beats baselines on English data, and stays competitive on three unseen languages, but the scaling claim to 1100+ languages rests on thin evidence.

The paper combines MMS and UnSupSeg in a learned encoder that estimates word-boundary probabilities, then runs those through a learned dynamic programming decoder. Trained on TIMIT and Buckeye, it beats standard aligners on those sets and matches or exceeds them on Dutch, German, and Hebrew.

What is new is the fusion step and the replacement of standard DP with a learned version that incorporates segmental features from both representations. The results on the two English datasets are solid, with consistent outperformance. The unseen-language results are the part that matters most for the multilingual claim.

The main concern is whether the learned components actually generalize. The three test languages are a narrow sample, and the training happened only on English. If the decoder is learning English-like boundary patterns, the performance on Hebrew might not predict how it would do on, say, a tonal language or one with very different syllable structure. The paper does not appear to include ablations that isolate the contribution of the learned encoder versus the base MMS features.

The comparisons look fair, with no sign of circular evaluation.

This work is aimed at speech processing researchers who need alignment tools that work across languages without new supervision. Anyone building multilingual datasets or ASR systems for many languages would want to see these numbers.

It is worth sending to peer review. The core idea is straightforward and the cross-language results are worth verifying in detail, even if the scaling claim needs more support.

Referee Report

2 major / 1 minor

Summary. The manuscript presents a multilingual word-level forced alignment method consisting of an alignment encoder that fuses representations from the Massively Multilingual Speech (MMS) model and the self-supervised UnSupSeg phoneme boundary detector to estimate word-boundary probabilities over long contexts, together with a learned dynamic programming decoder that combines these outputs with segmental features. The system is trained iteratively on the English TIMIT and Buckeye corpora, where it outperforms the Montreal Forced Aligner (MFA) and MMS-based baselines. On three unseen languages (Dutch, German, Hebrew) it reports performance that is consistently better than or on par with existing approaches, with the implication that the method can scale to the 1100+ languages supported by MMS without further training.

Significance. If the empirical results and generalization hold, the work offers a practical route to word-level alignment for a large number of languages by leveraging existing self-supervised multilingual representations and adding learned fusion and decoding components. The use of a learned DP decoder and iterative training on boundary probabilities constitutes a clear technical contribution over purely pre-trained or rule-based aligners. The potential for zero-shot transfer is valuable for low-resource speech applications, though its impact depends on the breadth of the supporting evidence.

major comments (2)

[Abstract and unseen-languages results] Abstract and unseen-languages evaluation: the central claim that the approach 'indicates its potential to scale to 1100+ languages supported by MMS without further training' rests on results from only three unseen languages (Dutch, German, Hebrew). These three do not span the typological range of the MMS inventory (e.g., tonal, agglutinative, or phonotactically divergent languages), and no ablation or analysis is described showing that the learned encoder/decoder add language-agnostic signal beyond the already-multilingual MMS features. This directly affects the load-bearing generalization argument.
[Methods and experimental setup] Training and evaluation protocol: the iterative training is performed exclusively on TIMIT and Buckeye (English); the manuscript does not report whether the fusion weights or DP parameters were frozen or adapted when evaluating the unseen languages, nor does it provide quantitative boundary-error metrics, confidence intervals, or error analysis that would allow independent verification of the 'better than or on par' claim.

minor comments (1)

[Abstract] The abstract would be strengthened by including at least one concrete quantitative result (e.g., boundary error rate or F1) for the unseen languages rather than the qualitative statement alone.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. The comments highlight important limitations in the scope of our generalization claims and the clarity of our experimental protocol. We address each point below and indicate the revisions we will make.

Referee: [Abstract and unseen-languages results] Abstract and unseen-languages evaluation: the central claim that the approach 'indicates its potential to scale to 1100+ languages supported by MMS without further training' rests on results from only three unseen languages (Dutch, German, Hebrew). These three do not span the typological range of the MMS inventory (e.g., tonal, agglutinative, or phonotactically divergent languages), and no ablation or analysis is described showing that the learned encoder/decoder add language-agnostic signal beyond the already-multilingual MMS features. This directly affects the load-bearing generalization argument.

Authors: We agree that results on only three languages provide limited support for broad generalization claims and that Dutch, German, and Hebrew do not cover the full typological diversity of the MMS inventory. We will revise the abstract and conclusion to replace 'indicates its potential' with more cautious phrasing such as 'suggests potential for scaling' and will add a limitations paragraph noting the restricted language sample and absence of explicit ablations isolating the learned components' contribution beyond MMS features. The current results still show consistent performance across the tested language families, but we accept that stronger evidence would require additional languages. revision: partial
Referee: [Methods and experimental setup] Training and evaluation protocol: the iterative training is performed exclusively on TIMIT and Buckeye (English); the manuscript does not report whether the fusion weights or DP parameters were frozen or adapted when evaluating the unseen languages, nor does it provide quantitative boundary-error metrics, confidence intervals, or error analysis that would allow independent verification of the 'better than or on par' claim.

Authors: The fusion weights and DP decoder parameters remained frozen for the unseen-language evaluations to demonstrate zero-shot transfer. We will add an explicit statement to this effect in the methods section. We will also incorporate quantitative boundary-error metrics, confidence intervals on the reported scores, and a concise error analysis into the revised results section or appendix to support independent verification. revision: yes
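As an illustration of the kind of confidence interval the referee asks for, here is a minimal percentile-bootstrap sketch in Python. It is an assumption of this note, not something the paper or the rebuttal describes: the function name `bootstrap_ci`, the per-boundary 0/1 `hits` input, and the resampling unit are all hypothetical. Resampling individual boundaries ignores correlation within an utterance or speaker, so resampling whole utterances may be the more defensible choice in practice.

```python
import random

def bootstrap_ci(hits, num_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for an accuracy in percent.

    hits: list of 0/1 flags, one per boundary (1 = within tolerance).
    Returns (low, high) bounds of the (1 - alpha) interval."""
    if not hits:
        raise ValueError("hits must be non-empty")
    rng = random.Random(seed)
    n = len(hits)
    stats = []
    for _ in range(num_resamples):
        sample = rng.choices(hits, k=n)  # resample with replacement
        stats.append(100.0 * sum(sample) / n)
    stats.sort()
    lo_idx = int((alpha / 2) * num_resamples)
    hi_idx = max(lo_idx, int((1 - alpha / 2) * num_resamples) - 1)
    return stats[lo_idx], stats[hi_idx]
```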

Circularity Check

0 steps flagged

No circularity; empirical method with direct test-set comparisons

The paper describes an encoder-decoder architecture trained iteratively on TIMIT and Buckeye (English) and evaluated via standard performance metrics on held-out portions of those corpora plus three additional unseen languages. No equations, parameters, or claims are shown to reduce to their own inputs by construction, no fitted quantities are relabeled as predictions, and no load-bearing steps rely on self-citations or imported uniqueness results. The generalization statement to 1100+ languages is an empirical extrapolation rather than a formal derivation, leaving the reported results self-contained against external baselines.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

Based on abstract only; the approach rests on the effectiveness of two external pre-trained models and the assumption that iterative training on English data transfers. No new free parameters or invented entities are explicitly introduced in the abstract.

axioms (2)

domain assumption: MMS and UnSupSeg representations contain usable word-boundary information that can be fused for alignment
The encoder is built on the premise that these two self-supervised models supply complementary signals for word boundaries.
domain assumption: Learned dynamic programming can be trained to produce accurate alignments from encoder outputs and segmental features
The decoder stage assumes the learned DP component improves upon standard dynamic programming.

Reference graph

Works this paper leans on

27 extracted references · 4 canonical work pages · 2 internal anchors

[1]

Introduction. Accurate word-level forced alignment is a fundamental component in speech and language processing. Precise temporal alignment between audio signals and textual transcriptions enables fine-grained analysis in linguistics, including the study of phonetics, phonology, prosody, and dialectal variation across languages. Beyond linguistic r...
[2]

The paper concludes with a comprehensive empirical evaluation

has established itself as one of the leading toolkits for word- and phoneme-level alignment, consistently ranking among the top-performing systems in recent evaluations [4]. The paper concludes with a comprehensive empirical evaluation. We first detail the hyperparameter tuning and model selection procedure, and then report results on multiple manuall...
[3]

Multilingual Word-Level Forced Alignment with Self-Supervised Representations and Learned Dynamic Programming

Method. We assume that a waveform consisting of T samples is transformed into a sequence of L frames, with the frame duration of 10 msec. The speech utterance is represented as X = (x_1, ..., x_L), where each frame x_l ∈ R^d for 1 ≤ l ≤ L is a d-dimensional feature vector, and thus X ∈ R^{L×d}. Let w = (w_1, ..., w_K) denote the sequence of words in the utterance, where K is ...

2026
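The framing in the Method excerpt can be sketched in a few lines of Python. The 10 msec frame duration comes from the excerpt; the 16 kHz sample rate, the flooring of partial frames, and the function names are assumptions made here for illustration and are not stated in the source.

```python
FRAME_MS = 10  # frame duration given in the Method excerpt

def num_frames(num_samples, sample_rate=16000):
    """Number of whole 10 msec frames L in a waveform of T samples.
    The 16 kHz default and the floor division are assumptions."""
    samples_per_frame = sample_rate * FRAME_MS // 1000
    return num_samples // samples_per_frame

def frame_to_seconds(frame_index):
    """Start time in seconds of a frame, given its index."""
    return frame_index * FRAME_MS / 1000.0
```

Under these assumptions a 3-second waveform at 16 kHz (48000 samples) maps to 300 frames, and a word boundary predicted at frame 250 corresponds to 2.5 seconds.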
[4]

Datasets We trained and evaluated the proposed method on the TIMIT

Experimental Results 3.1. Datasets We trained and evaluated the proposed method on the TIMIT
[5]

These datasets provide manually aligned phonetic and orthographic transcriptions for read speech (5.1 hours) and conversational speech (40 hours), respectively

and Buckeye [7] speech corpora. These datasets provide manually aligned phonetic and orthographic transcriptions for read speech (5.1 hours) and conversational speech (40 hours), respectively. For each corpus, the data were partitioned at the speaker level into training, validation, and test sets using an 80/10/10 split. We evaluate the model on TIMIT and...
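A speaker-level split of the kind described above could look like the following Python sketch. The excerpt only states that partitioning was done at the speaker level with an 80/10/10 split; whether the proportions apply to speakers, utterances, or hours is not stated, so applying them to the speaker count is an assumption, as are the function name and the `(speaker_id, utterance_id)` input format.

```python
import random

def split_by_speaker(utterances, train=0.8, valid=0.1, seed=0):
    """Split (speaker_id, utterance_id) pairs so that no speaker appears
    in more than one partition. Fractions apply to the speaker count."""
    speakers = sorted({speaker for speaker, _ in utterances})
    random.Random(seed).shuffle(speakers)
    n_train = round(train * len(speakers))
    n_valid = round(valid * len(speakers))
    assignment = {}
    for i, speaker in enumerate(speakers):
        if i < n_train:
            assignment[speaker] = "train"
        elif i < n_train + n_valid:
            assignment[speaker] = "valid"
        else:
            assignment[speaker] = "test"
    parts = {"train": [], "valid": [], "test": []}
    for speaker, utt in utterances:
        parts[assignment[speaker]].append((speaker, utt))
    return parts
```

Keeping each speaker in a single partition prevents a model from being scored on voices it has already heard during training.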
[6]

and MFA [5] on the Hebrew, German - PHONDAT, and Dutch - IFA Corpus datasets

Alignment accuracy [%]

Dataset             Model   t≤10   t≤25   t≤50   t≤100
Hebrew              MMS     14.3   41.3   76.5   94.7
Hebrew              MWA     39.7   61.1   73.6   81.4
Dutch - IFA Corpus  MFA      4.7    7.3   11.6   19
Dutch - IFA Corpus  MMS     16     37.9   62.9   76.6
Dutch - IFA Corpus  MWA     29     48.4   65.3   76.5
German - PHONDAT    MFA     29.9   65.4   82.1   94.3
German - PHONDAT    MMS     21.8   44.3   74.9   91.8
German - PHONDAT    MWA     32.8   64.2   84.7   93.5
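The accuracy columns in the table above can be read as the share of predicted boundaries that fall within a tolerance t of the manual reference. The Python sketch below computes such a score. It rests on two assumptions not confirmed by the excerpt: that t is measured in milliseconds, and that predicted and reference boundaries are paired one-to-one in order. The function name `accuracy_within` is hypothetical.

```python
def accuracy_within(pred_ms, ref_ms, tolerances_ms=(10, 25, 50, 100)):
    """Percentage of predicted boundaries within each tolerance of the
    reference boundary at the same position. Times are in milliseconds
    (an assumption about the table's units)."""
    if len(pred_ms) != len(ref_ms):
        raise ValueError("pred_ms and ref_ms must have the same length")
    if not ref_ms:
        raise ValueError("no boundaries to score")
    errors = [abs(p - r) for p, r in zip(pred_ms, ref_ms)]
    return {
        t: 100.0 * sum(e <= t for e in errors) / len(errors)
        for t in tolerances_ms
    }
```

For example, predictions at 100, 210, 330 and 480 ms against references at 105, 200, 300 and 400 ms give absolute errors of 5, 10, 30 and 80 ms, so the score is 50% at t≤10 and t≤25, 75% at t≤50, and 100% at t≤100.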
[7]

It does not rely on phonemes and therefore eliminates the need for G2P conversions

Discussion We proposed a method for accurate word alignment based on the MMS and an accurate self-supervised phoneme boundary representation (UnSupSeg). It does not rely on phonemes and therefore eliminates the need for G2P conversions. It proposes a potential replacement for MFA that is based on an HMM-GMM constraint model with G2P. We demonstrate the ef...
[8]

2219843 and BSF Grant No

Acknowledgments This work was supported by NSF DRL Grant No. 2219843 and BSF Grant No. 2022618. We also thank Rob van Son for his guidance and support with the IFA Corpus
[9]

They did not contribute to the scientific content, analysis, or core writing of the paper, and no AI system is listed as a co-author

Generative AI Use Disclosure Generative AI tools were used solely for language editing and manuscript polishing. They did not contribute to the scientific content, analysis, or core writing of the paper, and no AI system is listed as a co-author
[10]

wav2vec 2.0: a framework for self-supervised learning of speech representations,

A. Baevski, H. Zhou, A. Mohamed, and M. Auli, “wav2vec 2.0: a framework for self-supervised learning of speech representations,” in Proceedings of the 34th International Conference on Neural Information Processing Systems (NeurIPS), 2020

2020
[11]

Hubert: Self-supervised speech representation learning by masked prediction of hidden units,

W.-N. Hsu, B. Bolte, Y.-H. H. Tsai, K. Lakhotia, R. Salakhutdinov, and A. Mohamed, “Hubert: Self-supervised speech representation learning by masked prediction of hidden units,” IEEE/ACM transactions on audio, speech, and language processing, vol. 29, pp. 3451–3460, 2021

2021
[12]

Robust Speech Recognition via Large-Scale Weak Supervision

A. Radford, J. W. Kim, T. Xu, G. Brockman, C. McLeavey, and I. Sutskever, “Robust speech recognition via large-scale weak supervision,” arXiv preprint arXiv:2212.04356, 2022

2022
[13]

Tradition or innovation: A comparison of modern ASR methods for forced alignment,

R. Rousso, E. Cohen, J. Keshet, and E. Chodroff, “Tradition or innovation: A comparison of modern ASR methods for forced alignment,” in The 25th Annual Conference of the International Speech Communication Association (Interspeech), 2024

2024
[14]

Montreal Forced Aligner: Trainable text-speech alignment using Kaldi,

M. McAuliffe, M. Socolof, S. Mihuc, M. Wagner, and M. Sonderegger, “Montreal Forced Aligner: Trainable text-speech alignment using Kaldi,” in Proceedings of the 18th Annual Conference of the International Speech Communication Association (Interspeech), Aug. 2017, pp. 498–502

2017
[15]

DARPA TIMIT acoustic-phonetic continuous speech corpus CD-ROM. NIST speech disc 1-1.1,

J. S. Garofolo, L. F. Lamel, W. M. Fisher, J. G. Fiscus, and D. S. Pallett, “DARPA TIMIT acoustic-phonetic continuous speech corpus CD-ROM. NIST speech disc 1-1.1,” NASA, Tech. Rep. 93, Feb. 1993

1993
[16]

The Buckeye corpus of conversational speech: labeling conventions and a test of transcriber reliability,

M. A. Pitt, K. Johnson, E. Hume, S. Kiesling, and W. Raymond, “The Buckeye corpus of conversational speech: labeling conventions and a test of transcriber reliability,” Speech Communication, vol. 45, no. 1, pp. 89–95, Jan. 2005

2005
[17]

Automatic tools for analyzing spoken hebrew,

A. Ben-Shalom, D. Modan, A. Laufer, and J. Keshet, “Automatic tools for analyzing spoken hebrew,” in The 2014 Afeka Conference for Speech Processing, 2014

2014
[18]

The IFA corpus: A phonemically segmented Dutch open source speech database,

R. V. Son, D. Binnenpoorte, H. van den Heuvel, and L. Pols, “The IFA corpus: A phonemically segmented Dutch open source speech database,” in Proc. EUROSPEECH 2001, Aalborg, Denmark, vol. 3, 2001, pp. 2051–2054. [Online]. Available: https://zenodo.org/records/14904090

2001
[19]

Theoretical principles concerning segmentation, labelling strategies and levels of categorical annotation for spoken language database systems,

H. G. Tillmann and B. Pompino-Marschall, “Theoretical principles concerning segmentation, labelling strategies and levels of categorical annotation for spoken language database systems,” in Proc. Eurospeech 1993, 1993, pp. 1691–1694

1993
[20]

Self-supervised contrastive learning for unsupervised phoneme segmentation,

F. Kreuk, J. Keshet, and Y. Adi, “Self-supervised contrastive learning for unsupervised phoneme segmentation,” in Proceedings of the 21st Annual Conference of the International Speech Communication Association (Interspeech), 2020

2020
[21]

Scaling speech technology to 1,000+ languages,

V. Pratap, A. Tjandra, B. Shi, P. Tomasello, A. Babu, S. Kundu, A. Elkahky, Z. Ni, A. Vyas, M. Fazel-Zarandi et al., “Scaling speech technology to 1,000+ languages,” Journal of Machine Learning Research, vol. 25, no. 97, pp. 1–52, 2024

2024
[22]

Very deep convolutional networks for large-scale image recognition,

K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” in Proceedings of the International Conference on Learning Representations (ICLR), 2015

2015
[23]

Conformer: Convolution-augmented transformer for speech recognition,

A. Gulati, J. Qin, C.-C. Chiu, N. Parmar, Y. Zhang, J. Yu, W. Han, S. Wang, Z. Zhang, Y. Wu, and R. Pang, “Conformer: Convolution-augmented transformer for speech recognition,” in Proceedings of the 21st Annual Conference of the International Speech Communication Association (Interspeech), 2020

2020
[24]

Focal loss for dense object detection,

T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár, “Focal loss for dense object detection,” in Proceedings of the IEEE international conference on computer vision, 2017, pp. 2980–2988

2017
[25]

A large margin algorithm for speech-to-phoneme and music-to-score alignment,

J. Keshet, S. Shalev-Shwartz, Y. Singer, and D. Chazan, “A large margin algorithm for speech-to-phoneme and music-to-score alignment,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 15, no. 8, pp. 2373–2382, 2007

2007
[26]

WhisperX: Time-accurate speech transcription of long-form audio,

M. Bain, J. Huh, T. Han, and A. Zisserman, “WhisperX: Time-accurate speech transcription of long-form audio,” in Proceedings of the 24th Annual Conference of the International Speech Communication Association (Interspeech), 2023

2023
[27]

Canary-1b-v2 & parakeet-tdt-0.6b-v3: Efficient and high-performance models for multilingual asr and ast,

M. Sekoyan, N. R. Koluguri, N. Tadevosyan, P. Zelasko, T. Bartley, N. Karpov, J. Balam, and B. Ginsburg, “Canary-1b-v2 & parakeet-tdt-0.6b-v3: Efficient and high-performance models for multilingual asr and ast,” arXiv preprint arXiv:2509.14128, 2025

2025
