Multi-Axis Speech Similarity via Factor-Partitioned Embeddings
Recognition: 2 Lean theorem links
Pith reviewed 2026-05-11 00:43 UTC · model grok-4.3
The pith
Speech utterances can be mapped to a single vector whose subspaces hold separate attributes like content and speaker, so similarity can be tuned to include or exclude specific axes.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We present a factor-partitioned embedding framework that maps each utterance into a single vector whose subspaces correspond to distinct axes of variation. A shared acoustic encoder feeds per-axis linear projection heads, each trained via distillation from a specialist teacher or a contrastive objective over shared-label pairs. The resulting embeddings support attribute-conditioned retrieval by computing similarity as a signed weighted sum over per-axis cosine scores.
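The signed weighted sum over per-axis cosine scores can be sketched in a few lines. The axis names, subspace layout, and weights below are illustrative placeholders, not the paper's actual configuration:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def axis_similarity(x, y, axes, weights):
    """Signed weighted sum of per-axis cosine scores.

    x, y    : full embeddings (concatenation of all subspaces)
    axes    : dict mapping axis name -> (start, end) slice into the vector
    weights : dict mapping axis name -> signed weight (negative = suppress)
    """
    return sum(
        weights[name] * cosine(x[s:e], y[s:e])
        for name, (s, e) in axes.items()
    )

# Hypothetical layout: semantic subspace in dims 0-3, speaker in dims 4-7.
axes = {"semantic": (0, 4), "speaker": (4, 8)}
query     = [1.0, 0.0, 0.0, 0.0,  1.0, 0.0, 0.0, 0.0]
same_spk  = [0.0, 1.0, 0.0, 0.0,  1.0, 0.0, 0.0, 0.0]  # same speaker, different content
same_text = [1.0, 0.0, 0.0, 0.0,  0.0, 1.0, 0.0, 0.0]  # different speaker, same content

# A negative speaker weight suppresses same-speaker bias:
w = {"semantic": 1.0, "speaker": -0.5}
assert axis_similarity(query, same_text, axes, w) > axis_similarity(query, same_spk, axes, w)
```

With the negative speaker weight, the content-matched utterance outranks the same-speaker distractor; flipping the weight positive would invert that ordering without re-embedding anything.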
What carries the argument
Factor-partitioned embeddings: one vector whose subspaces are produced by per-axis linear projection heads on a shared acoustic encoder, each head trained to isolate a single attribute.
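The construction (shared encoder output, per-axis linear heads, concatenation into one vector) can be sketched without any ML framework. The dimensions and random weights are illustrative assumptions; the paper's pipeline uses a pooled acoustic encoder (e.g., WavLM) in place of the toy input vector:

```python
import random

random.seed(0)

DIM_IN = 16  # pooled encoder output size (illustrative)
AXES = {"semantic": 8, "speaker": 4, "dialect": 4}  # per-axis subspace sizes (illustrative)

# One linear head per axis: a DIM_IN x dim_out weight matrix.
heads = {
    name: [[random.gauss(0, 0.1) for _ in range(dim)] for _ in range(DIM_IN)]
    for name, dim in AXES.items()
}

def project(pooled, W):
    """Apply one linear head: pooled (DIM_IN) times W (DIM_IN x dim_out)."""
    return [sum(pooled[i] * W[i][j] for i in range(len(W))) for j in range(len(W[0]))]

def embed(pooled):
    """Concatenate per-axis projections into one factor-partitioned vector."""
    out = []
    for name in AXES:  # fixed axis order defines the subspace layout
        out.extend(project(pooled, heads[name]))
    return out

utterance = [random.gauss(0, 1) for _ in range(DIM_IN)]
vec = embed(utterance)
assert len(vec) == sum(AXES.values())  # 16 dims: 8 + 4 + 4
```

Because each axis occupies a fixed, known slice of the concatenated vector, downstream similarity code can address subspaces by index without any metadata beyond the axis layout.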
If this is right
- Retrieval can suppress same-speaker bias while matching semantic content across corpora.
- Users can choose to emphasize style or dialect by weighting the corresponding axis higher.
- Cross-condition searches become possible by down-weighting axes tied to recording differences.
- The same vector supports both joint and selective attribute matching without retraining.
Where Pith is reading between the lines
- The approach could be tested on longer or noisier speech to check if subspace isolation holds under more variable conditions.
- Similar partitioning might be applied to other sequential data like text or music to allow attribute-controlled search.
- If subspaces prove stable, the method could support incremental addition of new axes without rebuilding the full embedding.
Load-bearing premise
Training separate linear heads via distillation or contrastive objectives on shared-label pairs will keep each subspace aligned to only its intended attribute, with little leakage into other axes.
What would settle it
An experiment that measures whether cosine similarity in one subspace still correlates strongly with an attribute assigned to a different subspace after training.
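That settling experiment can be sketched directly: correlate cosine similarity inside one subspace with agreement on an attribute assigned to a different subspace; a near-zero correlation indicates little leakage. The subspace layout, labels, and toy pairs below are illustrative, not the paper's data:

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def leakage(pairs, subspace, off_axis_label):
    """Correlation between cosine in `subspace` and agreement on an attribute
    that is supposed to live in a *different* subspace.

    pairs: list of (emb_a, emb_b, labels_a, labels_b).
    A correlation near zero suggests little cross-axis leakage.
    """
    s, e = subspace
    sims = [cosine(a[s:e], b[s:e]) for a, b, _, _ in pairs]
    agree = [1.0 if la[off_axis_label] == lb[off_axis_label] else 0.0
             for _, _, la, lb in pairs]
    return pearson(sims, agree)

# Toy 4-dim embeddings: content in dims 0-1, speaker in dims 2-3.
# Speaker-subspace similarity is constructed to be independent of content agreement.
a = [1, 0, 1, 0]
pairs = [
    (a, [1, 0, 1, 0], {"content": "x"}, {"content": "x"}),  # spk-sim 1, content agree
    (a, [0, 1, 1, 0], {"content": "x"}, {"content": "y"}),  # spk-sim 1, content differ
    (a, [1, 0, 0, 1], {"content": "x"}, {"content": "x"}),  # spk-sim 0, content agree
    (a, [0, 1, 0, 1], {"content": "x"}, {"content": "y"}),  # spk-sim 0, content differ
]
assert abs(leakage(pairs, (2, 4), "content")) < 1e-9  # no leakage in this toy setup
```

On real embeddings one would run this over held-out pairs for every (subspace, off-axis attribute) combination and report the full correlation matrix.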
read the original abstract
Speech encodes multiple simultaneous attributes -- linguistic content, speaker identity, dialect, gender -- that conventional single-vector embeddings conflate. We present a factor-partitioned embedding framework that maps each utterance into a single vector whose subspaces correspond to distinct axes of variation. A shared acoustic encoder feeds per-axis linear projection heads, each trained via distillation from a specialist teacher or a contrastive objective over shared-label pairs. The resulting embeddings support attribute-conditioned retrieval: similarity is computed as a signed weighted sum over per-axis cosine scores, allowing retrieval that jointly considers what was said and how -- or explicitly suppresses one attribute to surface another. We evaluate on cross-corpus retrieval over corpora sharing the Harvard sentence prompts, demonstrating that signed axis weighting can suppress same-speaker bias and surface semantically matched utterances across recording conditions. Code is available at: https://github.com/jimregan/spoken-sentence-transformers
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes a factor-partitioned embedding framework for speech utterances. A shared acoustic encoder produces features that are fed to independent linear projection heads, each trained via distillation from a specialist teacher or contrastive loss over shared-label pairs, to create subspaces corresponding to distinct attributes (e.g., linguistic content, speaker identity). Similarity is computed as a signed weighted sum of per-axis cosine scores, enabling retrieval that can jointly consider or explicitly suppress specific attributes. The approach is evaluated via cross-corpus retrieval on corpora sharing Harvard sentence prompts, with the claim that signed axis weighting suppresses same-speaker bias while surfacing semantically matched utterances across conditions.
Significance. If the subspaces reliably isolate attributes without cross-axis leakage, the framework offers a compact, single-vector representation that supports flexible, attribute-conditioned speech retrieval and analysis. This could be useful for tasks requiring content matching independent of speaker or recording conditions. The open code repository is a positive factor for reproducibility.
major comments (2)
- [Evaluation] The evaluation description (cross-corpus retrieval over Harvard-sentence corpora) is high-level and provides no quantitative metrics, baselines, error analysis, or statistical details. Without these, it is not possible to verify whether the signed weighted cosine sum actually achieves the claimed attribute suppression or separation.
- [Method] The central construction relies on per-axis linear projection heads producing subspaces that correspond to distinct attributes without leakage. If the shared encoder produces non-linearly entangled features (common in acoustic models), the linear heads cannot isolate axes, so the signed weighted sum will still mix attributes when one is down-weighted. This assumption is load-bearing for the retrieval claims but receives no empirical test or discussion.
minor comments (1)
- [Abstract] The abstract states that the method 'demonstrates' bias suppression but does not preview any specific metrics or effect sizes.
Simulated Author's Rebuttal
We thank the referee for their constructive comments. We address each major comment below and indicate revisions to the manuscript.
read point-by-point responses
- Referee: [Evaluation] The evaluation description (cross-corpus retrieval over Harvard-sentence corpora) is high-level and provides no quantitative metrics, baselines, error analysis, or statistical details. Without these, it is not possible to verify whether the signed weighted cosine sum actually achieves the claimed attribute suppression or separation.
  Authors: We agree the original evaluation was high-level. The revised manuscript adds quantitative retrieval metrics (MRR and P@10 for content-matched retrieval under speaker suppression), comparisons to single-vector baselines (e.g., wav2vec 2.0 and HuBERT), error analysis of residual same-speaker retrievals, and bootstrap confidence intervals with paired significance tests. revision: yes
- Referee: [Method] The central construction relies on per-axis linear projection heads producing subspaces that correspond to distinct attributes without leakage. If the shared encoder produces non-linearly entangled features (common in acoustic models), the linear heads cannot isolate axes, so the signed weighted sum will still mix attributes when one is down-weighted. This assumption is load-bearing for the retrieval claims but receives no empirical test or discussion.
  Authors: The concern is valid: linear heads cannot guarantee isolation if the encoder features remain entangled. Our training objectives (distillation and contrastive losses) are intended to drive subspace specialization, but we have added a dedicated discussion of potential cross-axis leakage together with an empirical check of inter-subspace cosine correlations on held-out data to quantify the degree of separation achieved. revision: partial
Circularity Check
No significant circularity; framework is a novel construction with external objectives
full rationale
The paper introduces a factor-partitioned embedding framework via a shared acoustic encoder and per-axis linear projection heads trained with distillation or contrastive objectives on shared-label pairs. No equations, derivations, or self-citations are presented that reduce the claimed attribute isolation or signed weighted similarity to fitted parameters by construction or to prior self-referential results. The evaluation on cross-corpus retrieval over Harvard sentence corpora supplies an external benchmark, keeping the central claims independent of the framework's own construction.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Distinct speech attributes can be isolated into separate linear subspaces of a shared embedding vector through distillation or contrastive training.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · tag: unclear
  Relation between the paper passage and the cited Recognition theorem is unclear.
  Passage: "A shared acoustic encoder feeds per-axis linear projection heads, each trained via distillation from a specialist teacher or a contrastive objective over shared-label pairs. ... similarity is computed as a signed weighted sum over per-axis cosine scores"
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · tag: unclear
  Relation between the paper passage and the cited Recognition theorem is unclear.
  Passage: "The resulting embeddings support attribute-conditioned retrieval"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1] "Introduction: Speech, in contrast with text, is inherently variable. We never say the same thing the same way twice: different speakers say the same thing differently, and even the same speaker says the same thing differently at different times, due to a variety of factors such as mood, context, environment, and communicative intent. Beyond linguistic co..."
- [3] "extend content/timbre separation to three factors by adding prosody, using a VAE with restricted channel capacity to force disentanglement without explicit supervision. Their approach is generative and unsupervised; ours is discriminative and supervision-guided, using pre-trained teacher models to provide independent training signal for each axis. Th..."
- [4] Representation Learning with Contrastive Predictive Coding
  "Method 3.1. Architecture: The model follows the SentenceTransformers pipeline: 1. Acoustic encoder: a frozen or fine-tuned HuggingFace audio model (default: WavLM-base-plus [6]) maps raw waveform frames to a sequence of hidden states. 2. Pooling: mean pooling over the frame sequence produces a single vector. 3. Multi-axis projection: a set of linear projectio..."
- [5] "Evaluation: cross-corpus retrieval. Existing evaluation frameworks do not directly measure multi-axis similarity. Benchmarks such as SUPERB [8] evaluate speech embeddings under individual task metrics (e.g., speaker verification EER, content WER), but not controllable combinations. MSEB [35] provides large-scale retrieval evaluation across diverse sou..."
- [6] "In-domain semantic matching: Recovering missing labels for VCTK speaker “p315” by searching an index of VCTK speakers with labels"
- [7] "Out-of-domain semantic matching: Evaluating cross-corpus retrieval (rehasp→OSR); 20 of the sentences are common to both sets"
- [8] "Out-of-domain semantic matching: Using negative weighting on the speaker axis to force the model to ignore “easy” same-speaker matches in favour of “hard” cross-speaker semantic matches. Because the goal is to support controllable similarity rather than optimise a single metric, evaluation focuses on whether axis weighting changes retrieval behaviour ..."
- [9] "Results and observations: We evaluate eight model variants in a controlled ablation. All models use semantic:384 (matching the MiniLM teacher exactly, eliminating any alignment matrix on the semantic axis), except for the PCA baseline which uses semantic:256 targets derived by PCA projection of the MiniLM embedding space. No gender axis is included in any va..."
- [10] "Conclusion: We presented a factor-partitioned embedding framework for speech that supports similarity search across multiple attribute axes simultaneously. Axis-specific projection heads are trained via teacher distillation or contrastive objectives over shared-label pairs, producing a single concatenated embedding that can be queried with signed per-axis..."
- [11] F. Nolan, "Can the definition of each speaker be expected to come from the laboratory in the next decades?" in Proceedings of the XIIIth International Congress of Phonetic Sciences, vol. 3, Stockholm, Sweden, Aug. 1995, pp. 130–137.
- [12] J. Kominek and A. W. Black, "The CMU Arctic speech databases," in 5th ISCA Workshop on Speech Synthesis (SSW 5), 2004, pp. 223–224.
- [13] C. Veaux, J. Yamagishi, and S. King, "The voice bank corpus: Design, collection and data analysis of a large regional accent speech database," in 2013 International Conference Oriental COCOSDA held jointly with 2013 Conference on Asian Spoken Language Research and Evaluation (O-COCOSDA/CASLRE), 2013, pp. 1–4.
- [14] A. Baevski, H. Zhou, A. Mohamed, and M. Auli, "wav2vec 2.0: A framework for self-supervised learning of speech representations," in Proceedings of the 34th International Conference on Neural Information Processing Systems (NIPS '20), Red Hook, NY, USA: Curran Associates Inc., 2020.
- [15] W.-N. Hsu, B. Bolte, Y.-H. H. Tsai, K. Lakhotia, R. Salakhutdinov, and A. Mohamed, "HuBERT: Self-supervised speech representation learning by masked prediction of hidden units," IEEE/ACM Transactions on Audio, Speech and Language Processing, vol. 29, pp. 3451–3460, Oct. 2021. Available: https://doi.org/10.1109/TASLP.2021.3122291
- [16] S. Chen, C. Wang, Z. Chen, Y. Wu, S. Liu, Z. Chen, J. Li, N. Kanda, T. Yoshioka, X. Xiao, J. Wu, L. Zhou, S. Ren, Y. Qian, Y. Qian, J. Wu, M. Zeng, X. Yu, and F. Wei, "WavLM: Large-scale self-supervised pre-training for full stack speech processing," IEEE Journal of Selected Topics in Signal Processing, vol. 16, no. 6, pp. 1505–1518, 2022.
- [17] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, "BERT: Pre-training of deep bidirectional transformers for language understanding," in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), J. Burstein, C. Doran, and T. Solorio, Eds., 2019.
- [18] S.-W. Yang, P.-H. Chi, Y.-S. Chuang, C.-I. J. Lai, K. Lakhotia, Y. Y. Lin, A. T. Liu, J. Shi, X. Chang, G.-T. Lin, T.-H. Huang, W.-C. Tseng, K.-T. Lee, D.-R. Liu, Z. Huang, S. Dong, S.-W. Li, S. Watanabe, A. Mohamed, and H.-Y. Lee, "SUPERB: Speech Processing Universal PERformance Benchmark," in Interspeech 2021, 2021, pp. 1194–1198.
- [19] D. Snyder, D. Garcia-Romero, G. Sell, D. Povey, and S. Khudanpur, "X-Vectors: Robust DNN embeddings for speaker recognition," in Proc. ICASSP 2018, 2018, pp. 5329–5333.
- [20] H. Kamper, W. Wang, and K. Livescu, "Deep convolutional acoustic word embeddings using word-pair side information," in Proc. ICASSP 2016, 2016, pp. 4950–4954.
- [21] W. He, W. Wang, and K. Livescu, "Multi-view recurrent neural acoustic word embeddings," in International Conference on Learning Representations, 2017. Available: https://openreview.net/forum?id=rJxDkvqee
- [22] Y.-A. Chung, C.-C. Wu, C.-H. Shen, H.-Y. Lee, and L.-S. Lee, "Audio Word2Vec: Unsupervised learning of audio segment representations using sequence-to-sequence autoencoder," in Interspeech 2016, 2016, pp. 765–769.
- [23] N. Reimers and I. Gurevych, "Sentence-BERT: Sentence embeddings using Siamese BERT-networks," in Proc. EMNLP-IJCNLP 2019, Hong Kong, China: Association for Computational Linguistics, Nov. 2019, pp. 3982–3992. Available: https://aclanthology.org/D19-1410/
- [24] A. Veit, S. Belongie, and T. Karaletsos, "Conditional similarity networks," in 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 1781–1789.
- [25] C.-J. Lin, G.-T. Lin, Y.-S. Chuang, W.-L. Wu, S.-W. Li, A. Mohamed, H.-Y. Lee, and L.-S. Lee, "SpeechDPR: End-to-end spoken passage retrieval for open-domain spoken question answering," in Proc. ICASSP 2024, 2024, pp. 12476–12480.
- [26] D. J. Min, K. Mundnich, A. Lapastora, E. Soltanmohammadi, S. Ronanki, and K. Han, "Speech retrieval-augmented generation without automatic speech recognition," in Proc. ICASSP 2025, 2025, pp. 1–5.
- [27] B. Elizalde, S. Deshmukh, M. A. Ismail, and H. Wang, "CLAP: Learning audio concepts from natural language supervision," in Proc. ICASSP 2023, 2023, pp. 1–5.
- [28] W.-N. Hsu, Y. Zhang, and J. Glass, "Unsupervised learning of disentangled and interpretable representations from sequential data," in Proceedings of the 31st International Conference on Neural Information Processing Systems (NIPS '17), Red Hook, NY, USA: Curran Associates Inc., 2017, pp. 1876–1887.
- [29] K. Qian, Y. Zhang, S. Chang, D. Cox, and M. Hasegawa-Johnson, "Unsupervised speech decomposition via triple information bottleneck," in Proceedings of the 37th International Conference on Machine Learning (ICML '20), JMLR.org, 2020.
- [30] K. Qian, Y. Zhang, H. Gao, J. Ni, C.-I. Lai, D. Cox, M. Hasegawa-Johnson, and S. Chang, "ContentVec: An improved self-supervised speech representation by disentangling speakers," in Proceedings of the 39th International Conference on Machine Learning, Proceedings of Machine Learning Research, 2022.
- [31] H. Lu, X. Wu, Z. Wu, and H. Meng, "SpeechTripleNet: End-to-end disentangled speech representation learning for content, timbre and prosody," in Proceedings of the 31st ACM International Conference on Multimedia (MM '23), New York, NY, USA: Association for Computing Machinery, 2023, pp. 2829–2837. Available: https://doi.org/10.1145/3581783.3612485
- [32] I. Higgins, L. Matthey, A. Pal, C. Burgess, X. Glorot, M. Botvinick, S. Mohamed, and A. Lerchner, "beta-VAE: Learning basic visual concepts with a constrained variational framework," in International Conference on Learning Representations, 2017. Available: https://openreview.net/forum?id=Sy2fzU9gl
- [33] Y. Deng, H. Tang, X. Zhang, N. Cheng, J. Xiao, and J. Wang, "Learning disentangled speech representations with contrastive learning and time-invariant retrieval," in Proc. ICASSP 2024, 2024, pp. 10271–10275.
- [34] X. Zhang, D. Zhang, S. Li, Y. Zhou, and X. Qiu, "SpeechTokenizer: Unified speech tokenizer for speech language models," in The Twelfth International Conference on Learning Representations, 2024. Available: https://openreview.net/forum?id=AF9Q8Vip84
- [35] A. Singh, K. Demuynck, and V. Arora, "BEST-STD: Bidirectional mamba-enhanced speech tokenization for spoken term detection," in Proc. ICASSP 2025, 2025, pp. 1–5.
- [36] W. Wang, F. Wei, L. Dong, H. Bao, N. Yang, and M. Zhou, "MiniLM: Deep self-attention distillation for task-agnostic compression of pre-trained transformers," in Proceedings of the 34th International Conference on Neural Information Processing Systems (NIPS '20), Red Hook, NY, USA: Curran Associates Inc., 2020.
- [37] J. Yamagishi, C. Veaux, and K. MacDonald, "CSTR VCTK Corpus: English multi-speaker corpus for CSTR voice cloning toolkit (version 0.92)," University of Edinburgh, The Centre for Speech Technology Research (CSTR), 2019.
- [38] I. Demirsahin, O. Kjartansson, A. Gutkin, and C. Rivera, "Open-source multi-speaker corpora of the English accents in the British Isles," in Proceedings of The 12th Language Resources and Evaluation Conference (LREC), Marseille, France: European Language Resources Association (ELRA), May 2020, pp. 6532–
- [39] [Online]. Available: https://www.aclweb.org/anthology/2020.lrec-1.804
- [40] IEEE, "IEEE recommended practice for speech quality measurements," IEEE Transactions on Audio and Electroacoustics, vol. 17, no. 3, pp. 225–246, 1969.
- [41] G. E. Henter, T. Merritt, M. Shannon, C. Mayo, and S. King, "Measuring the perceptual effects of modelling assumptions in speech synthesis using stimuli constructed from repeated natural speech," in Interspeech 2014, 2014, pp. 1504–1508.
- [42] M. Bain, J. Huh, T. Han, and A. Zisserman, "WhisperX: Time-accurate speech transcription of long-form audio," in Interspeech 2023, 2023, pp. 4489–4493.
- [43] R. Ardila, M. Branson, K. Davis, M. Kohler, J. Meyer, M. Henretty, R. Morais, L. Saunders, F. Tyers, and G. Weber, "Common voice: A massively-multilingual speech corpus," in Proceedings of the Twelfth Language Resources and Evaluation Conference, 2020.
- [44] J. Zuluaga-Gomez, S. Ahmed, D. Visockas, and C. Subakan, "CommonAccent: Exploring large acoustic pretrained models for accent classification based on Common Voice," in Interspeech 2023, 2023, pp. 5291–5295.
- [45] L. Wan, Q. Wang, A. Papir, and I. L. Moreno, "Generalized end-to-end loss for speaker verification," in Proc. ICASSP 2018, 2018, pp. 4879–4883.
- [46] G. Heigold, E. Variani, T. Bagby, C. Allauzen, J. Ma, S. Kumar, and M. Riley, "Massive Sound Embedding Benchmark (MSEB)," in The Thirty-ninth Annual Conference on Neural Information Processing Systems Datasets and Benchmarks Track.
- [47] [Online]. Available: https://openreview.net/forum?id=X0juYgFVng