pith. machine review for the scientific record.

arxiv: 2605.02804 · v2 · submitted 2026-05-04 · 📡 eess.AS · cs.IR

Recognition: 2 theorem links


Multi-Axis Speech Similarity via Factor-Partitioned Embeddings

Authors on Pith: no claims yet

Pith reviewed 2026-05-11 00:43 UTC · model grok-4.3

classification 📡 eess.AS cs.IR
keywords speech embeddings · factor partitioning · multi-axis similarity · attribute-conditioned retrieval · speaker suppression · content matching · cross-corpus retrieval

The pith

Speech utterances can be mapped to a single vector whose subspaces hold separate attributes like content and speaker, so similarity can be tuned to include or exclude specific axes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes a factor-partitioned embedding method that keeps different speech attributes in distinct parts of one vector instead of mixing them. A shared encoder produces the base representation, then separate linear heads project it into subspaces trained to capture one attribute each. Similarity is then calculated as a weighted sum of cosines across those subspaces, with signs and weights chosen to emphasize or suppress particular attributes. This setup is tested on retrieval tasks where the goal is to match what was said while ignoring who said it, or the reverse, across different recording conditions.
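
Read literally, the similarity rule described here can be written as a signed weighted sum of per-axis cosine scores. The notation below is a reconstruction from the abstract rather than an equation taken from the paper: e(x) is the shared-encoder embedding of utterance x, P_a the linear head for axis a, and w_a a user-chosen signed weight.

    s_w(x, y) = \sum_{a \in \mathcal{A}} w_a \, \cos\!\bigl( P_a\, e(x),\; P_a\, e(y) \bigr), \qquad w_a \in \mathbb{R}

Setting w_a to zero ignores an axis; a negative w_a actively penalises matches along it, which is how same-speaker bias would be suppressed.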

Core claim

We present a factor-partitioned embedding framework that maps each utterance into a single vector whose subspaces correspond to distinct axes of variation. A shared acoustic encoder feeds per-axis linear projection heads, each trained via distillation from a specialist teacher or a contrastive objective over shared-label pairs. The resulting embeddings support attribute-conditioned retrieval by computing similarity as a signed weighted sum over per-axis cosine scores.

What carries the argument

Factor-partitioned embeddings: one vector whose subspaces are produced by per-axis linear projection heads on a shared acoustic encoder, each head trained to isolate a single attribute.
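
A minimal sketch of that construction in Python, assuming a frame-level acoustic encoder, mean pooling, and two illustrative axes; the class name, dimensions, and axis names are placeholders, not the authors' released code.

    import torch.nn as nn
    import torch.nn.functional as F

    class FactorPartitionedEmbedder(nn.Module):
        """Shared acoustic encoder followed by per-axis linear projection heads."""

        def __init__(self, encoder, encoder_dim, axis_dims):
            super().__init__()
            self.encoder = encoder  # any module mapping waveforms -> (batch, frames, encoder_dim)
            self.heads = nn.ModuleDict(
                {axis: nn.Linear(encoder_dim, dim) for axis, dim in axis_dims.items()}
            )

        def forward(self, waveforms):
            frames = self.encoder(waveforms)   # (batch, frames, encoder_dim)
            pooled = frames.mean(dim=1)        # mean pooling over the frame sequence
            # Each head projects the pooled vector into its own attribute subspace.
            return {axis: head(pooled) for axis, head in self.heads.items()}

    def signed_weighted_similarity(emb_a, emb_b, weights):
        """Signed weighted sum of per-axis cosine similarities between two embeddings."""
        return sum(
            w * F.cosine_similarity(emb_a[axis], emb_b[axis], dim=-1)
            for axis, w in weights.items()
        )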

If this is right

  • Retrieval can suppress same-speaker bias while matching semantic content across corpora.
  • Users can choose to emphasize style or dialect by weighting the corresponding axis higher.
  • Cross-condition searches become possible by down-weighting axes tied to recording differences.
  • The same vector supports both joint and selective attribute matching without retraining.
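
As an illustration of the scenarios above, the same pair of embeddings can be scored under different weight settings; the numeric weights are invented for illustration, and the snippet reuses the signed_weighted_similarity helper sketched earlier.

    import torch

    # Hypothetical weight settings for the retrieval modes listed above.
    content_only   = {"content": 1.0, "speaker": 0.0}    # match what was said, ignore who said it
    suppress_voice = {"content": 1.0, "speaker": -0.5}   # penalise same-speaker matches
    joint          = {"content": 1.0, "speaker": 1.0}    # match content and voice together

    # Toy embeddings standing in for model outputs on a query and a candidate.
    emb_q = {"content": torch.randn(1, 384), "speaker": torch.randn(1, 256)}
    emb_c = {"content": torch.randn(1, 384), "speaker": torch.randn(1, 256)}

    for name, w in [("content only", content_only),
                    ("suppress voice", suppress_voice),
                    ("joint", joint)]:
        print(name, signed_weighted_similarity(emb_q, emb_c, w).item())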

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The approach could be tested on longer or noisier speech to check if subspace isolation holds under more variable conditions.
  • Similar partitioning might be applied to other sequential data like text or music to allow attribute-controlled search.
  • If subspaces prove stable, the method could support incremental addition of new axes without rebuilding the full embedding.

Load-bearing premise

Training separate linear heads on shared-label pairs or distillation will keep each subspace aligned to only its intended attribute with little leakage to other axes.

What would settle it

An experiment that measures whether cosine similarity in one subspace still correlates strongly with an attribute assigned to a different subspace after training.
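
A minimal sketch of how such a leakage check could be run, under assumptions not in the paper: held-out per-axis embeddings, speaker labels, and a point-biserial correlation as the leakage measure.

    import numpy as np
    from scipy.stats import pointbiserialr

    def content_axis_speaker_leakage(content_embs, speaker_labels):
        """Correlate content-subspace cosine similarity with a same-speaker indicator.

        content_embs: (n, d) array of content-axis embeddings for held-out utterances.
        speaker_labels: length-n array of speaker ids.
        A strong positive correlation suggests speaker identity leaks into the
        content subspace; a value near zero suggests the axes are well separated.
        """
        normed = content_embs / np.linalg.norm(content_embs, axis=1, keepdims=True)
        cosines, same_speaker = [], []
        n = len(speaker_labels)
        for i in range(n):
            for j in range(i + 1, n):
                cosines.append(float(normed[i] @ normed[j]))
                same_speaker.append(int(speaker_labels[i] == speaker_labels[j]))
        return pointbiserialr(same_speaker, cosines)  # (correlation, p-value)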

read the original abstract

Speech encodes multiple simultaneous attributes -- linguistic content, speaker identity, dialect, gender -- that conventional single-vector embeddings conflate. We present a factor-partitioned embedding framework that maps each utterance into a single vector whose subspaces correspond to distinct axes of variation. A shared acoustic encoder feeds per-axis linear projection heads, each trained via distillation from a specialist teacher or a contrastive objective over shared-label pairs. The resulting embeddings support attribute-conditioned retrieval: similarity is computed as a signed weighted sum over per-axis cosine scores, allowing retrieval that jointly considers what was said and how -- or explicitly suppresses one attribute to surface another. We evaluate on cross-corpus retrieval over corpora sharing the Harvard sentence prompts, demonstrating that signed axis weighting can suppress same-speaker bias and surface semantically matched utterances across recording conditions. Code is available at: https://github.com/jimregan/spoken-sentence-transformers

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes a factor-partitioned embedding framework for speech utterances. A shared acoustic encoder produces features that are fed to independent linear projection heads, each trained via distillation from a specialist teacher or contrastive loss over shared-label pairs, to create subspaces corresponding to distinct attributes (e.g., linguistic content, speaker identity). Similarity is computed as a signed weighted sum of per-axis cosine scores, enabling retrieval that can jointly consider or explicitly suppress specific attributes. The approach is evaluated via cross-corpus retrieval on corpora sharing Harvard sentence prompts, with the claim that signed axis weighting suppresses same-speaker bias while surfacing semantically matched utterances across conditions.

Significance. If the subspaces reliably isolate attributes without cross-axis leakage, the framework offers a compact, single-vector representation that supports flexible, attribute-conditioned speech retrieval and analysis. This could be useful for tasks requiring content matching independent of speaker or recording conditions. The open code repository is a positive factor for reproducibility.

major comments (2)
  1. [Evaluation] The evaluation description (cross-corpus retrieval over Harvard-sentence corpora) is high-level and provides no quantitative metrics, baselines, error analysis, or statistical details. Without these, it is not possible to verify whether the signed weighted cosine sum actually achieves the claimed attribute suppression or separation.
  2. [Method] The central construction relies on per-axis linear projection heads producing subspaces that correspond to distinct attributes without leakage. If the shared encoder produces non-linearly entangled features (common in acoustic models), the linear heads cannot isolate axes, so the signed weighted sum will still mix attributes when one is down-weighted. This assumption is load-bearing for the retrieval claims but receives no empirical test or discussion.
minor comments (1)
  1. [Abstract] The abstract states that the method 'demonstrates' bias suppression but does not preview any specific metrics or effect sizes.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments. We address each major comment below and indicate revisions to the manuscript.

read point-by-point responses
  1. Referee: [Evaluation] The evaluation description (cross-corpus retrieval over Harvard-sentence corpora) is high-level and provides no quantitative metrics, baselines, error analysis, or statistical details. Without these, it is not possible to verify whether the signed weighted cosine sum actually achieves the claimed attribute suppression or separation.

    Authors: We agree the original evaluation was high-level. The revised manuscript adds quantitative retrieval metrics (MRR and P@10 for content-matched retrieval under speaker suppression), comparisons to single-vector baselines (e.g., Wav2Vec 2.0 and HuBERT), error analysis of residual same-speaker retrievals, and bootstrap confidence intervals with paired significance tests. revision: yes

  2. Referee: [Method] The central construction relies on per-axis linear projection heads producing subspaces that correspond to distinct attributes without leakage. If the shared encoder produces non-linearly entangled features (common in acoustic models), the linear heads cannot isolate axes, so the signed weighted sum will still mix attributes when one is down-weighted. This assumption is load-bearing for the retrieval claims but receives no empirical test or discussion.

    Authors: The concern is valid: linear heads cannot guarantee isolation if the encoder features remain entangled. Our training objectives (distillation and contrastive losses) are intended to drive subspace specialization, but we have added a dedicated discussion of potential cross-axis leakage together with an empirical check of inter-subspace cosine correlations on held-out data to quantify the degree of separation achieved. revision: partial
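
For reference, the retrieval metrics named in response 1 (MRR and P@10) reduce to a few lines; the per-query ranking format below is an assumption, not the authors' evaluation harness.

    def mean_reciprocal_rank(ranked_relevance):
        """ranked_relevance: per-query lists of 0/1 relevance flags, best-ranked candidate first."""
        total = 0.0
        for flags in ranked_relevance:
            rank = next((i + 1 for i, rel in enumerate(flags) if rel), None)
            total += 1.0 / rank if rank is not None else 0.0
        return total / len(ranked_relevance)

    def precision_at_k(ranked_relevance, k=10):
        """Mean fraction of relevant candidates among the top-k retrieved."""
        return sum(sum(flags[:k]) / k for flags in ranked_relevance) / len(ranked_relevance)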

Circularity Check

0 steps flagged

No significant circularity; framework is a novel construction with external objectives

full rationale

The paper introduces a factor-partitioned embedding framework via a shared acoustic encoder and per-axis linear projection heads trained with distillation or contrastive objectives on shared-label pairs. No equations, derivations, or self-citations are presented that reduce the claimed attribute isolation or signed weighted similarity to fitted parameters by construction or to prior self-referential results. The evaluation on cross-corpus retrieval over Harvard sentence corpora supplies an external benchmark, keeping the central claims independent of the inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that speech attributes are separable via linear projections after a shared encoder, with no free parameters or invented entities explicitly introduced in the abstract.

axioms (1)
  • domain assumption: Distinct speech attributes can be isolated into separate linear subspaces of a shared embedding vector through distillation or contrastive training.
    Invoked in the description of the per-axis projection heads and their training objectives.

pith-pipeline@v0.9.0 · 5446 in / 1252 out tokens · 53105 ms · 2026-05-11T00:43:13.713548+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

46 extracted references · 3 canonical work pages · 1 internal anchor

  1. [1]

    Introduction Speech, in contrast with text, is inherently variable. We never say the same thing the same way twice: different speakers say the same thing differently, and even the same speaker says the same thing differently at different times, due to a variety of factors such as mood, context, environment, and communicative intent. Beyond linguistic co...

  2. [3]

    extend content/timbre separation to three factors by adding prosody, using a VAE with restricted channel capacity to force disentanglement without explicit supervision. Their approach is generative and unsupervised; ours is discriminative and supervision-guided, using pre-trained teacher models to provide independent training signal for each axis. Th...

  3. [4]

    Representation Learning with Contrastive Predictive Coding

    Method 3.1. Architecture The model follows the SentenceTransformers pipeline: 1. Acoustic encoder: a frozen or fine-tuned HuggingFace audio model (default: WavLM-base-plus [6]) maps raw waveform frames to a sequence of hidden states. 2. Pooling: mean pooling over the frame sequence produces a single vector. 3. Multi-axis projection: a set of linear projectio...

  4. [5]

    Benchmarks such as SUPERB [8] evaluate speech embeddings under individual task metrics (e.g., speaker verification EER, content WER), but not controllable combinations

    Evaluation: cross-corpus retrieval Existing evaluation frameworks do not directly measure multi-axis similarity. Benchmarks such as SUPERB [8] evaluate speech embeddings under individual task metrics (e.g., speaker verification EER, content WER), but not controllable combinations. MSEB [35] provides large-scale retrieval evaluation across diverse sou...

  5. [6]

    In-domain semantic matching: Recovering missing labels for VCTK speaker “p315” by searching an index of VCTK speakers with labels

  6. [7]

    Out-of-domain semantic matching: Evaluating cross-corpus retrieval (rehasp→OSR); 20 of the sentences are common to both sets

  7. [8]

    easy” same-speaker matches in favour of “hard

    Out-of-domain semantic matching: Using negative weighting on the speaker axis to force the model to ignore “easy” same-speaker matches in favour of “hard” cross-speaker semantic matches. Because the goal is to support controllable similarity rather than optimise a single metric, evaluation focuses on whether axis weighting changes retrieval behaviour ...

  8. [9]

    Results and observations We evaluate eight model variants in a controlled ablation. All models use semantic:384 (matching the MiniLM teacher exactly, eliminating any alignment matrix on the semantic axis), except for the PCA baseline which uses semantic:256 targets derived by PCA projection of the MiniLM embedding space. No gender axis is included in any va...

  9. [10]

    Conclusion We presented a factor-partitioned embedding framework for speech that supports similarity search across multiple attribute axes simultaneously. Axis-specific projection heads are trained via teacher distillation or contrastive objectives over shared-label pairs, producing a single concatenated embedding that can be queried with signed per-axis...

  10. [11]

    Can the definition of each speaker be expected to come from the laboratory in the next decades?

    F. Nolan, “Can the definition of each speaker be expected to come from the laboratory in the next decades?” in Proceedings of the XIIIth International Congress of Phonetic Sciences, vol. 3, Stockholm, Sweden, Aug. 1995, pp. 130–137

  11. [12]

    The CMU Arctic speech databases,

    J. Kominek and A. W. Black, “The CMU Arctic speech databases,” in 5th ISCA Workshop on Speech Synthesis (SSW 5), 2004, pp. 223–224

  12. [13]

    The voice bank corpus: Design, collection and data analysis of a large regional accent speech database,

    C. Veaux, J. Yamagishi, and S. King, “The voice bank corpus: Design, collection and data analysis of a large regional accent speech database,” in 2013 International Conference Oriental COCOSDA held jointly with 2013 Conference on Asian Spoken Language Research and Evaluation (O-COCOSDA/CASLRE), 2013, pp. 1–4

  13. [14]

    wav2vec 2.0: a framework for self-supervised learning of speech representations,

    A. Baevski, H. Zhou, A. Mohamed, and M. Auli, “wav2vec 2.0: a framework for self-supervised learning of speech representations,” in Proceedings of the 34th International Conference on Neural Information Processing Systems, ser. NIPS ’20. Red Hook, NY, USA: Curran Associates Inc., 2020

  14. [15]

    HuBERT: Self-supervised speech representation learning by masked prediction of hidden units,

    W.-N. Hsu, B. Bolte, Y.-H. H. Tsai, K. Lakhotia, R. Salakhutdinov, and A. Mohamed, “HuBERT: Self-supervised speech representation learning by masked prediction of hidden units,” IEEE/ACM Trans. Audio, Speech and Lang. Proc., vol. 29, pp. 3451–3460, Oct. 2021. [Online]. Available: https://doi.org/10.1109/TASLP.2021.3122291

  15. [16]

    WavLM: Large-scale self-supervised pre-training for full stack speech processing,

    S. Chen, C. Wang, Z. Chen, Y. Wu, S. Liu, Z. Chen, J. Li, N. Kanda, T. Yoshioka, X. Xiao, J. Wu, L. Zhou, S. Ren, Y. Qian, Y. Qian, J. Wu, M. Zeng, X. Yu, and F. Wei, “WavLM: Large-scale self-supervised pre-training for full stack speech processing,” IEEE Journal of Selected Topics in Signal Processing, vol. 16, no. 6, pp. 1505–1518, 2022

  16. [17]

    BERT: Pre-training of deep bidirectional transformers for language understanding,

    J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of deep bidirectional transformers for language understanding,” in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), J. Burstein, C. Doran, and T. Solorio, Eds...

  17. [18]

    SUPERB: Speech Processing Universal PERformance Benchmark,

    S.-w. Yang, P.-H. Chi, Y.-S. Chuang, C.-I. J. Lai, K. Lakhotia, Y. Y. Lin, A. T. Liu, J. Shi, X. Chang, G.-T. Lin, T.-H. Huang, W.-C. Tseng, K.-t. Lee, D.-R. Liu, Z. Huang, S. Dong, S.-W. Li, S. Watanabe, A. Mohamed, and H.-y. Lee, “SUPERB: Speech Processing Universal PERformance Benchmark,” in Interspeech 2021, 2021, pp. 1194–1198

  18. [19]

    X-Vectors: Robust DNN embeddings for speaker recognition,

    D. Snyder, D. Garcia-Romero, G. Sell, D. Povey, and S. Khudanpur, “X-Vectors: Robust DNN embeddings for speaker recognition,” in Proc. ICASSP 2018, 2018, pp. 5329–5333

  19. [20]

    Deep convolutional acoustic word embeddings using word-pair side information,

    H. Kamper, W. Wang, and K. Livescu, “Deep convolutional acoustic word embeddings using word-pair side information,” in Proc. ICASSP 2016, 2016, pp. 4950–4954

  20. [21]

    Multi-view recurrent neural acoustic word embeddings,

    W. He, W. Wang, and K. Livescu, “Multi-view recurrent neural acoustic word embeddings,” in International Conference on Learning Representations, 2017. [Online]. Available: https://openreview.net/forum?id=rJxDkvqee

  21. [22]

    Audio Word2Vec: Unsupervised learning of audio segment representations using sequence-to-sequence autoencoder,

    Y.-A. Chung, C.-C. Wu, C.-H. Shen, H.-Y. Lee, and L.-S. Lee, “Audio Word2Vec: Unsupervised learning of audio segment representations using sequence-to-sequence autoencoder,” in Interspeech 2016, 2016, pp. 765–769

  22. [23]

    Sentence-BERT: Sentence embeddings using Siamese BERT-networks,

    N. Reimers and I. Gurevych, “Sentence-BERT: Sentence embeddings using Siamese BERT-networks,” in Proc. EMNLP-IJCNLP 2019, K. Inui, J. Jiang, V. Ng, and X. Wan, Eds. Hong Kong, China: Association for Computational Linguistics, Nov. 2019, pp. 3982–3992. [Online]. Available: https://aclanthology.org/D19-1410/

  23. [24]

    Conditional similarity networks,

    A. Veit, S. Belongie, and T. Karaletsos, “Conditional similarity networks,” in 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 1781–1789

  24. [25]

    SpeechDPR: End-to-end spoken passage retrieval for open-domain spoken question answering,

    C.-J. Lin, G.-T. Lin, Y.-S. Chuang, W.-L. Wu, S.-W. Li, A. Mohamed, H.-Y. Lee, and L.-S. Lee, “SpeechDPR: End-to-end spoken passage retrieval for open-domain spoken question answering,” in Proc. ICASSP 2024, 2024, pp. 12476–12480

  25. [26]

    Speech retrieval-augmented generation without automatic speech recognition,

    D. J. Min, K. Mundnich, A. Lapastora, E. Soltanmohammadi, S. Ronanki, and K. Han, “Speech retrieval-augmented generation without automatic speech recognition,” in Proc. ICASSP 2025, 2025, pp. 1–5

  26. [27]

    CLAP learning audio concepts from natural language supervision,

    B. Elizalde, S. Deshmukh, M. A. Ismail, and H. Wang, “CLAP learning audio concepts from natural language supervision,” in Proc. ICASSP 2023, 2023, pp. 1–5

  27. [28]

    Unsupervised learning of disentangled and interpretable representations from sequential data,

    W.-N. Hsu, Y. Zhang, and J. Glass, “Unsupervised learning of disentangled and interpretable representations from sequential data,” in Proceedings of the 31st International Conference on Neural Information Processing Systems, ser. NIPS’17. Red Hook, NY, USA: Curran Associates Inc., 2017, pp. 1876–1887

  28. [29]

    Unsupervised speech decomposition via triple information bottleneck,

    K. Qian, Y. Zhang, S. Chang, D. Cox, and M. Hasegawa-Johnson, “Unsupervised speech decomposition via triple information bottleneck,” in Proceedings of the 37th International Conference on Machine Learning, ser. ICML’20. JMLR.org, 2020

  29. [30]

    ContentVec: An improved self-supervised speech representation by disentangling speakers,

    K. Qian, Y. Zhang, H. Gao, J. Ni, C.-I. Lai, D. Cox, M. Hasegawa-Johnson, and S. Chang, “ContentVec: An improved self-supervised speech representation by disentangling speakers,” in Proceedings of the 39th International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, K. Chaudhuri, S. Jegelka, L. Song, C. Szepesvari, G. Niu, ...

  30. [31]

    SpeechTripleNet: End-to-end disentangled speech representation learning for content, timbre and prosody,

    H. Lu, X. Wu, Z. Wu, and H. Meng, “SpeechTripleNet: End-to-end disentangled speech representation learning for content, timbre and prosody,” in Proceedings of the 31st ACM International Conference on Multimedia, ser. MM ’23. New York, NY, USA: Association for Computing Machinery, 2023, pp. 2829–2837. [Online]. Available: https://doi.org/10.1145/3581783.3612485

  31. [32]

    beta-VAE: Learning basic visual concepts with a constrained variational framework,

    I. Higgins, L. Matthey, A. Pal, C. Burgess, X. Glorot, M. Botvinick, S. Mohamed, and A. Lerchner, “beta-VAE: Learning basic visual concepts with a constrained variational framework,” in International Conference on Learning Representations, 2017. [Online]. Available: https://openreview.net/forum?id=Sy2fzU9gl

  32. [33]

    Learning disentangled speech representations with contrastive learning and time-invariant retrieval,

    Y. Deng, H. Tang, X. Zhang, N. Cheng, J. Xiao, and J. Wang, “Learning disentangled speech representations with contrastive learning and time-invariant retrieval,” in Proc. ICASSP 2024, 2024, pp. 10271–10275

  33. [34]

    SpeechTokenizer: Unified speech tokenizer for speech language models,

    X. Zhang, D. Zhang, S. Li, Y. Zhou, and X. Qiu, “SpeechTokenizer: Unified speech tokenizer for speech language models,” in The Twelfth International Conference on Learning Representations, 2024. [Online]. Available: https://openreview.net/forum?id=AF9Q8Vip84

  34. [35]

    BEST-STD: Bidirectional mamba-enhanced speech tokenization for spoken term detection,

    A. Singh, K. Demuynck, and V. Arora, “BEST-STD: Bidirectional mamba-enhanced speech tokenization for spoken term detection,” in Proc. ICASSP 2025, 2025, pp. 1–5

  35. [36]

    MiniLM: Deep self-attention distillation for task-agnostic compression of pre-trained transformers,

    W. Wang, F. Wei, L. Dong, H. Bao, N. Yang, and M. Zhou, “MiniLM: Deep self-attention distillation for task-agnostic compression of pre-trained transformers,” in Proceedings of the 34th International Conference on Neural Information Processing Systems, ser. NIPS ’20. Red Hook, NY, USA: Curran Associates Inc., 2020

  36. [37]

    CSTR VCTK Corpus: English multi-speaker corpus for CSTR voice cloning toolkit (version 0.92),

    J. Yamagishi, C. Veaux, and K. MacDonald, “CSTR VCTK Corpus: English multi-speaker corpus for CSTR voice cloning toolkit (version 0.92),” University of Edinburgh. The Centre for Speech Technology Research (CSTR), 2019

  37. [38]

    Open-source multi-speaker corpora of the English accents in the British Isles,

    I. Demirsahin, O. Kjartansson, A. Gutkin, and C. Rivera, “Open-source multi-speaker corpora of the English accents in the British Isles,” in Proceedings of The 12th Language Resources and Evaluation Conference (LREC). Marseille, France: European Language Resources Association (ELRA), May 2020, pp. 6532–

  38. [39]

    Available: https://www.aclweb.org/anthology/2020.lrec-1.804

    [Online]. Available: https://www.aclweb.org/anthology/2020.lrec-1.804

  39. [40]

    IEEE recommended practice for speech quality measurements,

    IEEE, “IEEE recommended practice for speech quality measurements,” IEEE Transactions on Audio and Electroacoustics, vol. 17, no. 3, pp. 225–246, 1969

  40. [41]

    Measuring the perceptual effects of modelling assumptions in speech synthesis using stimuli constructed from repeated natural speech,

    G. E. Henter, T. Merritt, M. Shannon, C. Mayo, and S. King, “Measuring the perceptual effects of modelling assumptions in speech synthesis using stimuli constructed from repeated natural speech,” in Interspeech 2014, 2014, pp. 1504–1508

  41. [42]

    WhisperX: Time-accurate speech transcription of long-form audio,

    M. Bain, J. Huh, T. Han, and A. Zisserman, “WhisperX: Time-accurate speech transcription of long-form audio,” in Interspeech 2023, 2023, pp. 4489–4493

  42. [43]

    Common voice: A massively-multilingual speech corpus,

    R. Ardila, M. Branson, K. Davis, M. Kohler, J. Meyer, M. Henretty, R. Morais, L. Saunders, F. Tyers, and G. Weber, “Common voice: A massively-multilingual speech corpus,” in Proceedings of the Twelfth Language Resources and Evaluation Conference, N. Calzolari, F. Béchet, P. Blache, K. Choukri, C. Cieri, T. Declerck, S. Goggi, H. Isahara, B. Maegaard, J....

  43. [44]

    CommonAccent: Exploring large acoustic pretrained models for accent classification based on Common Voice,

    J. Zuluaga-Gomez, S. Ahmed, D. Visockas, and C. Subakan, “CommonAccent: Exploring large acoustic pretrained models for accent classification based on Common Voice,” in Interspeech 2023, 2023, pp. 5291–5295

  44. [45]

    Generalized end-to-end loss for speaker verification,

    L. Wan, Q. Wang, A. Papir, and I. L. Moreno, “Generalized end-to-end loss for speaker verification,” in Proc. ICASSP 2018, 2018, pp. 4879–4883

  45. [46]

    Massive Sound Embedding Benchmark (MSEB),

    G. Heigold, E. Variani, T. Bagby, C. Allauzen, J. Ma, S. Kumar, and M. Riley, “Massive Sound Embedding Benchmark (MSEB),” in The Thirty-ninth Annual Conference on Neural Information Processing Systems Datasets and Benchmarks Track,

  46. [47]

    Available: https://openreview.net/forum?id=X0juYgFVng

    [Online]. Available: https://openreview.net/forum?id=X0juYgFVng