Speech-FT: Merging Pre-trained And Fine-Tuned Speech Representation Models For Cross-Task Generalization

Hao Tang; Hung-yi Lee; Tzu-Quan Lin; Wei-Ping Huang

arXiv: 2502.12672 · v4 · submitted 2025-02-18 · cs.CL · cs.AI · cs.SD

Pith reviewed 2026-05-23 02:48 UTC · model grok-4.3

classification cs.CL · cs.AI · cs.SD

keywords Speech-FT · fine-tuning · cross-task generalization · representational drift · weight interpolation · speech representation models · SUPERB benchmark · HuBERT

The pith

Speech-FT uses an initial drift-reducing fine-tune followed by weight interpolation to retain cross-task generalization while gaining task performance.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that fine-tuning speech representation models typically degrades their ability to handle unrelated tasks because the representations drift too far from the pre-trained state. Speech-FT counters this with a two-stage process: first a fine-tuning step designed to limit that drift, then linear interpolation of the resulting weights with those of the original pre-trained model. Experiments on HuBERT, wav2vec 2.0, DeCoAR 2.0 and WavLM Base+ show consistent gains in supervised, unsupervised and multitask fine-tuning settings, together with stronger cross-task generalization than weight-space regularization or LoRA baselines. A sympathetic reader would care because the method keeps the broad utility of pre-trained models while still allowing task-specific refinement.

Core claim

Speech-FT is a two-stage fine-tuning framework that first applies fine-tuning specifically designed to reduce representational drift, followed by weight-space interpolation with the pre-trained model to restore cross-task generalization. This produces higher feature similarity to the pre-trained model than methods that directly constrain weight changes, despite allowing larger overall updates, and yields concrete gains such as lowering phone error rate from 5.17% to 3.94% and raising speaker identification accuracy from 81.86% to 84.11% when fine-tuning HuBERT on automatic speech recognition.

What carries the argument

The two-stage sequence of drift-reducing fine-tuning followed by linear weight interpolation between the fine-tuned and pre-trained models.
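As a minimal sketch of the interpolation stage, assuming both checkpoints share the same parameter names and shapes, the merge can be written over plain dictionaries of weights. The function name `interpolate_weights`, the coefficient name `alpha`, and the convention that `alpha` weights the fine-tuned model are illustrative assumptions; this page does not state the coefficient or convention the paper uses.

```python
from typing import Dict, List

Weights = Dict[str, List[float]]


def interpolate_weights(pretrained: Weights, finetuned: Weights, alpha: float) -> Weights:
    """Return (1 - alpha) * pretrained + alpha * finetuned, parameter by parameter.

    alpha = 0.0 gives back the pre-trained weights, alpha = 1.0 the fine-tuned ones.
    The direction of alpha is an assumption made for this sketch.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if pretrained.keys() != finetuned.keys():
        raise ValueError("checkpoints must share the same parameter names")

    merged: Weights = {}
    for name, w0 in pretrained.items():
        w1 = finetuned[name]
        if len(w0) != len(w1):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        merged[name] = [(1.0 - alpha) * a + alpha * b for a, b in zip(w0, w1)]
    return merged


# Toy checkpoints with made-up values, only to show the call.
theta_pre = {"layer.weight": [0.0, 1.0], "layer.bias": [2.0]}
theta_ft = {"layer.weight": [1.0, 3.0], "layer.bias": [4.0]}
theta_merged = interpolate_weights(theta_pre, theta_ft, alpha=0.25)
```

With a real model the same arithmetic would be applied to each tensor in the two checkpoints; flat lists are used here only to keep the example free of dependencies.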

If this is right

Speech-FT improves performance on automatic speech recognition, speaker identification and other SUPERB tasks while preserving generalization.
The approach outperforms weight-space regularization and LoRA across supervised, unsupervised and multitask scenarios.
It maintains higher feature similarity to the pre-trained model than direct constraint methods despite larger weight updates.
The same two-stage pattern works on multiple base models including HuBERT, wav2vec 2.0, DeCoAR 2.0 and WavLM Base+.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Controlling the trajectory of fine-tuning rather than only its magnitude may matter more for preserving generalization than previously assumed.
The interpolation step could be tested with non-linear merging operators or applied after multiple drift-reduction rounds.
Similar staged drift control plus merging might extend to other sequence models where pre-training and task adaptation conflict.
The reported feature-similarity advantage suggests measuring representational drift directly during training could serve as an early stopping signal.

Load-bearing premise

That an initial fine-tuning stage can be engineered to cut representational drift enough for later weight interpolation to restore generalization without creating new failure modes.

What would settle it

An experiment in which Speech-FT produces lower feature similarity to the pre-trained model than a regularized fine-tune, or shows no cross-task improvement over standard fine-tuning on held-out SUPERB tasks.
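A check of this kind needs a concrete similarity measure, which this summary does not specify. The sketch below assumes mean frame-wise cosine similarity per layer, computed on the same audio passed through both models; the metric choice and the names `cosine_similarity` and `layerwise_similarity` are assumptions for illustration, not the paper's stated procedure.

```python
import math
from typing import Dict, List, Sequence

Frame = Sequence[float]


def cosine_similarity(u: Frame, v: Frame) -> float:
    """Cosine of the angle between two feature vectors of equal length."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimension")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def layerwise_similarity(
    pre_feats: Dict[int, List[Frame]],
    new_feats: Dict[int, List[Frame]],
) -> Dict[int, float]:
    """Average frame-wise cosine similarity for each layer index."""
    scores: Dict[int, float] = {}
    for layer, frames_pre in pre_feats.items():
        frames_new = new_feats[layer]
        if len(frames_pre) != len(frames_new) or not frames_pre:
            raise ValueError(f"layer {layer}: need the same non-zero number of frames")
        sims = [cosine_similarity(u, v) for u, v in zip(frames_pre, frames_new)]
        scores[layer] = sum(sims) / len(sims)
    return scores


# Made-up two-layer features for a single two-frame utterance.
pre = {0: [[1.0, 0.0], [0.0, 2.0]], 1: [[3.0, 4.0], [1.0, 0.0]]}
new = {0: [[2.0, 0.0], [0.0, 1.0]], 1: [[3.0, 4.0], [0.0, 1.0]]}
similarity = layerwise_similarity(pre, new)
```

Running the same function on features from Speech-FT and from a regularized fine-tune, both compared against the pre-trained model, would give the per-layer numbers the falsification test calls for.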

Figures

Figures reproduced from arXiv: 2502.12672 by Hao Tang, Hung-yi Lee, Tzu-Quan Lin, Wei-Ping Huang.

**Figure 1.** The pipeline of Speech-FT for representation learning and evaluation. Step 1: A pre-trained representation model …

**Figure 2.** Feature similarity with the pre-trained model. (Top) Effect of …

**Figure 3.** Average L2 distortion per parameter in the weight space with …
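The quantity named in the Figure 3 caption can be illustrated with a short sketch. The caption is truncated on this page, so the exact definition is not available; the version below assumes one plausible reading, the Euclidean distance between two checkpoints divided by the number of scalar parameters. The function name `avg_l2_distortion` is hypothetical.

```python
import math
from typing import Dict, List

Weights = Dict[str, List[float]]


def avg_l2_distortion(pretrained: Weights, other: Weights) -> float:
    """Euclidean distance between two checkpoints divided by the scalar-parameter count.

    This normalization is an assumption; the paper may define the quantity differently.
    """
    squared = 0.0
    count = 0
    for name, w0 in pretrained.items():
        w1 = other[name]
        if len(w0) != len(w1):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        squared += sum((b - a) ** 2 for a, b in zip(w0, w1))
        count += len(w0)
    if count == 0:
        raise ValueError("no parameters to compare")
    return math.sqrt(squared) / count


# Made-up checkpoints: the difference vector is (3, 0, 0, 4), whose norm is 5.
theta_pre = {"w": [0.0, 0.0], "b": [0.0, 0.0]}
theta_ft = {"w": [3.0, 0.0], "b": [0.0, 4.0]}
distortion = avg_l2_distortion(theta_pre, theta_ft)
```

Under this reading, a method can show a larger weight-space distortion than a regularized baseline and still keep its features closer to the pre-trained model, which is the contrast the analysis draws.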

Original abstract

Fine-tuning speech representation models can enhance performance on specific tasks but often compromises their cross-task generalization ability. This degradation is often caused by excessive changes in the representations, making it difficult to retain information learned during pre-training. Existing approaches, such as regularizing weight changes during fine-tuning, may fail to maintain sufficiently high feature similarity with the pre-trained model, and thus could possibly lose cross-task generalization. To address this issue, we propose Speech-FT, a novel two-stage fine-tuning framework designed to maintain cross-task generalization while benefiting from fine-tuning. Speech-FT first applies fine-tuning specifically designed to reduce representational drift, followed by weight-space interpolation with the pre-trained model to restore cross-task generalization. Extensive experiments on HuBERT, wav2vec 2.0, DeCoAR 2.0, and WavLM Base+ demonstrate that Speech-FT consistently improves performance across a wide range of supervised, unsupervised, and multitask fine-tuning scenarios. Moreover, Speech-FT achieves superior cross-task generalization compared to fine-tuning baselines that explicitly constrain weight changes, such as weight-space regularization and LoRA fine-tuning. Our analysis reveals that Speech-FT maintains higher feature similarity to the pre-trained model compared to alternative strategies, despite allowing larger weight-space updates. Notably, Speech-FT achieves significant improvements on the SUPERB benchmark. For example, when fine-tuning HuBERT on automatic speech recognition, Speech-FT is able to reduce phone error rate from 5.17% to 3.94%, lower word error rate from 6.38% to 5.75%, and increase speaker identification accuracy from 81.86% to 84.11%. Speech-FT provides a simple yet powerful solution for further refining speech representation models after pre-training.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The manuscript introduces Speech-FT, a two-stage fine-tuning framework for speech representation models (HuBERT, wav2vec 2.0, DeCoAR 2.0, WavLM Base+). Stage one applies specialized fine-tuning to reduce representational drift from the pre-trained checkpoint; stage two performs linear weight-space interpolation between the stage-one checkpoint and the original pre-trained model. Experiments across supervised, unsupervised, and multitask scenarios on the SUPERB benchmark report consistent gains over standard fine-tuning, weight-regularization baselines, and LoRA, with examples including HuBERT ASR phone error rate reduced from 5.17% to 3.94%, word error rate from 6.38% to 5.75%, and speaker identification accuracy increased from 81.86% to 84.11%. Analysis asserts higher feature similarity to the pre-trained model despite larger weight updates.

Significance. If the empirical claims are substantiated, Speech-FT supplies a lightweight, post-pre-training refinement technique that improves task performance while restoring cross-task generalization more effectively than explicit weight-change constraints. The evaluation spans four model families and multiple training regimes, which is a constructive aspect of the work. The absence of statistical controls and ablation detail, however, currently prevents a firm assessment of whether the two-stage procedure delivers a genuine advance over simpler interpolation or regularization.

major comments (3)

[Abstract / §4 (Experimental Results)] Abstract and experimental results sections: the headline improvements (e.g., PER 5.17% → 3.94%, WER 6.38% → 5.75%, SID 81.86% → 84.11% for HuBERT ASR) are reported as single-point estimates without error bars, standard deviations across random seeds, or statistical significance tests. Because the central claim is that Speech-FT “consistently improves performance,” the lack of these controls makes it impossible to judge whether the reported deltas exceed typical run-to-run variance.
[§3 (Proposed Method)] Method section (description of the first-stage fine-tuning): the procedure is characterized only as “fine-tuning specifically designed to reduce representational drift,” yet no equation, loss term, or hyper-parameter schedule distinguishes this stage from ordinary supervised fine-tuning. Without an ablation that isolates the drift-reduction stage from the subsequent interpolation, it remains unclear whether the two-stage pipeline is required or whether direct interpolation of a standard fine-tuned checkpoint would produce equivalent results.
[§5 (Analysis)] Analysis section on feature similarity: the assertion that Speech-FT achieves “higher feature similarity to the pre-trained model … despite allowing larger weight-space updates” is load-bearing for the generalization argument, but the manuscript supplies neither the precise similarity metric (e.g., layer-wise cosine similarity on held-out data), the interpolation ratio schedule, nor side-by-side tables comparing similarity values across Speech-FT, LoRA, and weight-regularized baselines. This gap directly affects the claim that the method restores cross-task generalization via the observed similarity.

minor comments (1)

[§3] The manuscript would benefit from an explicit statement of the interpolation coefficient (or search range) used in all reported experiments, as this hyper-parameter directly controls the trade-off between task performance and generalization.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed comments, which identify key areas where the manuscript can be strengthened. We address each major comment below and commit to revisions that will incorporate additional experiments, clarifications, and quantitative details.

Referee: [Abstract / §4 (Experimental Results)] Abstract and experimental results sections: the headline improvements (e.g., PER 5.17% → 3.94%, WER 6.38% → 5.75%, SID 81.86% → 84.11% for HuBERT ASR) are reported as single-point estimates without error bars, standard deviations across random seeds, or statistical significance tests. Because the central claim is that Speech-FT “consistently improves performance,” the lack of these controls makes it impossible to judge whether the reported deltas exceed typical run-to-run variance.

Authors: We acknowledge that single-point estimates without variance measures or significance tests weaken the ability to substantiate the claim of consistent improvement. In the revised manuscript we will rerun all primary experiments across at least five random seeds, report means and standard deviations, and include paired statistical significance tests (e.g., t-tests) for the headline comparisons against baselines. revision: yes
Referee: [§3 (Proposed Method)] Method section (description of the first-stage fine-tuning): the procedure is characterized only as “fine-tuning specifically designed to reduce representational drift,” yet no equation, loss term, or hyper-parameter schedule distinguishes this stage from ordinary supervised fine-tuning. Without an ablation that isolates the drift-reduction stage from the subsequent interpolation, it remains unclear whether the two-stage pipeline is required or whether direct interpolation of a standard fine-tuned checkpoint would produce equivalent results.

Authors: The referee correctly notes that the first-stage procedure lacks an explicit formulation. The drift-reduction stage employs a representation-level regularization term (added to the task loss) that penalizes deviation of intermediate activations from the pre-trained model; we will insert the precise loss equation and hyper-parameter schedule into Section 3. We will also add an ablation that directly compares (a) standard fine-tuning followed by interpolation versus (b) the proposed drift-reduced stage followed by interpolation, thereby isolating the contribution of each component. revision: yes
Referee: [§5 (Analysis)] Analysis section on feature similarity: the assertion that Speech-FT achieves “higher feature similarity to the pre-trained model … despite allowing larger weight-space updates” is load-bearing for the generalization argument, but the manuscript supplies neither the precise similarity metric (e.g., layer-wise cosine similarity on held-out data), the interpolation ratio schedule, nor side-by-side tables comparing similarity values across Speech-FT, LoRA, and weight-regularized baselines. This gap directly affects the claim that the method restores cross-task generalization via the observed similarity.

Authors: We agree that the similarity analysis requires greater specificity and comparative evidence. The revised manuscript will (i) define the metric as layer-wise cosine similarity evaluated on held-out data, (ii) report the interpolation ratios (including the schedule or selected α values), and (iii) add a table that juxtaposes similarity scores for Speech-FT, LoRA, and weight-regularization baselines, directly supporting the generalization argument. revision: yes
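The seed-level reporting promised in the first response could be computed along the following lines using Python's standard `statistics` module. The scores are placeholders chosen for easy arithmetic and are not results from the paper; the helper names `summarize` and `paired_t_statistic` are hypothetical. The sketch stops at the t statistic, since turning it into a p-value requires a t-distribution table or a statistics library.

```python
import math
import statistics
from typing import Sequence, Tuple


def summarize(scores: Sequence[float]) -> Tuple[float, float]:
    """Mean and sample standard deviation across seeds."""
    return statistics.mean(scores), statistics.stdev(scores)


def paired_t_statistic(baseline: Sequence[float], method: Sequence[float]) -> float:
    """t statistic of the per-seed differences (method - baseline), n - 1 degrees of freedom."""
    if len(baseline) != len(method) or len(baseline) < 2:
        raise ValueError("need at least two paired runs")
    diffs = [m - b for b, m in zip(baseline, method)]
    sd = statistics.stdev(diffs)
    if sd == 0.0:
        raise ValueError("differences are constant; the t statistic is undefined")
    return statistics.mean(diffs) / (sd / math.sqrt(len(diffs)))


# Placeholder accuracies for five seeds; these are NOT numbers from the paper.
baseline_runs = [80.0, 81.0, 82.0, 81.0, 81.0]
method_runs = [82.0, 84.0, 84.0, 83.0, 85.0]

baseline_mean, baseline_sd = summarize(baseline_runs)
method_mean, method_sd = summarize(method_runs)
t_stat = paired_t_statistic(baseline_runs, method_runs)
```

Pairing by seed matters here: it compares each Speech-FT run with the baseline run that shares its initialization and data order, so shared run-to-run noise cancels in the differences.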

Circularity Check

0 steps flagged

Empirical method with external benchmarks; no derivation reduces to fitted inputs

The paper proposes Speech-FT as a two-stage empirical procedure (drift-reducing fine-tuning then weight interpolation) and reports performance gains on HuBERT, wav2vec 2.0, etc., plus SUPERB metrics against standard fine-tuning, regularization, and LoRA baselines. No equations, fitted parameters renamed as predictions, or self-citation chains appear in the provided text; results are measured on held-out tasks and models rather than being forced by construction from the same data used to define the method.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The approach rests on the domain assumption that representational drift is the primary cause of lost generalization and that linear interpolation in weight space can recover similarity without new side effects; no free parameters or invented entities are introduced in the abstract.

axioms (1)

domain assumption: Excessive representational drift during fine-tuning is the main driver of lost cross-task generalization.
The abstract states this as the cause of degradation and the target of the first stage.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

An Exploration of Mamba for Speech Self-Supervised Models
cs.CL 2025-06 unverdicted novelty 7.0

Mamba-based HuBERT models match or exceed Transformer versions on speech tasks while using far less compute for long sequences and streaming ASR.

Reference graph

Works this paper leans on

63 extracted references · 63 canonical work pages · cited by 1 Pith paper · 2 internal anchors

[1]

wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations,

A. Baevski, Y . Zhou, A. Mohamed, and M. Auli, “wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations,” in The Thirty-Fourth Annual Conference on Neural Information Processing Systems, vol. 33, 2020, pp. 12 449–12 460

work page 2020
[2]

DeCoAR 2.0: Deep Contextualized Acoustic Repre- sentations with Vector Quantization,

S. Ling and Y . Liu, “DeCoAR 2.0: Deep Contextualized Acoustic Repre- sentations with Vector Quantization,”arXiv preprint arXiv:2012.06659, 2020

work page arXiv 2012
[3]

HuBERT: Self-Supervised Speech Representation Learn- ing by Masked Prediction of Hidden Units,

W.-N. Hsu, B. Bolte, Y .-H. H. Tsai, K. Lakhotia, R. Salakhutdinov, and A. Mohamed, “HuBERT: Self-Supervised Speech Representation Learn- ing by Masked Prediction of Hidden Units,”IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 29, pp. 3451–3460, 2021

work page 2021
[4]

WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing,

S. Chenet al., “WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing,”IEEE Journal of Selected Topics in Signal Processing, vol. 16, no. 6, pp. 1505–1518, 2022

work page 2022
[5]

SUPERB: Speech Processing Universal PERformance Benchmark,

S.-w. Yanget al., “SUPERB: Speech Processing Universal PERformance Benchmark,” inInterspeech 2021, 2021, pp. 1194–1198. 14

work page 2021
[6]

Speech Rep- resentation Learning Through Self-Supervised Pretraining and Multi- Task Finetuning,

Y .-C. Chen, S.-w. Yang, C.-K. Lee, S. See, and H.-y. Lee, “Speech Rep- resentation Learning Through Self-Supervised Pretraining and Multi- Task Finetuning,”arXiv preprint arXiv:2110.09930, 2021

work page arXiv 2021
[7]

What Happens in Continued Pre- Training? Analysis of Self-Supervised Speech Models with Continued Pre-Training for Colloquial Finnish ASR,

Y . Getman, T. Gr´osz, and M. Kurimo, “What Happens in Continued Pre- Training? Analysis of Self-Supervised Speech Models with Continued Pre-Training for Colloquial Finnish ASR,” inInterspeech 2024, 2024, pp. 5043–5047

work page 2024
[8]

Fine-Tuning Can Distort Pretrained Features and Underperform Out-of-Distribution,

A. Kumar, A. Raghunathan, R. Jones, T. Ma, and P. Liang, “Fine-Tuning Can Distort Pretrained Features and Underperform Out-of-Distribution,” inThe Tenth International Conference on Learning Representations, 2022, pp. 1–15

work page 2022
[9]

Overcoming Catastrophic Forgetting in Neural Networks,

J. Kirkpatricket al., “Overcoming Catastrophic Forgetting in Neural Networks,”Proceedings of the National Academy of Sciences, vol. 114, no. 13, pp. 3521–3526, 2017

work page 2017
[10]

Forget Me Not: Reducing Catastrophic Forgetting for Domain Adaptation in Reading Comprehension,

Y . Xu, X. Zhong, A. Jimeno-Yepes, and J. H. Lau, “Forget Me Not: Reducing Catastrophic Forgetting for Domain Adaptation in Reading Comprehension,”2020 International Joint Conference on Neural Net- works (IJCNN), pp. 1–8, 2019

work page 2020
[11]

Recall and Learn: Fine-tuning Deep Pretrained Language Models with Less Forgetting,

S. Chenet al., “Recall and Learn: Fine-tuning Deep Pretrained Language Models with Less Forgetting,” inProceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics, 2020, pp. 7870–7881

work page 2020
[12]

En- gineering flexible machine learning systems by traversing functionally invariant paths,

G. Raghavan, B. Tharwat, S. N. Hari, D. Satani, and M. Thomson, “En- gineering flexible machine learning systems by traversing functionally invariant paths,”Nature Machine Intelligence, vol. 6, no. 10, pp. 1179– 1196, 2024

work page 2024
[13]

Editing Models with Task Arithmetic,

G. Ilharcoet al., “Editing Models with Task Arithmetic,” inThe Eleventh International Conference on Learning Representations, 2023, pp. 1–17

work page 2023
[14]

LoRA: Low-Rank Adaptation of Large Language Models,

E. J. Huet al., “LoRA: Low-Rank Adaptation of Large Language Models,” inThe Tenth International Conference on Learning Repre- sentations, 2022, pp. 1–13

work page 2022
[15]

MelHuBERT: A Simplified HuBERT on Mel Spectrograms,

T.-Q. Lin, H.-y. Lee, and H. Tang, “MelHuBERT: A Simplified HuBERT on Mel Spectrograms,” in2023 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU). IEEE, 2023, pp. 1–8

work page 2023
[16]

Robust Fine-Tuning of Zero-Shot Models,

M. Wortsmanet al., “Robust Fine-Tuning of Zero-Shot Models,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022, pp. 7959–7971

work page 2022
[17]

Spurious Feature Diversification Improves Out-of-Distribution Generalization,

Y . Lin, L. Tan, Y . Hao, H. Wong, H. Dong, W. Zhanget al., “Spurious Feature Diversification Improves Out-of-Distribution Generalization,” inThe Twelfth International Conference on Learning Representations, 2024, pp. 1–14

work page 2024
[18]

Mitigating the Alignment Tax of RLHF,

Y . Linet al., “Mitigating the Alignment Tax of RLHF,” inProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024, pp. 580–606

work page 2024
[19]

Sequential Editing for Lifelong Training of Speech Recognition Mod- els,

D. Kulshreshtha, N. Pappas, B. Houston, S. Dingliwal, and S. Ronanki, “Sequential Editing for Lifelong Training of Speech Recognition Mod- els,” inInterspeech 2024, 2024, pp. 3919–3923

work page 2024
[20]

TIES- Merging: Resolving Interference When Merging Models,

P. Yadav, D. Tam, L. Choshen, C. A. Raffel, and M. Bansal, “TIES- Merging: Resolving Interference When Merging Models,” inThirty- seventh Conference on Neural Information Processing Systems, 2023

work page 2023
[21]

TED-LIUM: An Auto- matic Speech Recognition Dedicated Corpus,

A. Rousseau, P. Del ´eglise, and Y . Est `eve, “TED-LIUM: An Auto- matic Speech Recognition Dedicated Corpus,” inProceedings of the Eighth International Conference on Language Resources and Evaluation (LREC‘12). European Language Resources Association (ELRA), 2012, pp. 125–129

work page 2012
[22]

DARPA TIMIT: Acoustic-Phonetic Continuous Speech Corpus CD-ROM. NIST Speech Disc 1-1.1,

J. S. Garofolo, L. F. Lamel, W. M. Fisher, J. G. Fiscus, D. S. Pallett, and N. L. Dahlgren, “DARPA TIMIT: Acoustic-Phonetic Continuous Speech Corpus CD-ROM. NIST Speech Disc 1-1.1,”NASA STI/Recon Tech. Rep. N, vol. 93, 1993, Art. no. 27403

work page 1993
[23]

Phoneme Recognition on the TIMIT Database,

C. Lopes and F. Perdigao, “Phoneme Recognition on the TIMIT Database,” inSpeech Technologies, I. Ipsic, Ed. Rijeka: IntechOpen, 2011, ch. 14

work page 2011
[24]

LibriSpeech: An ASR Corpus Based on Public Domain Audio Books,

V . Panayotov, G. Chen, D. Povey, and S. Khudanpur, “LibriSpeech: An ASR Corpus Based on Public Domain Audio Books,” inICASSP 2015 - 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2015, pp. 5206–5210

work page 2015
[25]

Representation Learning with Contrastive Predictive Coding

A. van den Oord, Y . Li, and O. Vinyals, “Representation Learning with Contrastive Predictive Coding,”arXiv preprint arXiv:1807.03748, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018
[26]

CREMA-D: Crowd-Sourced Emotional Multimodal Actors Dataset,

H. Cao, D. G. Cooper, M. K. Keutmann, R. C. Gur, A. Nenkova, and R. Verma, “CREMA-D: Crowd-Sourced Emotional Multimodal Actors Dataset,”IEEE Transactions on Affective Computing, vol. 5, no. 4, pp. 377–390, 2014

work page 2014
[27]

EMO-SUPERB: An In-Depth Look at Speech Emotion Recognition,

H. Wuet al., “EMO-SUPERB: An In-Depth Look at Speech Emotion Recognition,”arXiv preprint arXiv:2402.13018, 2024

work page arXiv 2024
[28]

V oxCeleb: Large- Scale Speaker Verification in the Wild,

A. Nagrani, J. S. Chung, W. Xie, and A. Zisserman, “V oxCeleb: Large- Scale Speaker Verification in the Wild,”Computer Speech & Language, vol. 60, p. 101027, 2020

work page 2020
[29]

IEMOCAP: Interactive Emotional Dyadic Motion Capture Database,

C. Bussoet al., “IEMOCAP: Interactive Emotional Dyadic Motion Capture Database,”Journal of Language Resources and Evaluation, vol. 42, no. 4, pp. 335–359, 2008

work page 2008
[30]

AISHELL-3: A Multi- Speaker Mandarin TTS Corpus,

Y . Shi, H. Bu, X. Xu, S. Zhang, and M. Li, “AISHELL-3: A Multi- Speaker Mandarin TTS Corpus,” inInterspeech 2021, 2021, pp. 2756– 2760

work page 2021
[31]

DAISY: Data Adaptive Self- Supervised Early Exit for Speech Representation Models,

T.-Q. Lin, H.-y. Lee, and H. Tang, “DAISY: Data Adaptive Self- Supervised Early Exit for Speech Representation Models,” inInter- speech 2024, 2024, pp. 4513–4517

work page 2024
[32]

Fast- HuBERT: an Efficient Training Framework for Self-Supervised Speech Representation Learning,

G. Yang, Z. Ma, Z. Zheng, Y . Song, Z. Niu, and X. Chen, “Fast- HuBERT: an Efficient Training Framework for Self-Supervised Speech Representation Learning,” in2023 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU). IEEE, 2023, pp. 1–7

work page 2023
[33]

Self-training for end-to-end speech recognition,

J. Kahn, A. Lee, and A. Hannun, “Self-training for end-to-end speech recognition,” inICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020, pp. 7084–7088

work page 2020
[34]

MiniSU- PERB: Lightweight Benchmark for Self-Supervised Speech Models,

Y .-H. Wang, H.-Y . Chen, K.-W. Chang, W. Hsu, and H.-y. Lee, “MiniSU- PERB: Lightweight Benchmark for Self-Supervised Speech Models,” in 2023 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU). IEEE, 2023, pp. 1–8

work page 2023
[35]

Semi- Supervised Spoken Language Understanding via Self-Supervised Speech and Language Model Pretraining,

C.-I. Lai, Y .-S. Chuang, H.-Y . Lee, S.-W. Li, and J. Glass, “Semi- Supervised Spoken Language Understanding via Self-Supervised Speech and Language Model Pretraining,” inICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2021, pp. 7468–7472

work page 2021
[36]

SUPERB@SLT 2022: Challenge on Generalization and Efficiency of Self-Supervised Speech Representation Learning,

T.-h. Fenget al., “SUPERB@SLT 2022: Challenge on Generalization and Efficiency of Self-Supervised Speech Representation Learning,” in 2022 IEEE Spoken Language Technology Workshop (SLT). IEEE, 2023, pp. 1096–1103

work page 2022
[37]

A Large-Scale Evaluation of Speech Foundation Models,

S.-w. Yanget al., “A Large-Scale Evaluation of Speech Foundation Models,”IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 32, pp. 2884–2899, 2024

work page 2024
[38]

ML-SUPERB: Multilingual Speech Universal PERfor- mance Benchmark,

J. Shiet al., “ML-SUPERB: Multilingual Speech Universal PERfor- mance Benchmark,” inInterspeech 2023, 2023, pp. 884–888

work page 2023
[39]

Multi-resolution Hu- BERT: Multi-resolution Speech Self-Supervised Learning with Masked Unit Prediction,

J. Shi, H. Inaguma, X. Ma, I. Kulikov, and A. Sun, “Multi-resolution Hu- BERT: Multi-resolution Speech Self-Supervised Learning with Masked Unit Prediction,” inThe Twelfth International Conference on Learning Representations, 2024, pp. 1–19

work page 2024
[40]

Task- Agnostic Structured Pruning of Speech Representation Models,

H. Wang, S. Wang, W.-Q. Zhang, S. Hongbin, and Y . Wan, “Task- Agnostic Structured Pruning of Speech Representation Models,” in Interspeech 2023, 2023, pp. 231–235

work page 2023
[41]

COLLD: Contrastive Layer-to-Layer Distillation for Compressing Mul- tilingual Pre-Trained Speech Encoders,

H.-J. Changa, N. Dong, R. Mavlyutov, S. Popuri, and Y .-A. Chung, “COLLD: Contrastive Layer-to-Layer Distillation for Compressing Mul- tilingual Pre-Trained Speech Encoders,” inICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2024, pp. 10 801–10 805

work page 2024
[42]

MLS: A Large-Scale Multilingual Dataset for Speech Research

V . Pratap, Q. Xu, A. Sriram, G. Synnaeve, and R. Collobert, “MLS: A Large-Scale Multilingual Dataset for Speech Research,”arXiv preprint arXiv:2012.03411, 2020

work page internal anchor Pith review Pith/arXiv arXiv 2012
[43]

DoRA: Weight-Decomposed Low-Rank Adaptation,

S.-Y . Liu, C.-Y . Wang, H. Yin, P. Molchanov, Y .-C. F. Wang, K.- T. Cheng, and M.-H. Chen, “DoRA: Weight-Decomposed Low-Rank Adaptation,” inProceedings of the 41st International Conference on Machine Learning, 2024, pp. 32 100–32 121

work page 2024
[44]

Common V oice: A Massively-Multilingual Speech Cor- pus,

R. Ardilaet al., “Common V oice: A Massively-Multilingual Speech Cor- pus,” inProceedings of the Twelfth Language Resources and Evaluation Conference, 2020, pp. 4218–4222

work page 2020
[45]

Self-Supervised Speech Representation Learning: A Review,

A. Mohamedet al., “Self-Supervised Speech Representation Learning: A Review,”IEEE Journal of Selected Topics in Signal Processing, vol. 16, no. 6, pp. 1179–1210, 2022

work page 2022
[46]

Exploring Efficient-Tuning Methods in Self-Supervised Speech Models,

Z.-C. Chen, C.-L. Fu, C.-Y . Liu, S.-W. Li, and H. yi Lee, “Exploring Efficient-Tuning Methods in Self-Supervised Speech Models,” in2022 IEEE Spoken Language Technology Workshop (SLT). IEEE, 2023, pp. 1120–1127

work page 2023
[47]

Layer-wise analysis of a self-supervised speech representation model,

A. Pasad, J.-C. Chou, and K. Livescu, “Layer-wise analysis of a self-supervised speech representation model,” in2021 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU). IEEE, 2021, pp. 914–921

work page 2021
[48]

Going Beyond Linear Mode Connectivity: The Layerwise Linear Feature Connectivity,

Z. Zhou, Y . Yang, X. Yang, J. Yan, and W. Hu, “Going Beyond Linear Mode Connectivity: The Layerwise Linear Feature Connectivity,” inThe Thirty-seventh Annual Conference on Neural Information Processing Systems, vol. 36, 2023, pp. 60 853–60 877

work page 2023
[49]

A generalized solution of the orthogonal procrustes problem,

P. H. Sch ¨onemann, “A generalized solution of the orthogonal procrustes problem,”Psychometrika, vol. 31, no. 1, pp. 1–10, 1966

work page 1966
[50]

Ridge regression: Biased estimation for nonorthogonal problems,

A. E. Hoerl and R. W. Kennard, “Ridge regression: Biased estimation for nonorthogonal problems,”Technometrics, vol. 12, no. 1, pp. 55–67, 1970. 15

work page 1970
[51]

Training- free model merging for multi-target domain adaptation,

W. Li, M. G. Huan-ang Gao, B. Tian, R. Zhi, and H. Zhao, “Training- free model merging for multi-target domain adaptation,” inEuropean Conference on Computer Vision (ECCV). Springer, 2024, pp. 419– 438

work page 2024
[52]

MagMax: Leveraging Model Merging for Seamless Continual Learning,

D. Marczak, B. Twardowski, T. Trzci ´nski, and S. Cygert, “MagMax: Leveraging Model Merging for Seamless Continual Learning,” inEu- ropean Conference on Computer Vision (ECCV). Springer, 2024, pp. 379–395

work page 2024
[53]

MetaGPT: Merging Large Language Models Using Model Exclusive Task Arithmetic,

Y . Zhou, L. Song, B. Wang, and W. Chen, “MetaGPT: Merging Large Language Models Using Model Exclusive Task Arithmetic,” inProceed- ings of the 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics, 2024, pp. 1711–1724

work page 2024
[54]

Task Vector Algebra for ASR Models,

G. Ramesh and K. Audhkhasi, “Task Vector Algebra for ASR Models,” inICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2024, pp. 12 256– 12 260

work page 2024
[55]

Parameter Averaging Is All You Need To Prevent Forgetting,

P. Plantinga, J. Yoo, A. Girma, and C. Dhir, “Parameter Averaging Is All You Need To Prevent Forgetting,” in2024 IEEE Spoken Language Technology Workshop (SLT). IEEE, 2024, pp. 271–278

[56]

An Attribute Interpolation Method in Speech Synthesis by Model Merging,

M. Murata, K. Miyazaki, and T. Koriyama, “An Attribute Interpolation Method in Speech Synthesis by Model Merging,” in Interspeech 2024, 2024, pp. 3380–3384

[57]

Task Arithmetic for Language Expansion in Speech Translation,

Y.-F. Cheng et al., “Task Arithmetic for Language Expansion in Speech Translation,” arXiv preprint arXiv:2409.11274, 2024

[58]

Model Soups: Averaging Weights of Multiple Fine-Tuned Models Improves Accuracy Without Increasing Inference Time,

M. Wortsman et al., “Model Soups: Averaging Weights of Multiple Fine-Tuned Models Improves Accuracy Without Increasing Inference Time,” in Proceedings of the 39th International Conference on Machine Learning, vol. 162. PMLR, 2022, pp. 23965–23998

[59]

Continual Learning for Multi-Dialect Acoustic Models,

B. Houston and K. K. Kirchhoff, “Continual Learning for Multi-Dialect Acoustic Models,” in Interspeech 2020, 2020, pp. 576–580

[60]

On Fine-Tuning Pre-Trained Speech Models With EMA-Target Self-Supervised Loss,

H. Yang and H.-G. Kang, “On Fine-Tuning Pre-Trained Speech Models With EMA-Target Self-Supervised Loss,” in ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2024, pp. 6360–6364

[61]

Exploring Wav2vec 2.0 Fine Tuning for Improved Speech Emotion Recognition,

L.-W. Chen and A. Rudnicky, “Exploring Wav2vec 2.0 Fine Tuning for Improved Speech Emotion Recognition,” in ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2023, pp. 1–5

[62]

Attentive Merging of Hidden Embeddings from Pre-trained Speech Model for Anti-spoofing Detection,

Z. Pan, T. Liu, H. B. Sailor, and Q. Wang, “Attentive Merging of Hidden Embeddings from Pre-trained Speech Model for Anti-spoofing Detection,” in Interspeech 2024, 2024, pp. 2090–2094

[63]

Speech foundation model ensembles for the controlled singing voice deepfake detection (ctrsvdd) challenge 2024,

A. Guragain, T. Liu, Z. Pan, H. B. Sailor, and Q. Wang, “Speech foundation model ensembles for the controlled singing voice deepfake detection (ctrsvdd) challenge 2024,” in 2024 IEEE Spoken Language Technology Workshop (SLT). IEEE, 2024, pp. 774–781

