Speech-FT: Merging Pre-trained And Fine-Tuned Speech Representation Models For Cross-Task Generalization
Pith reviewed 2026-05-23 02:48 UTC · model grok-4.3
The pith
Speech-FT uses an initial drift-reducing fine-tune followed by weight interpolation to retain cross-task generalization while gaining task performance.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Speech-FT is a two-stage fine-tuning framework that first applies fine-tuning specifically designed to reduce representational drift, followed by weight-space interpolation with the pre-trained model to restore cross-task generalization. This produces higher feature similarity to the pre-trained model than methods that directly constrain weight changes, despite allowing larger overall updates, and yields concrete gains such as lowering phone error rate from 5.17% to 3.94% and raising speaker identification accuracy from 81.86% to 84.11% when fine-tuning HuBERT on automatic speech recognition.
What carries the argument
The two-stage sequence of drift-reducing fine-tuning followed by linear weight interpolation between the fine-tuned and pre-trained models.
If this is right
- Speech-FT improves performance on automatic speech recognition, speaker identification and other SUPERB tasks while preserving generalization.
- The approach outperforms weight-space regularization and LoRA across supervised, unsupervised and multitask scenarios.
- It maintains higher feature similarity to the pre-trained model than direct constraint methods despite larger weight updates.
- The same two-stage pattern works on multiple base models including HuBERT, wav2vec 2.0, DeCoAR 2.0 and WavLM Base+.
Where Pith is reading between the lines
- Controlling the trajectory of fine-tuning rather than only its magnitude may matter more for preserving generalization than previously assumed.
- The interpolation step could be tested with non-linear merging operators or applied after multiple drift-reduction rounds.
- Similar staged drift control plus merging might extend to other sequence models where pre-training and task adaptation conflict.
- The reported feature-similarity advantage suggests measuring representational drift directly during training could serve as an early stopping signal.
Load-bearing premise
That an initial fine-tuning stage can be engineered to cut representational drift enough for later weight interpolation to restore generalization without creating new failure modes.
What would settle it
An experiment in which Speech-FT produces lower feature similarity to the pre-trained model than a regularized fine-tune, or shows no cross-task improvement over standard fine-tuning on held-out SUPERB tasks.
Figures
read the original abstract
Fine-tuning speech representation models can enhance performance on specific tasks but often compromises their cross-task generalization ability. This degradation is often caused by excessive changes in the representations, making it difficult to retain information learned during pre-training. Existing approaches, such as regularizing weight changes during fine-tuning, may fail to maintain sufficiently high feature similarity with the pre-trained model, and thus could possibly lose cross-task generalization. To address this issue, we propose Speech-FT, a novel two-stage fine-tuning framework designed to maintain cross-task generalization while benefiting from fine-tuning. Speech-FT first applies fine-tuning specifically designed to reduce representational drift, followed by weight-space interpolation with the pre-trained model to restore cross-task generalization. Extensive experiments on HuBERT, wav2vec 2.0, DeCoAR 2.0, and WavLM Base+ demonstrate that Speech-FT consistently improves performance across a wide range of supervised, unsupervised, and multitask fine-tuning scenarios. Moreover, Speech-FT achieves superior cross-task generalization compared to fine-tuning baselines that explicitly constrain weight changes, such as weight-space regularization and LoRA fine-tuning. Our analysis reveals that Speech-FT maintains higher feature similarity to the pre-trained model compared to alternative strategies, despite allowing larger weight-space updates. Notably, Speech-FT achieves significant improvements on the SUPERB benchmark. For example, when fine-tuning HuBERT on automatic speech recognition, Speech-FT is able to reduce phone error rate from 5.17% to 3.94%, lower word error rate from 6.38% to 5.75%, and increase speaker identification accuracy from 81.86% to 84.11%. Speech-FT provides a simple yet powerful solution for further refining speech representation models after pre-training.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces Speech-FT, a two-stage fine-tuning framework for speech representation models (HuBERT, wav2vec 2.0, DeCoAR 2.0, WavLM Base+). Stage one applies specialized fine-tuning to reduce representational drift from the pre-trained checkpoint; stage two performs linear weight-space interpolation between the stage-one checkpoint and the original pre-trained model. Experiments across supervised, unsupervised, and multitask scenarios on the SUPERB benchmark report consistent gains over standard fine-tuning, weight-regularization baselines, and LoRA, with examples including HuBERT ASR phone error rate reduced from 5.17% to 3.94%, word error rate from 6.38% to 5.75%, and speaker identification accuracy increased from 81.86% to 84.11%. Analysis asserts higher feature similarity to the pre-trained model despite larger weight updates.
Significance. If the empirical claims are substantiated, Speech-FT supplies a lightweight, post-pre-training refinement technique that improves task performance while restoring cross-task generalization more effectively than explicit weight-change constraints. The evaluation spans four model families and multiple training regimes, which is a constructive aspect of the work. The absence of statistical controls and ablation detail, however, currently prevents a firm assessment of whether the two-stage procedure delivers a genuine advance over simpler interpolation or regularization.
major comments (3)
- [Abstract / §4 (Experimental Results)] Abstract and experimental results sections: the headline improvements (e.g., PER 5.17% → 3.94%, WER 6.38% → 5.75%, SID 81.86% → 84.11% for HuBERT ASR) are reported as single-point estimates without error bars, standard deviations across random seeds, or statistical significance tests. Because the central claim is that Speech-FT “consistently improves performance,” the lack of these controls makes it impossible to judge whether the reported deltas exceed typical run-to-run variance.
- [§3 (Proposed Method)] Method section (description of the first-stage fine-tuning): the procedure is characterized only as “fine-tuning specifically designed to reduce representational drift,” yet no equation, loss term, or hyper-parameter schedule distinguishes this stage from ordinary supervised fine-tuning. Without an ablation that isolates the drift-reduction stage from the subsequent interpolation, it remains unclear whether the two-stage pipeline is required or whether direct interpolation of a standard fine-tuned checkpoint would produce equivalent results.
- [§5 (Analysis)] Analysis section on feature similarity: the assertion that Speech-FT achieves “higher feature similarity to the pre-trained model … despite allowing larger weight-space updates” is load-bearing for the generalization argument, but the manuscript supplies neither the precise similarity metric (e.g., layer-wise cosine similarity on held-out data), the interpolation ratio schedule, nor side-by-side tables comparing similarity values across Speech-FT, LoRA, and weight-regularized baselines. This gap directly affects the claim that the method restores cross-task generalization via the observed similarity.
minor comments (1)
- [§3] The manuscript would benefit from an explicit statement of the interpolation coefficient (or search range) used in all reported experiments, as this hyper-parameter directly controls the trade-off between task performance and generalization.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed comments, which identify key areas where the manuscript can be strengthened. We address each major comment below and commit to revisions that will incorporate additional experiments, clarifications, and quantitative details.
read point-by-point responses
-
Referee: [Abstract / §4 (Experimental Results)] Abstract and experimental results sections: the headline improvements (e.g., PER 5.17% → 3.94%, WER 6.38% → 5.75%, SID 81.86% → 84.11% for HuBERT ASR) are reported as single-point estimates without error bars, standard deviations across random seeds, or statistical significance tests. Because the central claim is that Speech-FT “consistently improves performance,” the lack of these controls makes it impossible to judge whether the reported deltas exceed typical run-to-run variance.
Authors: We acknowledge that single-point estimates without variance measures or significance tests weaken the ability to substantiate the claim of consistent improvement. In the revised manuscript we will rerun all primary experiments across at least five random seeds, report means and standard deviations, and include paired statistical significance tests (e.g., t-tests) for the headline comparisons against baselines. revision: yes
-
Referee: [§3 (Proposed Method)] Method section (description of the first-stage fine-tuning): the procedure is characterized only as “fine-tuning specifically designed to reduce representational drift,” yet no equation, loss term, or hyper-parameter schedule distinguishes this stage from ordinary supervised fine-tuning. Without an ablation that isolates the drift-reduction stage from the subsequent interpolation, it remains unclear whether the two-stage pipeline is required or whether direct interpolation of a standard fine-tuned checkpoint would produce equivalent results.
Authors: The referee correctly notes that the first-stage procedure lacks an explicit formulation. The drift-reduction stage employs a representation-level regularization term (added to the task loss) that penalizes deviation of intermediate activations from the pre-trained model; we will insert the precise loss equation and hyper-parameter schedule into Section 3. We will also add an ablation that directly compares (a) standard fine-tuning followed by interpolation versus (b) the proposed drift-reduced stage followed by interpolation, thereby isolating the contribution of each component. revision: yes
-
Referee: [§5 (Analysis)] Analysis section on feature similarity: the assertion that Speech-FT achieves “higher feature similarity to the pre-trained model … despite allowing larger weight-space updates” is load-bearing for the generalization argument, but the manuscript supplies neither the precise similarity metric (e.g., layer-wise cosine similarity on held-out data), the interpolation ratio schedule, nor side-by-side tables comparing similarity values across Speech-FT, LoRA, and weight-regularized baselines. This gap directly affects the claim that the method restores cross-task generalization via the observed similarity.
Authors: We agree that the similarity analysis requires greater specificity and comparative evidence. The revised manuscript will (i) define the metric as layer-wise cosine similarity evaluated on held-out data, (ii) report the interpolation ratios (including the schedule or selected α values), and (iii) add a table that juxtaposes similarity scores for Speech-FT, LoRA, and weight-regularization baselines, directly supporting the generalization argument. revision: yes
Circularity Check
Empirical method with external benchmarks; no derivation reduces to fitted inputs
full rationale
The paper proposes Speech-FT as a two-stage empirical procedure (drift-reducing fine-tuning then weight interpolation) and reports performance gains on HuBERT, wav2vec 2.0, etc., plus SUPERB metrics against standard fine-tuning, regularization, and LoRA baselines. No equations, fitted parameters renamed as predictions, or self-citation chains appear in the provided text; results are measured on held-out tasks and models rather than being forced by construction from the same data used to define the method.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Excessive representational drift during fine-tuning is the main driver of lost cross-task generalization
Forward citations
Cited by 1 Pith paper
-
An Exploration of Mamba for Speech Self-Supervised Models
Mamba-based HuBERT models match or exceed Transformer versions on speech tasks while using far less compute for long sequences and streaming ASR.
Reference graph
Works this paper leans on
-
[1]
wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations,
A. Baevski, Y . Zhou, A. Mohamed, and M. Auli, “wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations,” in The Thirty-Fourth Annual Conference on Neural Information Processing Systems, vol. 33, 2020, pp. 12 449–12 460
work page 2020
-
[2]
DeCoAR 2.0: Deep Contextualized Acoustic Repre- sentations with Vector Quantization,
S. Ling and Y . Liu, “DeCoAR 2.0: Deep Contextualized Acoustic Repre- sentations with Vector Quantization,”arXiv preprint arXiv:2012.06659, 2020
-
[3]
HuBERT: Self-Supervised Speech Representation Learn- ing by Masked Prediction of Hidden Units,
W.-N. Hsu, B. Bolte, Y .-H. H. Tsai, K. Lakhotia, R. Salakhutdinov, and A. Mohamed, “HuBERT: Self-Supervised Speech Representation Learn- ing by Masked Prediction of Hidden Units,”IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 29, pp. 3451–3460, 2021
work page 2021
-
[4]
WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing,
S. Chenet al., “WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing,”IEEE Journal of Selected Topics in Signal Processing, vol. 16, no. 6, pp. 1505–1518, 2022
work page 2022
-
[5]
SUPERB: Speech Processing Universal PERformance Benchmark,
S.-w. Yanget al., “SUPERB: Speech Processing Universal PERformance Benchmark,” inInterspeech 2021, 2021, pp. 1194–1198. 14
work page 2021
-
[6]
Speech Rep- resentation Learning Through Self-Supervised Pretraining and Multi- Task Finetuning,
Y .-C. Chen, S.-w. Yang, C.-K. Lee, S. See, and H.-y. Lee, “Speech Rep- resentation Learning Through Self-Supervised Pretraining and Multi- Task Finetuning,”arXiv preprint arXiv:2110.09930, 2021
-
[7]
Y . Getman, T. Gr´osz, and M. Kurimo, “What Happens in Continued Pre- Training? Analysis of Self-Supervised Speech Models with Continued Pre-Training for Colloquial Finnish ASR,” inInterspeech 2024, 2024, pp. 5043–5047
work page 2024
-
[8]
Fine-Tuning Can Distort Pretrained Features and Underperform Out-of-Distribution,
A. Kumar, A. Raghunathan, R. Jones, T. Ma, and P. Liang, “Fine-Tuning Can Distort Pretrained Features and Underperform Out-of-Distribution,” inThe Tenth International Conference on Learning Representations, 2022, pp. 1–15
work page 2022
-
[9]
Overcoming Catastrophic Forgetting in Neural Networks,
J. Kirkpatricket al., “Overcoming Catastrophic Forgetting in Neural Networks,”Proceedings of the National Academy of Sciences, vol. 114, no. 13, pp. 3521–3526, 2017
work page 2017
-
[10]
Forget Me Not: Reducing Catastrophic Forgetting for Domain Adaptation in Reading Comprehension,
Y . Xu, X. Zhong, A. Jimeno-Yepes, and J. H. Lau, “Forget Me Not: Reducing Catastrophic Forgetting for Domain Adaptation in Reading Comprehension,”2020 International Joint Conference on Neural Net- works (IJCNN), pp. 1–8, 2019
work page 2020
-
[11]
Recall and Learn: Fine-tuning Deep Pretrained Language Models with Less Forgetting,
S. Chenet al., “Recall and Learn: Fine-tuning Deep Pretrained Language Models with Less Forgetting,” inProceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics, 2020, pp. 7870–7881
work page 2020
-
[12]
En- gineering flexible machine learning systems by traversing functionally invariant paths,
G. Raghavan, B. Tharwat, S. N. Hari, D. Satani, and M. Thomson, “En- gineering flexible machine learning systems by traversing functionally invariant paths,”Nature Machine Intelligence, vol. 6, no. 10, pp. 1179– 1196, 2024
work page 2024
-
[13]
Editing Models with Task Arithmetic,
G. Ilharcoet al., “Editing Models with Task Arithmetic,” inThe Eleventh International Conference on Learning Representations, 2023, pp. 1–17
work page 2023
-
[14]
LoRA: Low-Rank Adaptation of Large Language Models,
E. J. Huet al., “LoRA: Low-Rank Adaptation of Large Language Models,” inThe Tenth International Conference on Learning Repre- sentations, 2022, pp. 1–13
work page 2022
-
[15]
MelHuBERT: A Simplified HuBERT on Mel Spectrograms,
T.-Q. Lin, H.-y. Lee, and H. Tang, “MelHuBERT: A Simplified HuBERT on Mel Spectrograms,” in2023 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU). IEEE, 2023, pp. 1–8
work page 2023
-
[16]
Robust Fine-Tuning of Zero-Shot Models,
M. Wortsmanet al., “Robust Fine-Tuning of Zero-Shot Models,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022, pp. 7959–7971
work page 2022
-
[17]
Spurious Feature Diversification Improves Out-of-Distribution Generalization,
Y . Lin, L. Tan, Y . Hao, H. Wong, H. Dong, W. Zhanget al., “Spurious Feature Diversification Improves Out-of-Distribution Generalization,” inThe Twelfth International Conference on Learning Representations, 2024, pp. 1–14
work page 2024
-
[18]
Mitigating the Alignment Tax of RLHF,
Y . Linet al., “Mitigating the Alignment Tax of RLHF,” inProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024, pp. 580–606
work page 2024
-
[19]
Sequential Editing for Lifelong Training of Speech Recognition Mod- els,
D. Kulshreshtha, N. Pappas, B. Houston, S. Dingliwal, and S. Ronanki, “Sequential Editing for Lifelong Training of Speech Recognition Mod- els,” inInterspeech 2024, 2024, pp. 3919–3923
work page 2024
-
[20]
TIES- Merging: Resolving Interference When Merging Models,
P. Yadav, D. Tam, L. Choshen, C. A. Raffel, and M. Bansal, “TIES- Merging: Resolving Interference When Merging Models,” inThirty- seventh Conference on Neural Information Processing Systems, 2023
work page 2023
-
[21]
TED-LIUM: An Auto- matic Speech Recognition Dedicated Corpus,
A. Rousseau, P. Del ´eglise, and Y . Est `eve, “TED-LIUM: An Auto- matic Speech Recognition Dedicated Corpus,” inProceedings of the Eighth International Conference on Language Resources and Evaluation (LREC‘12). European Language Resources Association (ELRA), 2012, pp. 125–129
work page 2012
-
[22]
DARPA TIMIT: Acoustic-Phonetic Continuous Speech Corpus CD-ROM. NIST Speech Disc 1-1.1,
J. S. Garofolo, L. F. Lamel, W. M. Fisher, J. G. Fiscus, D. S. Pallett, and N. L. Dahlgren, “DARPA TIMIT: Acoustic-Phonetic Continuous Speech Corpus CD-ROM. NIST Speech Disc 1-1.1,”NASA STI/Recon Tech. Rep. N, vol. 93, 1993, Art. no. 27403
work page 1993
-
[23]
Phoneme Recognition on the TIMIT Database,
C. Lopes and F. Perdigao, “Phoneme Recognition on the TIMIT Database,” inSpeech Technologies, I. Ipsic, Ed. Rijeka: IntechOpen, 2011, ch. 14
work page 2011
-
[24]
LibriSpeech: An ASR Corpus Based on Public Domain Audio Books,
V . Panayotov, G. Chen, D. Povey, and S. Khudanpur, “LibriSpeech: An ASR Corpus Based on Public Domain Audio Books,” inICASSP 2015 - 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2015, pp. 5206–5210
work page 2015
-
[25]
Representation Learning with Contrastive Predictive Coding
A. van den Oord, Y . Li, and O. Vinyals, “Representation Learning with Contrastive Predictive Coding,”arXiv preprint arXiv:1807.03748, 2018
work page internal anchor Pith review Pith/arXiv arXiv 2018
-
[26]
CREMA-D: Crowd-Sourced Emotional Multimodal Actors Dataset,
H. Cao, D. G. Cooper, M. K. Keutmann, R. C. Gur, A. Nenkova, and R. Verma, “CREMA-D: Crowd-Sourced Emotional Multimodal Actors Dataset,”IEEE Transactions on Affective Computing, vol. 5, no. 4, pp. 377–390, 2014
work page 2014
-
[27]
EMO-SUPERB: An In-Depth Look at Speech Emotion Recognition,
H. Wuet al., “EMO-SUPERB: An In-Depth Look at Speech Emotion Recognition,”arXiv preprint arXiv:2402.13018, 2024
-
[28]
V oxCeleb: Large- Scale Speaker Verification in the Wild,
A. Nagrani, J. S. Chung, W. Xie, and A. Zisserman, “V oxCeleb: Large- Scale Speaker Verification in the Wild,”Computer Speech & Language, vol. 60, p. 101027, 2020
work page 2020
-
[29]
IEMOCAP: Interactive Emotional Dyadic Motion Capture Database,
C. Bussoet al., “IEMOCAP: Interactive Emotional Dyadic Motion Capture Database,”Journal of Language Resources and Evaluation, vol. 42, no. 4, pp. 335–359, 2008
work page 2008
-
[30]
AISHELL-3: A Multi- Speaker Mandarin TTS Corpus,
Y . Shi, H. Bu, X. Xu, S. Zhang, and M. Li, “AISHELL-3: A Multi- Speaker Mandarin TTS Corpus,” inInterspeech 2021, 2021, pp. 2756– 2760
work page 2021
-
[31]
DAISY: Data Adaptive Self- Supervised Early Exit for Speech Representation Models,
T.-Q. Lin, H.-y. Lee, and H. Tang, “DAISY: Data Adaptive Self- Supervised Early Exit for Speech Representation Models,” inInter- speech 2024, 2024, pp. 4513–4517
work page 2024
-
[32]
Fast- HuBERT: an Efficient Training Framework for Self-Supervised Speech Representation Learning,
G. Yang, Z. Ma, Z. Zheng, Y . Song, Z. Niu, and X. Chen, “Fast- HuBERT: an Efficient Training Framework for Self-Supervised Speech Representation Learning,” in2023 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU). IEEE, 2023, pp. 1–7
work page 2023
-
[33]
Self-training for end-to-end speech recognition,
J. Kahn, A. Lee, and A. Hannun, “Self-training for end-to-end speech recognition,” inICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020, pp. 7084–7088
work page 2020
-
[34]
MiniSU- PERB: Lightweight Benchmark for Self-Supervised Speech Models,
Y .-H. Wang, H.-Y . Chen, K.-W. Chang, W. Hsu, and H.-y. Lee, “MiniSU- PERB: Lightweight Benchmark for Self-Supervised Speech Models,” in 2023 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU). IEEE, 2023, pp. 1–8
work page 2023
-
[35]
C.-I. Lai, Y .-S. Chuang, H.-Y . Lee, S.-W. Li, and J. Glass, “Semi- Supervised Spoken Language Understanding via Self-Supervised Speech and Language Model Pretraining,” inICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2021, pp. 7468–7472
work page 2021
-
[36]
T.-h. Fenget al., “SUPERB@SLT 2022: Challenge on Generalization and Efficiency of Self-Supervised Speech Representation Learning,” in 2022 IEEE Spoken Language Technology Workshop (SLT). IEEE, 2023, pp. 1096–1103
work page 2022
-
[37]
A Large-Scale Evaluation of Speech Foundation Models,
S.-w. Yanget al., “A Large-Scale Evaluation of Speech Foundation Models,”IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 32, pp. 2884–2899, 2024
work page 2024
-
[38]
ML-SUPERB: Multilingual Speech Universal PERfor- mance Benchmark,
J. Shiet al., “ML-SUPERB: Multilingual Speech Universal PERfor- mance Benchmark,” inInterspeech 2023, 2023, pp. 884–888
work page 2023
-
[39]
J. Shi, H. Inaguma, X. Ma, I. Kulikov, and A. Sun, “Multi-resolution Hu- BERT: Multi-resolution Speech Self-Supervised Learning with Masked Unit Prediction,” inThe Twelfth International Conference on Learning Representations, 2024, pp. 1–19
work page 2024
-
[40]
Task- Agnostic Structured Pruning of Speech Representation Models,
H. Wang, S. Wang, W.-Q. Zhang, S. Hongbin, and Y . Wan, “Task- Agnostic Structured Pruning of Speech Representation Models,” in Interspeech 2023, 2023, pp. 231–235
work page 2023
-
[41]
H.-J. Changa, N. Dong, R. Mavlyutov, S. Popuri, and Y .-A. Chung, “COLLD: Contrastive Layer-to-Layer Distillation for Compressing Mul- tilingual Pre-Trained Speech Encoders,” inICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2024, pp. 10 801–10 805
work page 2024
-
[42]
MLS: A Large-Scale Multilingual Dataset for Speech Research
V . Pratap, Q. Xu, A. Sriram, G. Synnaeve, and R. Collobert, “MLS: A Large-Scale Multilingual Dataset for Speech Research,”arXiv preprint arXiv:2012.03411, 2020
work page internal anchor Pith review Pith/arXiv arXiv 2012
-
[43]
DoRA: Weight-Decomposed Low-Rank Adaptation,
S.-Y . Liu, C.-Y . Wang, H. Yin, P. Molchanov, Y .-C. F. Wang, K.- T. Cheng, and M.-H. Chen, “DoRA: Weight-Decomposed Low-Rank Adaptation,” inProceedings of the 41st International Conference on Machine Learning, 2024, pp. 32 100–32 121
work page 2024
-
[44]
Common V oice: A Massively-Multilingual Speech Cor- pus,
R. Ardilaet al., “Common V oice: A Massively-Multilingual Speech Cor- pus,” inProceedings of the Twelfth Language Resources and Evaluation Conference, 2020, pp. 4218–4222
work page 2020
-
[45]
Self-Supervised Speech Representation Learning: A Review,
A. Mohamedet al., “Self-Supervised Speech Representation Learning: A Review,”IEEE Journal of Selected Topics in Signal Processing, vol. 16, no. 6, pp. 1179–1210, 2022
work page 2022
-
[46]
Exploring Efficient-Tuning Methods in Self-Supervised Speech Models,
Z.-C. Chen, C.-L. Fu, C.-Y . Liu, S.-W. Li, and H. yi Lee, “Exploring Efficient-Tuning Methods in Self-Supervised Speech Models,” in2022 IEEE Spoken Language Technology Workshop (SLT). IEEE, 2023, pp. 1120–1127
work page 2023
-
[47]
Layer-wise analysis of a self-supervised speech representation model,
A. Pasad, J.-C. Chou, and K. Livescu, “Layer-wise analysis of a self-supervised speech representation model,” in2021 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU). IEEE, 2021, pp. 914–921
work page 2021
-
[48]
Going Beyond Linear Mode Connectivity: The Layerwise Linear Feature Connectivity,
Z. Zhou, Y . Yang, X. Yang, J. Yan, and W. Hu, “Going Beyond Linear Mode Connectivity: The Layerwise Linear Feature Connectivity,” inThe Thirty-seventh Annual Conference on Neural Information Processing Systems, vol. 36, 2023, pp. 60 853–60 877
work page 2023
-
[49]
A generalized solution of the orthogonal procrustes problem,
P. H. Sch ¨onemann, “A generalized solution of the orthogonal procrustes problem,”Psychometrika, vol. 31, no. 1, pp. 1–10, 1966
work page 1966
-
[50]
Ridge regression: Biased estimation for nonorthogonal problems,
A. E. Hoerl and R. W. Kennard, “Ridge regression: Biased estimation for nonorthogonal problems,”Technometrics, vol. 12, no. 1, pp. 55–67, 1970. 15
work page 1970
-
[51]
Training- free model merging for multi-target domain adaptation,
W. Li, M. G. Huan-ang Gao, B. Tian, R. Zhi, and H. Zhao, “Training- free model merging for multi-target domain adaptation,” inEuropean Conference on Computer Vision (ECCV). Springer, 2024, pp. 419– 438
work page 2024
-
[52]
MagMax: Leveraging Model Merging for Seamless Continual Learning,
D. Marczak, B. Twardowski, T. Trzci ´nski, and S. Cygert, “MagMax: Leveraging Model Merging for Seamless Continual Learning,” inEu- ropean Conference on Computer Vision (ECCV). Springer, 2024, pp. 379–395
work page 2024
-
[53]
MetaGPT: Merging Large Language Models Using Model Exclusive Task Arithmetic,
Y . Zhou, L. Song, B. Wang, and W. Chen, “MetaGPT: Merging Large Language Models Using Model Exclusive Task Arithmetic,” inProceed- ings of the 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics, 2024, pp. 1711–1724
work page 2024
-
[54]
Task Vector Algebra for ASR Models,
G. Ramesh and K. Audhkhasi, “Task Vector Algebra for ASR Models,” inICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2024, pp. 12 256– 12 260
work page 2024
-
[55]
Parameter Averaging Is All You Need To Prevent Forgetting,
P. Plantinga, J. Yoo, A. Girma, and C. Dhir, “Parameter Averaging Is All You Need To Prevent Forgetting,” in2024 IEEE Spoken Language Technology Workshop (SLT). IEEE, 2024, pp. 271–278
work page 2024
-
[56]
An Attribute Interpolation Method in Speech Synthesis by Model Merging,
M. Murata, K. Miyazaki, and T. Koriyama, “An Attribute Interpolation Method in Speech Synthesis by Model Merging,” inInterspeech 2024, 2024, pp. 3380–3384
work page 2024
-
[57]
Task Arithmetic for Language Expansion in Speech Translation,
Y .-F. Chenget al., “Task Arithmetic for Language Expansion in Speech Translation,”arXiv preprint arXiv:2409.11274, 2024
-
[58]
M. Wortsmanet al., “Model Soups: Averaging Weights of Multiple Fine-Tuned Models Improves Accuracy Without Increasing Inference Time,” inProceedings of the 39th International Conference on Machine Learning, vol. 162. PMLR, 2022, pp. 23 965–23 998
work page 2022
-
[59]
Continual Learning for Multi-Dialect Acoustic Models,
B. Houston and K. K. Kirchhoff, “Continual Learning for Multi-Dialect Acoustic Models,” inInterspeech 2020, 2020, pp. 576–580
work page 2020
-
[60]
On Fine-Tuning Pre-Trained Speech Models With EMA-Target Self-Supervised Loss,
H. Yang and H.-G. Kang, “On Fine-Tuning Pre-Trained Speech Models With EMA-Target Self-Supervised Loss,” inICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2024, pp. 6360–6364
work page 2024
-
[61]
Exploring Wav2vec 2.0 Fine Tuning for Improved Speech Emotion Recognition,
L.-W. Chen and A. Rudnicky, “Exploring Wav2vec 2.0 Fine Tuning for Improved Speech Emotion Recognition,” inICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2023, pp. 1–5
work page 2023
-
[62]
Attentive Merging of Hidden Embeddings from Pre-trained Speech Model for Anti-spoofing Detection,
Z. Pan, T. Liu, H. B. Sailor, and Q. Wang, “Attentive Merging of Hidden Embeddings from Pre-trained Speech Model for Anti-spoofing Detection,” inInterspeech 2024, 2024, pp. 2090–2094
work page 2024
-
[63]
A. Guragain, T. Liu, Z. Pan, H. B. Sailor, and Q. Wang, “Speech foundation model ensembles for the controlled singing voice deepfake detection (ctrsvdd) challenge 2024,” in2024 IEEE Spoken Language Technology Workshop (SLT). IEEE, 2024, pp. 774–781
work page 2024
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.