Improving Music Source Separation with Diffusion and Consistency Refinement
Pith reviewed 2026-05-23 07:30 UTC · model grok-4.3
The pith
A diffusion model refines the outputs of a music source separator, and consistency distillation reduces the refinement to a single step while preserving the gains.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper claims that training a diffusion model on the separated sources produced by a deterministic separator, and then applying consistency distillation to it, yields a general-purpose refiner. Attached to either a custom U-Net on Slakh2100 or BS-RoFormer on MUSDB18, the refiner raises separation quality to state-of-the-art scores, and the distilled version needs only one inference step.
What carries the argument
The consistency-distilled diffusion refiner: a generative model that takes a base separator's output and applies learned denoising to correct residual interference, distilled so that one step approximates the full iterative process.
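As a rough illustration of the iterative half of this mechanism, the sketch below perturbs a separator output with Gaussian noise and then runs a few deterministic Euler denoising steps over a decreasing noise schedule. Everything in it is an assumption made for illustration: `toy_denoiser` is a hypothetical stand-in for the trained network (it is the posterior mean under an assumed Gaussian prior centred on the separator output), and the noise schedule, the prior width `tau`, and the choice to start from a noised separator estimate are not taken from the paper.

```python
import numpy as np

def toy_denoiser(x_noisy, sigma, cond, tau=0.05):
    """Hypothetical stand-in for a trained denoiser D(x, sigma | cond).

    Assumes the clean source is distributed as N(cond, tau^2), so the
    posterior mean shrinks the noisy input toward the conditioning signal.
    A real refiner would be a neural network.
    """
    w = tau ** 2 / (tau ** 2 + sigma ** 2)
    return cond + w * (x_noisy - cond)

def refine_iterative(sep_out, denoiser, sigmas, rng):
    """Euler steps along a decreasing noise schedule that ends at 0."""
    # Assumption: sampling starts from the separator output plus noise.
    x = sep_out + sigmas[0] * rng.standard_normal(sep_out.shape)
    for s_cur, s_next in zip(sigmas[:-1], sigmas[1:]):
        denoised = denoiser(x, s_cur, sep_out)
        slope = (x - denoised) / s_cur      # dx/dsigma at the current noise level
        x = x + (s_next - s_cur) * slope    # move toward the lower noise level
    return x

rng = np.random.default_rng(0)
sep_out = np.sin(np.linspace(0.0, 2.0 * np.pi, 16000))  # toy "separated stem"
sigmas = np.array([0.5, 0.25, 0.1, 0.0])                # hypothetical schedule
refined = refine_iterative(sep_out, toy_denoiser, sigmas, rng)
```

Each loop iteration costs one denoiser evaluation, which is the inference cost that the distillation stage is meant to remove.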
If this is right
- Quality gains appear when the refiner is placed after a U-Net separator on Slakh2100.
- State-of-the-art scores are reached when the same refiner is placed after BS-RoFormer on MUSDB18.
- Single-step inference from the distilled model maintains the quality improvement of the original diffusion process.
- Two or more steps from the distilled model exceed the quality of the undistilled diffusion refiner.
- No architecture-specific retraining of the refiner is required to obtain the reported gains.
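The one-step and multi-step modes mentioned in this list can be sketched with the standard multistep consistency sampling loop: map a noisy input to a clean estimate in one evaluation, then optionally re-noise to a smaller noise level and map again. The sketch is hedged throughout: `toy_consistency_fn` is a hypothetical placeholder for the distilled network, and the noise levels, the `eps` floor, and the initialisation from a noised separator output are assumptions, not the paper's settings.

```python
import numpy as np

def toy_consistency_fn(x_noisy, sigma, cond, tau=0.05):
    """Hypothetical consistency function f(x, sigma | cond) -> clean estimate."""
    w = tau ** 2 / (tau ** 2 + sigma ** 2)
    return cond + w * (x_noisy - cond)

def consistency_sample(sep_out, f, sigmas, rng, eps=0.002):
    """Multistep consistency sampling; len(sigmas) equals the number of steps."""
    # Assumption: the chain starts from the separator output plus noise.
    x = sep_out + sigmas[0] * rng.standard_normal(sep_out.shape)
    x = f(x, sigmas[0], sep_out)                 # step 1: noisy -> clean estimate
    for s in sigmas[1:]:                         # optional extra steps
        noise = rng.standard_normal(sep_out.shape)
        x = x + np.sqrt(s ** 2 - eps ** 2) * noise   # re-noise to level s
        x = f(x, s, sep_out)                         # map back to a clean estimate
    return x

calls = []
def counting_f(x, sigma, cond):
    """Wrapper that records each evaluation of the (toy) network."""
    calls.append(sigma)
    return toy_consistency_fn(x, sigma, cond)

rng = np.random.default_rng(0)
sep_out = np.sin(np.linspace(0.0, 2.0 * np.pi, 16000))

one_step = consistency_sample(sep_out, counting_f, [0.5], rng)
n_calls_one = len(calls)

calls.clear()
two_step = consistency_sample(sep_out, counting_f, [0.5, 0.1], rng)
n_calls_two = len(calls)
```

The toy function only demonstrates the control flow and the cost per step; whether extra steps actually improve quality is an empirical claim that this sketch does not test.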
Where Pith is reading between the lines
- The same refinement stage could be tested on other audio tasks such as enhancement or dereverberation that already use deterministic front-ends.
- If the refiner generalizes across backbones, designers might deliberately train lighter or faster base separators and rely on the distilled stage for final quality.
- Training the refiner on a mixture of outputs from several different separators might further increase robustness to domain shift.
Load-bearing premise
A diffusion model trained on the outputs of one separator can be applied as a general last-stage refiner to the outputs of other separators without introducing new artifacts or domain-shift problems.
What would settle it
If the distilled refiner applied to a new separator produces lower objective separation scores or audible artifacts relative to the base separator alone, the claim of architecture-agnostic refinement would be falsified.
Original abstract
In this work, we propose an approach to music source separation that uses a generative diffusion model as a last-stage refinement on top of a deterministic separator, progressively enhancing the separated sources through iterative denoising. While the diffusion refinement yields measurable quality gains, it requires iterative steps at inference, increasing computational cost. To speed up the inference process, we apply consistency distillation, reducing inference to a single step while maintaining quality; with two or more steps, the distilled model even surpasses the diffusion-based approach. Crucially, our method is architecture-agnostic: we demonstrate state-of-the-art results when applied to both a custom U-Net-based separator on Slakh2100 and the state-of-the-art BS-RoFormer model on MUSDB18, showing that the refinement generalizes across backbone architectures. Sound examples are available at: https://consistency-separation.github.io/.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes a post-processing pipeline for music source separation that applies a generative diffusion model to iteratively refine the outputs of a deterministic separator, followed by consistency distillation to reduce inference to one or a few steps. It reports measurable quality improvements and claims the method is architecture-agnostic, achieving state-of-the-art results when applied to a custom U-Net on Slakh2100 and to BS-RoFormer on MUSDB18.
Significance. If the generalization claim holds, the approach would offer a modular, architecture-independent refinement stage that can be distilled for fast inference, potentially improving a range of existing separators with limited added cost. The consistency-distillation component addresses a practical inference bottleneck and is a clear technical contribution if the quality retention is rigorously quantified.
Major comments (3)
- [Abstract / Experiments] Abstract and experimental claims: the assertion that the method is 'architecture-agnostic' and that 'the refinement generalizes across backbone architectures' is not supported by the reported experiments. The diffusion model is trained separately on the outputs of each specific separator (U-Net outputs on Slakh2100; BS-RoFormer outputs on MUSDB18), so the results demonstrate per-backbone training rather than zero-shot or cross-backbone transfer of a single refiner.
- [Experiments] Experimental design (results section): because the training distributions are tied to both the backbone outputs and the dataset (Slakh2100 vs. MUSDB18), the setup confounds backbone generalization with dataset shift. A load-bearing test for the central claim would require training the refiner on one backbone/dataset and evaluating it on the other without retraining.
- [Abstract] Abstract: the claim of 'state-of-the-art results' and 'measurable quality gains' is stated without any numerical metrics, baseline SDR/SI-SDR values, or statistical significance tests, making it impossible to assess the magnitude or reliability of the reported improvements from the provided summary.
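For reference, the two metrics named in the last comment can be written down in a few lines. The sketch below uses the textbook definitions of plain SDR and scale-invariant SDR for a single-channel signal; it is not the paper's evaluation code, and published benchmark numbers are often computed with BSS Eval variants that differ from these simple forms. The test signals are synthetic.

```python
import numpy as np

def sdr(reference, estimate, eps=1e-12):
    """Plain SDR in dB: 10 * log10(||s||^2 / ||s - s_hat||^2)."""
    num = np.sum(reference ** 2)
    den = np.sum((reference - estimate) ** 2)
    return 10.0 * np.log10((num + eps) / (den + eps))

def si_sdr(reference, estimate, eps=1e-12):
    """Scale-invariant SDR in dB.

    The reference is rescaled by the optimal gain alpha before the ratio is
    taken, so multiplying the estimate by a constant leaves the score unchanged.
    """
    alpha = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    target = alpha * reference
    residual = estimate - target
    return 10.0 * np.log10((np.sum(target ** 2) + eps) / (np.sum(residual ** 2) + eps))

# Synthetic check: a cleaner estimate should score higher.
rng = np.random.default_rng(0)
source = rng.standard_normal(8000)
interference = rng.standard_normal(8000)
base_est = source + 0.3 * interference      # stand-in for a base separator output
refined_est = source + 0.1 * interference   # stand-in for a refined output

gain_db = si_sdr(source, refined_est) - si_sdr(source, base_est)
```

Reporting the base score alongside `gain_db` for each stem is what would let a reader judge the size of an improvement.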
Minor comments (2)
- [Abstract] The abstract states that consistency distillation 'maintains quality' or 'surpasses' the diffusion baseline with two or more steps, but does not specify the exact number of steps, the distillation loss, or the quantitative comparison table that would allow readers to verify the trade-off.
- [Abstract / Results] Sound examples are referenced via a URL, but the manuscript should include a table or figure summarizing the objective metrics (e.g., SDR) for the main results to make the claims self-contained.
Simulated Author's Rebuttal
We thank the referee for the thoughtful and constructive comments. We agree that several claims in the abstract require clarification and supporting numbers, and we will revise accordingly. Below we respond point-by-point to the major comments.
- Referee: [Abstract / Experiments] the assertion that the method is 'architecture-agnostic' and that 'the refinement generalizes across backbone architectures' is not supported by the reported experiments. The diffusion model is trained separately on the outputs of each specific separator (U-Net outputs on Slakh2100; BS-RoFormer outputs on MUSDB18), so the results demonstrate per-backbone training rather than zero-shot or cross-backbone transfer of a single refiner.
  Authors: We accept the distinction. Our experiments demonstrate that the proposed refinement pipeline (diffusion + consistency distillation) can be successfully trained and applied to two different separator architectures, but they do not show zero-shot transfer of a single refiner. We will revise the abstract and introduction to replace 'architecture-agnostic' with the more precise phrasing 'applicable to multiple backbone architectures when the refiner is trained on the corresponding separator outputs.'
  Revision: partial
- Referee: [Experiments] because the training distributions are tied to both the backbone outputs and the dataset (Slakh2100 vs. MUSDB18), the setup confounds backbone generalization with dataset shift. A load-bearing test for the central claim would require training the refiner on one backbone/dataset and evaluating it on the other without retraining.
  Authors: The referee correctly identifies the confounding. The current results cannot isolate backbone effects from dataset effects. We will add an explicit limitations paragraph acknowledging that a cross-backbone, cross-dataset transfer experiment was not performed and would be a valuable direction for future work.
  Revision: yes
- Referee: [Abstract] the claim of 'state-of-the-art results' and 'measurable quality gains' is stated without any numerical metrics, baseline SDR/SI-SDR values, or statistical significance tests, making it impossible to assess the magnitude or reliability of the reported improvements from the provided summary.
  Authors: We agree that the abstract should contain concrete numbers. We will insert the key SDR/SI-SDR improvements (with baselines) for both the U-Net/Slakh2100 and BS-RoFormer/MUSDB18 settings, along with a brief statement on the number of runs used for the reported figures.
  Revision: yes
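One possible way to attach a reliability statement to per-track improvements, as the referee asks, is a paired bootstrap over the score differences between refined and base outputs. The sketch below is an assumption about how such a check could look, not a description of what the authors ran; the function name `paired_bootstrap_ci` is hypothetical and the scores are synthetic placeholders, not results from the paper.

```python
import numpy as np

def paired_bootstrap_ci(base_scores, refined_scores, n_boot=2000, alpha=0.05, seed=0):
    """Mean per-track gain with a percentile bootstrap confidence interval.

    Both inputs hold one score (e.g. SDR in dB) per test track, in the same
    track order, so the differences are paired.
    """
    base = np.asarray(base_scores, dtype=float)
    refined = np.asarray(refined_scores, dtype=float)
    if base.shape != refined.shape:
        raise ValueError("score arrays must be paired track by track")
    deltas = refined - base
    rng = np.random.default_rng(seed)
    # Resample tracks with replacement and recompute the mean gain each time.
    idx = rng.integers(0, len(deltas), size=(n_boot, len(deltas)))
    boot_means = deltas[idx].mean(axis=1)
    lo, hi = np.quantile(boot_means, [alpha / 2.0, 1.0 - alpha / 2.0])
    return deltas.mean(), lo, hi

# Synthetic placeholder scores for 50 tracks (NOT results from the paper).
rng = np.random.default_rng(1)
base_scores = rng.normal(10.0, 2.0, size=50)
refined_scores = base_scores + rng.normal(0.5, 0.3, size=50)

mean_gain, lo, hi = paired_bootstrap_ci(base_scores, refined_scores)
```

An interval that excludes zero would support a claim of a consistent gain on that test set; it says nothing about transfer to a different backbone or dataset.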
Circularity Check
Empirical pipeline with no self-referential derivation or fitted predictions
The paper presents an empirical method applying a diffusion model as post-processing refinement to existing separators, followed by consistency distillation for faster inference. Claims of measurable gains and architecture-agnostic behavior rest on separate training runs and evaluations (U-Net on Slakh2100; BS-RoFormer on MUSDB18), with no equations, first-principles derivations, or predictions that reduce by construction to the same fitted parameters or self-citations. No load-bearing self-citation chains, ansatz smuggling, or renaming of known results appear; the work is a standard training-and-evaluation pipeline whose central results are externally falsifiable on held-out audio data.