arxiv: 2412.06965 · v2 · submitted 2024-12-09 · 💻 cs.SD · eess.AS

Improving Music Source Separation with Diffusion and Consistency Refinement

Pith reviewed 2026-05-23 07:30 UTC · model grok-4.3

classification 💻 cs.SD eess.AS
keywords music source separation · diffusion models · consistency distillation · audio refinement · generative models · source separation · Slakh2100 · MUSDB18

The pith

A diffusion model refines outputs from any music source separator and consistency distillation reduces the process to a single step while preserving gains.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to show that a generative diffusion model can serve as a final refinement stage after a deterministic separator, iteratively denoising the separated tracks to raise overall quality. The authors demonstrate this by applying the refinement stage to two different separators, a custom U-Net on Slakh2100 and BS-RoFormer on MUSDB18, and report measurable improvements on both. Because full diffusion sampling is slow, they further distill the model into a consistency model that maintains the quality of the multi-step results in one forward pass and surpasses them with two or more steps. Readers interested in audio editing would care if the same lightweight add-on could be dropped onto existing separators without retraining the base model.

Core claim

The paper claims that training a diffusion model on the separated sources produced by a deterministic separator, then distilling it for consistency, produces a general-purpose refiner that raises separation quality and reaches state-of-the-art scores when attached to either a custom U-Net on Slakh2100 or the BS-RoFormer on MUSDB18, with the distilled version requiring only one inference step.

What carries the argument

The consistency-distilled diffusion refiner: a generative model that takes a base separator's output and applies learned denoising to correct residual interference, distilled so that one step approximates the full iterative process.
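The noise-addition-and-removal idea can be sketched on a toy signal. This is a minimal illustration under loud assumptions, not the authors' implementation: `toy_denoiser` is a hypothetical stand-in for a trained network (it is the posterior mean under a Gaussian prior centred on the known clean signal, which a real refiner would not have access to), and the plain Euler discretization of the probability-flow ODE, the geometric noise schedule, and every parameter value are choices made here for illustration only.

```python
import numpy as np

def toy_denoiser(x, sigma, clean, prior_std=0.05):
    """Hypothetical stand-in for a trained refiner network.

    Returns the posterior mean of the clean signal under the assumed
    prior N(clean, prior_std^2), given x observed at noise level sigma.
    """
    gain = prior_std**2 / (prior_std**2 + sigma**2)
    return clean + gain * (x - clean)

def refine(estimate, denoise, sigma_max=0.5, sigma_min=0.01, n_steps=10, seed=0):
    """Add noise at sigma_max to a separator estimate, then remove it step by step."""
    rng = np.random.default_rng(seed)
    # Geometric schedule from sigma_max to sigma_min, followed by a final step to 0.
    sigmas = np.append(np.geomspace(sigma_max, sigma_min, n_steps), 0.0)
    x = estimate + sigma_max * rng.standard_normal(estimate.shape)
    for sigma, sigma_next in zip(sigmas[:-1], sigmas[1:]):
        d = (x - denoise(x, sigma)) / sigma   # ODE direction at this noise level
        x = x + (sigma_next - sigma) * d      # Euler step towards lower noise
    return x

# Toy data: a clean tone plus residual interference left by an imagined separator.
t = np.arange(16000) / 16000.0
clean = 0.5 * np.sin(2 * np.pi * 220 * t)
estimate = clean + 0.3 * np.sin(2 * np.pi * 330 * t)

refined = refine(estimate, lambda x, s: toy_denoiser(x, s, clean))

def rms(v):
    return float(np.sqrt(np.mean(v**2)))

err_before = rms(estimate - clean)
err_after = rms(refined - clean)
print(f"RMS error before: {err_before:.4f}, after: {err_after:.4f}")
```

Each pass through the loop costs one network evaluation, which is the inference cost that the distillation step is meant to remove.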

If this is right

  • Quality gains appear when the refiner is placed after a U-Net separator on Slakh2100.
  • State-of-the-art scores are reached when the same refiner is placed after BS-RoFormer on MUSDB18.
  • Single-step inference from the distilled model maintains the quality improvement of the original diffusion process.
  • Two or more steps from the distilled model exceed the quality of the undistilled diffusion refiner.
  • No architecture-specific retraining of the refiner is required to obtain the reported gains.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same refinement stage could be tested on other audio tasks such as enhancement or dereverberation that already use deterministic front-ends.
  • If the refiner generalizes across backbones, designers might deliberately train lighter or faster base separators and rely on the distilled stage for final quality.
  • Training the refiner on a mixture of outputs from several different separators might further increase robustness to domain shift.

Load-bearing premise

A diffusion model trained on the outputs of one separator can be applied as a general last-stage refiner to the outputs of other separators without introducing new artifacts or domain-shift problems.

What would settle it

If the distilled refiner applied to a new separator produces lower objective separation scores or audible artifacts relative to the base separator alone, the claim of architecture-agnostic refinement would be falsified.
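The check described above can be sketched as a paired per-track comparison. This is one possible way to operationalize it, not a procedure taken from the paper: the function name `paired_bootstrap_delta`, the bootstrap confidence interval, and the three-way verdict are assumptions, and the score arrays are invented purely for illustration.

```python
import numpy as np

def paired_bootstrap_delta(base_scores, refined_scores, n_boot=2000, seed=0):
    """Mean per-track score change and a 95% bootstrap interval for it."""
    base = np.asarray(base_scores, dtype=float)
    refined = np.asarray(refined_scores, dtype=float)
    if base.shape != refined.shape:
        raise ValueError("scores must be paired per track")
    delta = refined - base
    rng = np.random.default_rng(seed)
    # Resample tracks with replacement and recompute the mean change each time.
    idx = rng.integers(0, delta.size, size=(n_boot, delta.size))
    boot_means = delta[idx].mean(axis=1)
    low, high = np.percentile(boot_means, [2.5, 97.5])
    return float(delta.mean()), float(low), float(high)

# Hypothetical per-track scores in dB, made up for this example only.
base = np.array([5.1, 6.3, 4.8, 7.0, 5.9, 6.1])
refined = np.array([5.6, 6.9, 5.0, 7.4, 6.5, 6.4])

mean_delta, ci_low, ci_high = paired_bootstrap_delta(base, refined)
if ci_high < 0:
    verdict = "refuted"        # refiner reliably lowers the score
elif ci_low > 0:
    verdict = "supported"      # refiner reliably raises the score
else:
    verdict = "inconclusive"
print(f"mean change {mean_delta:+.2f} dB, 95% CI [{ci_low:+.2f}, {ci_high:+.2f}]: {verdict}")
```

Objective scores cover only half of the criterion; audible artifacts would still need a listening test.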

Figures

Figures reproduced from arXiv: 2412.06965 by Mohammad Rasool Izadi, Shlomo Dubnov, Shuo Zhang, Tornike Karchkhadze.

Figure 1: Diagram illustrating our proposed method. (a) First, we train a mixture-conditional deterministic source extraction model. (b) Next, we introduce a denoising score-matching diffusion model, conditioned both on the features extracted by the deterministic model and instrument label, which further enhances extracted audio quality through noise addition and removal.
Figure 2: SI-SDRi Avg. vs Log(σmax) for CD and Diffusion Models across 5 Steps. Each subplot compares the performance of the diffusion model (red-square) and the consistency distillation model (blue-o) across different numbers of denoising steps, with a gray dashed line representing the performance of the deterministic model. The x-axis represents σmax, the starting noise levels for the models, given in a logarithmi…
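For readers unfamiliar with the metric on the y-axis, the following is a minimal sketch of scale-invariant SDR and its improvement over the unprocessed mixture (SI-SDRi), using the standard definition of SI-SDR. The zero-mean step is a common convention assumed here, the function names are illustrative, and the paper's exact evaluation code may differ.

```python
import numpy as np

def si_sdr(estimate, reference, eps=1e-8):
    """Scale-invariant signal-to-distortion ratio in dB."""
    estimate = estimate - estimate.mean()
    reference = reference - reference.mean()
    # Project the estimate onto the reference to find the optimal scaling.
    alpha = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    target = alpha * reference
    distortion = estimate - target
    return 10.0 * np.log10((np.dot(target, target) + eps) / (np.dot(distortion, distortion) + eps))

def si_sdr_improvement(estimate, mixture, reference):
    """SI-SDRi: gain of the estimate over using the raw mixture as the estimate."""
    return si_sdr(estimate, reference) - si_sdr(mixture, reference)

# Toy example: the interference is attenuated by a factor of 10 in amplitude.
sr = 16000
t = np.arange(sr) / sr
reference = np.sin(2 * np.pi * 440 * t)
interference = 0.5 * np.sin(2 * np.pi * 660 * t)
mixture = reference + interference
estimate = reference + 0.1 * interference

gain_db = si_sdr_improvement(estimate, mixture, reference)
print(f"SI-SDRi: {gain_db:.2f} dB")
```

Because the two tones are orthogonal over this window, a tenfold amplitude reduction of the interference corresponds to a gain of about 20 dB, and rescaling the estimate leaves the score unchanged.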
Original abstract

In this work, we propose an approach to music source separation that uses a generative diffusion model as a last-stage refinement on top of a deterministic separator, progressively enhancing the separated sources through iterative denoising. While the diffusion refinement yields measurable quality gains, it requires iterative steps at inference, increasing computational cost. To speed up the inference process, we apply consistency distillation, reducing inference to a single step while maintaining quality; with two or more steps, the distilled model even surpasses the diffusion-based approach. Crucially, our method is architecture-agnostic: we demonstrate state-of-the-art results when applied to both a custom U-Net-based separator on Slakh2100 and the state-of-the-art BS-RoFormer model on MUSDB18, showing that the refinement generalizes across backbone architectures. Sound examples are available at: https://consistency-separation.github.io/.
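The single-step and few-step inference described in the abstract can be sketched with the generic multistep consistency sampling scheme (apply the consistency function once, then optionally re-noise to a lower level and apply it again). Everything concrete here is an assumption for illustration: `toy_consistency_fn` is a hypothetical stand-in for the distilled network (the exact probability-flow endpoint map for a Gaussian prior centred on the known clean signal), and the noise levels are arbitrary.

```python
import numpy as np

EPS = 0.002  # smallest noise level; an assumed value for this sketch

def toy_consistency_fn(x, sigma, clean, prior_std=0.05):
    """Hypothetical stand-in for a distilled model.

    Maps a sample at noise level sigma directly to the end of its
    probability-flow trajectory, assuming the prior N(clean, prior_std^2).
    """
    scale = np.sqrt((prior_std**2 + EPS**2) / (prior_std**2 + sigma**2))
    return clean + scale * (x - clean)

def consistency_refine(estimate, f, sigmas, seed=0):
    """One network call per entry in sigmas; len(sigmas) == 1 is single-step."""
    rng = np.random.default_rng(seed)
    # Start from the separator estimate perturbed at the highest noise level.
    x = f(estimate + sigmas[0] * rng.standard_normal(estimate.shape), sigmas[0])
    for sigma in sigmas[1:]:
        # Re-noise to a lower level, then jump back to a clean estimate.
        x_noisy = x + np.sqrt(sigma**2 - EPS**2) * rng.standard_normal(x.shape)
        x = f(x_noisy, sigma)
    return x

t = np.arange(16000) / 16000.0
clean = 0.5 * np.sin(2 * np.pi * 220 * t)
estimate = clean + 0.3 * np.sin(2 * np.pi * 330 * t)

def f(x, s):
    return toy_consistency_fn(x, s, clean)

def rms(v):
    return float(np.sqrt(np.mean(v**2)))

one_step = consistency_refine(estimate, f, sigmas=[0.5])
two_step = consistency_refine(estimate, f, sigmas=[0.5, 0.1])
err_before = rms(estimate - clean)
err_one = rms(one_step - clean)
err_two = rms(two_step - clean)
print(f"before: {err_before:.4f}, one step: {err_one:.4f}, two steps: {err_two:.4f}")
```

The design point is the cost model: the number of network evaluations equals the length of `sigmas`, so the single-step case replaces the whole iterative loop of a diffusion sampler with one call.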

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes a post-processing pipeline for music source separation that applies a generative diffusion model to iteratively refine the outputs of a deterministic separator, followed by consistency distillation to reduce inference to one or a few steps. It reports measurable quality improvements and claims the method is architecture-agnostic, achieving state-of-the-art results when applied to a custom U-Net on Slakh2100 and to BS-RoFormer on MUSDB18.

Significance. If the generalization claim holds, the approach would offer a modular, architecture-independent refinement stage that can be distilled for fast inference, potentially improving a range of existing separators with limited added cost. The consistency-distillation component addresses a practical inference bottleneck and is a clear technical contribution if the quality retention is rigorously quantified.

major comments (3)
  1. [Abstract / Experiments] Abstract and experimental claims: the assertion that the method is 'architecture-agnostic' and that 'the refinement generalizes across backbone architectures' is not supported by the reported experiments. The diffusion model is trained separately on the outputs of each specific separator (U-Net outputs on Slakh2100; BS-RoFormer outputs on MUSDB18), so the results demonstrate per-backbone training rather than zero-shot or cross-backbone transfer of a single refiner.
  2. [Experiments] Experimental design (results section): because the training distributions are tied to both the backbone outputs and the dataset (Slakh2100 vs. MUSDB18), the setup confounds backbone generalization with dataset shift. A load-bearing test for the central claim would require training the refiner on one backbone/dataset and evaluating it on the other without retraining.
  3. [Abstract] Abstract: the claim of 'state-of-the-art results' and 'measurable quality gains' is stated without any numerical metrics, baseline SDR/SI-SDR values, or statistical significance tests, making it impossible to assess the magnitude or reliability of the reported improvements from the provided summary.
minor comments (2)
  1. [Abstract] The abstract states that consistency distillation 'maintains quality' or 'surpasses' the diffusion baseline with two or more steps, but does not specify the exact number of steps, the distillation loss, or the quantitative comparison table that would allow readers to verify the trade-off.
  2. [Abstract / Results] Sound examples are referenced via a URL, but the manuscript should include a table or figure summarizing the objective metrics (e.g., SDR) for the main results to make the claims self-contained.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the thoughtful and constructive comments. We agree that several claims in the abstract require clarification and supporting numbers, and we will revise accordingly. Below we respond point-by-point to the major comments.

Point-by-point responses
  1. Referee: [Abstract / Experiments] the assertion that the method is 'architecture-agnostic' and that 'the refinement generalizes across backbone architectures' is not supported by the reported experiments. The diffusion model is trained separately on the outputs of each specific separator (U-Net outputs on Slakh2100; BS-RoFormer outputs on MUSDB18), so the results demonstrate per-backbone training rather than zero-shot or cross-backbone transfer of a single refiner.

    Authors: We accept the distinction. Our experiments demonstrate that the proposed refinement pipeline (diffusion + consistency distillation) can be successfully trained and applied to two different separator architectures, but they do not show zero-shot transfer of a single refiner. We will revise the abstract and introduction to replace 'architecture-agnostic' with the more precise phrasing 'applicable to multiple backbone architectures when the refiner is trained on the corresponding separator outputs.' revision: partial

  2. Referee: [Experiments] because the training distributions are tied to both the backbone outputs and the dataset (Slakh2100 vs. MUSDB18), the setup confounds backbone generalization with dataset shift. A load-bearing test for the central claim would require training the refiner on one backbone/dataset and evaluating it on the other without retraining.

    Authors: The referee correctly identifies the confounding. The current results cannot isolate backbone effects from dataset effects. We will add an explicit limitations paragraph acknowledging that a cross-backbone, cross-dataset transfer experiment was not performed and would be a valuable direction for future work. revision: yes

  3. Referee: [Abstract] the claim of 'state-of-the-art results' and 'measurable quality gains' is stated without any numerical metrics, baseline SDR/SI-SDR values, or statistical significance tests, making it impossible to assess the magnitude or reliability of the reported improvements from the provided summary.

    Authors: We agree that the abstract should contain concrete numbers. We will insert the key SDR/SI-SDR improvements (with baselines) for both the U-Net/Slakh2100 and BS-RoFormer/MUSDB18 settings, along with a brief statement on the number of runs used for the reported figures. revision: yes

Circularity Check

0 steps flagged

Empirical pipeline with no self-referential derivation or fitted predictions

full rationale

The paper presents an empirical method applying a diffusion model as post-processing refinement to existing separators, followed by consistency distillation for faster inference. Claims of measurable gains and architecture-agnostic behavior rest on separate training runs and evaluations (U-Net on Slakh2100; BS-RoFormer on MUSDB18), with no equations, first-principles derivations, or predictions that reduce by construction to the same fitted parameters or self-citations. No load-bearing self-citation chains, ansatz smuggling, or renaming of known results appear; the work is a standard training-and-evaluation pipeline whose central results are externally falsifiable on held-out audio data.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no explicit free parameters, axioms, or invented entities; typical diffusion training involves many implicit hyperparameters whose values are not reported.

pith-pipeline@v0.9.0 · 5685 in / 983 out tokens · 15027 ms · 2026-05-23T07:30:10.675977+00:00 · methodology

