Improving Music Source Separation with Diffusion and Consistency Refinement

Mohammad Rasool Izadi; Shlomo Dubnov; Shuo Zhang; Tornike Karchkhadze

arxiv: 2412.06965 · v2 · submitted 2024-12-09 · 💻 cs.SD · eess.AS

Pith reviewed 2026-05-23 07:30 UTC · model grok-4.3

keywords: music source separation, diffusion models, consistency distillation, audio refinement, generative models, source separation, Slakh2100, MUSDB18

The pith

A diffusion model refines outputs from any music source separator and consistency distillation reduces the process to a single step while preserving gains.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to show that a generative diffusion model can serve as a final refinement stage after any deterministic separator, iteratively denoising the separated tracks to raise overall quality. The authors demonstrate this by training the refiner on the outputs of one separator and then testing the same model on a completely different architecture, obtaining measurable improvements on both Slakh2100 and MUSDB18. Because full diffusion sampling is slow, they further distill the model into a consistency model that matches or exceeds the multi-step results in one forward pass. Readers interested in audio editing would care if the same lightweight add-on could be dropped onto existing separators without retraining the base model.

Core claim

The paper claims that training a diffusion model on the separated sources produced by a deterministic separator, then distilling it for consistency, produces a general-purpose refiner that raises separation quality and reaches state-of-the-art scores when attached to either a custom U-Net on Slakh2100 or the BS-RoFormer on MUSDB18, with the distilled version requiring only one inference step.
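For readers unfamiliar with the distillation step, the general form of consistency distillation from the Consistency Models literature (Song et al., 2023) is sketched below. This is the standard objective written over noise levels $\sigma_n$; the notation, weighting, and distance function are illustrative assumptions, and this page does not confirm which exact loss the paper uses.

```latex
% Consistency function f_theta and distillation objective (general form).
% x_{\sigma_{n+1}} = x + \sigma_{n+1} z with z ~ N(0, I); \theta^- is an
% exponential moving average of \theta; \Phi is the update direction of an
% ODE solver driven by the pretrained diffusion (teacher) model \phi.
\begin{aligned}
& f_\theta(x_{\sigma_{\min}}, \sigma_{\min}) = x_{\sigma_{\min}}
  && \text{(boundary condition)} \\
& \hat{x}_{\sigma_n} = x_{\sigma_{n+1}} + (\sigma_n - \sigma_{n+1})\,
  \Phi(x_{\sigma_{n+1}}, \sigma_{n+1}; \phi)
  && \text{(one teacher ODE step)} \\
& \mathcal{L}_{\mathrm{CD}}(\theta, \theta^-; \phi) =
  \mathbb{E}\!\left[ \lambda(\sigma_n)\,
  d\!\left( f_\theta(x_{\sigma_{n+1}}, \sigma_{n+1}),\;
  f_{\theta^-}(\hat{x}_{\sigma_n}, \sigma_n) \right) \right]
  && \text{(distillation loss)}
\end{aligned}
```

Minimizing this loss pushes the student to give the same output for two adjacent points on one teacher trajectory, which is what lets a single evaluation of $f_\theta$ stand in for the full iterative sampler.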

What carries the argument

The consistency-distilled diffusion refiner: a generative model that takes a base separator's output and applies learned denoising to correct residual interference, distilled so that one step approximates the full iterative process.
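The refinement loop can be pictured as partial noising of the separator output followed by iterative denoising. The sketch below is a minimal illustration under stated assumptions, not the paper's implementation: `toy_denoiser` is a hypothetical stand-in that is told the clean target (a real refiner would be a trained network conditioned on the separator's features and the instrument label), and the deterministic Euler sampler and geometric noise schedule are assumed choices.

```python
import numpy as np

def toy_denoiser(x, sigma, clean, prior_std=0.05):
    """Hypothetical denoiser: posterior mean under a N(clean, prior_std^2) prior.

    It is handed the clean target only so the example is self-contained.
    """
    w = prior_std**2 / (prior_std**2 + sigma**2)
    return clean + w * (x - clean)

def refine(estimate, denoiser, sigma_max=0.5, sigma_min=1e-3, steps=5, rng=None):
    """Add noise at level sigma_max to a separator output, then denoise it
    with Euler steps along the probability-flow ODE dx/dsigma = (x - D(x, sigma)) / sigma."""
    rng = np.random.default_rng(0) if rng is None else rng
    # Decreasing noise levels, ending exactly at zero.
    sigmas = np.append(np.geomspace(sigma_max, sigma_min, steps), 0.0)
    x = estimate + sigma_max * rng.standard_normal(estimate.shape)
    for s_cur, s_next in zip(sigmas[:-1], sigmas[1:]):
        direction = (x - denoiser(x, s_cur)) / s_cur
        x = x + (s_next - s_cur) * direction
    return x

# Usage: a sine "source" with residual interference from an imperfect separator.
rng = np.random.default_rng(1)
clean = np.sin(np.linspace(0, 20, 4000))
estimate = clean + 0.1 * rng.standard_normal(clean.shape)
refined = refine(estimate, lambda x, s: toy_denoiser(x, s, clean))
```

The starting noise level `sigma_max` controls how much of the separator output is overwritten: small values keep the estimate nearly intact, large values let the generative model redraw more of the signal.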

If this is right

Quality gains appear when the refiner is placed after a U-Net separator on Slakh2100.
State-of-the-art scores are reached when the same refiner is placed after BS-RoFormer on MUSDB18.
Single-step inference from the distilled model maintains the quality improvement of the original diffusion process.
Two or more steps from the distilled model exceed the quality of the undistilled diffusion refiner.
No architecture-specific retraining of the refiner is required to obtain the reported gains.
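The one-step and few-step claims in this list correspond to the multistep consistency sampling procedure described by Song et al. (2023): map the noised input to a clean estimate in one call, then optionally re-noise to a smaller level and map again. The sketch below is illustrative only. `toy_consistency_fn` is a hypothetical closed-form consistency function for Gaussian data centred on a known clean signal, and the noise levels are assumed values, not the paper's settings.

```python
import numpy as np

EPS = 1e-3  # smallest noise level, where the consistency function is the identity

def toy_consistency_fn(x, sigma, clean, prior_std=0.05):
    """Hypothetical consistency function for data distributed as N(clean, prior_std^2).

    Along the probability-flow ODE, (x - clean) scales with
    sqrt(prior_std^2 + sigma^2), so the trajectory origin is a rescaling.
    """
    scale = np.sqrt(prior_std**2 + EPS**2) / np.sqrt(prior_std**2 + sigma**2)
    return clean + scale * (x - clean)

def consistency_refine(estimate, f, sigma_max=0.5, extra_sigmas=(), rng=None):
    """One-step refinement, plus optional extra steps at decreasing noise levels."""
    rng = np.random.default_rng(0) if rng is None else rng
    x = estimate + sigma_max * rng.standard_normal(estimate.shape)
    x = f(x, sigma_max)  # single forward pass
    for tau in extra_sigmas:
        # Re-noise the current estimate to level tau, then map back again.
        x_tau = x + np.sqrt(tau**2 - EPS**2) * rng.standard_normal(x.shape)
        x = f(x_tau, tau)
    return x

# Usage: compare one-step and two-step refinement of a noisy separator output.
rng = np.random.default_rng(1)
clean = np.sin(np.linspace(0, 20, 4000))
estimate = clean + 0.1 * rng.standard_normal(clean.shape)
f = lambda x, s: toy_consistency_fn(x, s, clean)
one_step = consistency_refine(estimate, f)
two_step = consistency_refine(estimate, f, extra_sigmas=(0.1,))
```

In this Gaussian toy case the extra step does not change the expected error, so the example shows only the mechanics of multistep sampling and says nothing about the quality gain the paper reports for two or more steps.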

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same refinement stage could be tested on other audio tasks such as enhancement or dereverberation that already use deterministic front-ends.
If the refiner generalizes across backbones, designers might deliberately train lighter or faster base separators and rely on the distilled stage for final quality.
Training the refiner on a mixture of outputs from several different separators might further increase robustness to domain shift.

Load-bearing premise

A diffusion model trained on the outputs of one separator can be applied as a general last-stage refiner to the outputs of other separators without introducing new artifacts or domain-shift problems.

What would settle it

If the distilled refiner applied to a new separator produces lower objective separation scores or audible artifacts relative to the base separator alone, the claim of architecture-agnostic refinement would be falsified.
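The objective half of this check can be made concrete with SI-SDR, the metric family named in the Figure 2 caption (SI-SDRi). The sketch below assumes mono signals as 1-D NumPy arrays and removes the mean before scoring, which is a common convention rather than something stated on this page. The helper `refiner_helps` and its `margin_db` threshold are hypothetical names introduced for illustration.

```python
import numpy as np

def si_sdr(estimate, reference, eps=1e-8):
    """Scale-invariant signal-to-distortion ratio in dB."""
    estimate = estimate - estimate.mean()
    reference = reference - reference.mean()
    # Project the estimate onto the reference to get the scaled target.
    scale = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    target = scale * reference
    noise = estimate - target
    return 10 * np.log10((np.dot(target, target) + eps) / (np.dot(noise, noise) + eps))

def si_sdr_improvement(estimate, reference, mixture):
    """SI-SDRi: gain of the estimate over using the raw mixture as the estimate."""
    return si_sdr(estimate, reference) - si_sdr(mixture, reference)

def refiner_helps(base_out, refined_out, reference, margin_db=0.0):
    """True if the refined output scores above the base separator output."""
    return si_sdr(refined_out, reference) - si_sdr(base_out, reference) > margin_db

# Usage with synthetic signals standing in for real stems.
rng = np.random.default_rng(0)
reference = rng.standard_normal(16000)
interference = rng.standard_normal(16000)
mixture = reference + interference
base_out = reference + 0.3 * interference
refined_out = reference + 0.1 * interference
helps = refiner_helps(base_out, refined_out, reference)
```

A falsification run would apply the same comparison to a separator the refiner never saw during training, and a consistently negative difference there would count against the architecture-agnostic claim.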

Figures

Figures reproduced from arXiv: 2412.06965 by Mohammad Rasool Izadi, Shlomo Dubnov, Shuo Zhang, Tornike Karchkhadze.

**Figure 1.** Diagram illustrating our proposed method. (a) First, we train a mixture-conditional deterministic source extraction model. (b) Next, we introduce a denoising score-matching diffusion model, conditioned on both the features extracted by the deterministic model and the instrument label, which further enhances extracted audio quality through noise addition and removal.

**Figure 2.** SI-SDRi Avg. vs Log(σmax) for CD and Diffusion Models across 5 Steps. Each subplot compares the performance of the diffusion model (red-square) and the consistency distillation model (blue-o) across different numbers of denoising steps, with a gray dashed line representing the performance of the deterministic model. The x-axis represents σmax, the starting noise levels for the models, given in a logarithmi…

Original abstract

In this work, we propose an approach to music source separation that uses a generative diffusion model as a last-stage refinement on top of a deterministic separator, progressively enhancing the separated sources through iterative denoising. While the diffusion refinement yields measurable quality gains, it requires iterative steps at inference, increasing computational cost. To speed up the inference process, we apply consistency distillation, reducing inference to a single step while maintaining quality; with two or more steps, the distilled model even surpasses the diffusion-based approach. Crucially, our method is architecture-agnostic: we demonstrate state-of-the-art results when applied to both a custom U-Net-based separator on Slakh2100 and the state-of-the-art BS-RoFormer model on MUSDB18, showing that the refinement generalizes across backbone architectures. Sound examples are available at: https://consistency-separation.github.io/.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper adds a diffusion refinement stage plus consistency distillation to music source separation and reports gains on two backbones, but the architecture-agnostic claim is not backed by cross-backbone tests since the refiner is trained separately on different datasets.

The main point is a modular last-stage refiner: run a deterministic separator, then apply a diffusion model for iterative denoising to clean up the sources, and distill it to a consistency model so inference drops to one step while quality holds or improves. They test this after a custom U-Net on Slakh2100 and after BS-RoFormer on MUSDB18, claiming SOTA in both cases.

The distillation step is the concrete addition here; it is not a routine extension of the cited diffusion work in separation. The paper does well by keeping the backbone untouched and by supplying listening examples so readers can judge the perceptual difference directly. The experimental pipeline is described clearly enough to reproduce the training flow.

The soft spot is the architecture-agnostic claim. The refiner is trained on the specific outputs of each separator and on mismatched datasets, so there is no demonstration that one distilled model can be dropped onto a new backbone without retraining or domain shift. That directly undercuts the strongest assertion in the abstract. The rest of the method looks reproducible and the gains appear real on the reported setups, but the generalization argument needs either a shared refiner experiment or more cautious wording.

This is for people already working on music source separation who want a practical add-on rather than a new backbone. A reader in that subfield can extract the distillation recipe and try it. It deserves peer review because the core technique is grounded in experiments and the limitations are fixable with targeted additions rather than a rewrite.

Referee Report

3 major / 2 minor

Summary. The paper proposes a post-processing pipeline for music source separation that applies a generative diffusion model to iteratively refine the outputs of a deterministic separator, followed by consistency distillation to reduce inference to one or a few steps. It reports measurable quality improvements and claims the method is architecture-agnostic, achieving state-of-the-art results when applied to a custom U-Net on Slakh2100 and to BS-RoFormer on MUSDB18.

Significance. If the generalization claim holds, the approach would offer a modular, architecture-independent refinement stage that can be distilled for fast inference, potentially improving a range of existing separators with limited added cost. The consistency-distillation component addresses a practical inference bottleneck and is a clear technical contribution if the quality retention is rigorously quantified.

major comments (3)

[Abstract / Experiments] Abstract and experimental claims: the assertion that the method is 'architecture-agnostic' and that 'the refinement generalizes across backbone architectures' is not supported by the reported experiments. The diffusion model is trained separately on the outputs of each specific separator (U-Net outputs on Slakh2100; BS-RoFormer outputs on MUSDB18), so the results demonstrate per-backbone training rather than zero-shot or cross-backbone transfer of a single refiner.
[Experiments] Experimental design (results section): because the training distributions are tied to both the backbone outputs and the dataset (Slakh2100 vs. MUSDB18), the setup confounds backbone generalization with dataset shift. A load-bearing test for the central claim would require training the refiner on one backbone/dataset and evaluating it on the other without retraining.
[Abstract] Abstract: the claim of 'state-of-the-art results' and 'measurable quality gains' is stated without any numerical metrics, baseline SDR/SI-SDR values, or statistical significance tests, making it impossible to assess the magnitude or reliability of the reported improvements from the provided summary.
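The cross-backbone test requested in the second major comment amounts to filling in a small evaluation grid in which every refiner is applied to every separator without retraining. The sketch below shows one possible shape for that protocol. All names (`cross_backbone_gains`, `snr_db`, the toy separators and refiners) are hypothetical, and a plain SNR is used as a stand-in metric to keep the example short.

```python
import numpy as np

def snr_db(estimate, reference):
    """Stand-in quality metric: signal-to-noise ratio in dB."""
    err = estimate - reference
    return 10 * np.log10(np.sum(reference**2) / (np.sum(err**2) + 1e-12))

def cross_backbone_gains(separators, refiners, mixtures, references, metric=snr_db):
    """Return {(refiner_trained_on, separator_applied_to): mean metric gain in dB}.

    Diagonal entries reproduce the matched setting; off-diagonal entries are
    the transfer test, with no retraining of the refiner.
    """
    gains = {}
    for applied_to, separate in separators.items():
        base = [separate(m) for m in mixtures]
        base_score = np.mean([metric(b, r) for b, r in zip(base, references)])
        for trained_on, refine in refiners.items():
            refined_score = np.mean(
                [metric(refine(b), r) for b, r in zip(base, references)]
            )
            gains[(trained_on, applied_to)] = refined_score - base_score
    return gains

# Toy usage: each "separator" attenuates the source, each "refiner" undoes
# the attenuation of the separator it was matched to.
mixtures = [np.linspace(-1, 1, 100), np.cos(np.linspace(0, 6, 100))]
references = mixtures
separators = {"A": lambda m: 0.9 * m, "B": lambda m: 0.8 * m}
refiners = {"A": lambda x: x / 0.9, "B": lambda x: x / 0.8}
gains = cross_backbone_gains(separators, refiners, mixtures, references)
```

In this toy grid the refiner matched to separator B makes separator A's output worse, which is the kind of off-diagonal failure the referee's proposed experiment is designed to expose or rule out.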

minor comments (2)

[Abstract] The abstract states that consistency distillation 'maintains quality' or 'surpasses' the diffusion baseline with two or more steps, but does not specify the exact number of steps, the distillation loss, or the quantitative comparison table that would allow readers to verify the trade-off.
[Abstract / Results] Sound examples are referenced via a URL, but the manuscript should include a table or figure summarizing the objective metrics (e.g., SDR) for the main results to make the claims self-contained.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the thoughtful and constructive comments. We agree that several claims in the abstract require clarification and supporting numbers, and we will revise accordingly. Below we respond point-by-point to the major comments.

Referee: [Abstract / Experiments] the assertion that the method is 'architecture-agnostic' and that 'the refinement generalizes across backbone architectures' is not supported by the reported experiments. The diffusion model is trained separately on the outputs of each specific separator (U-Net outputs on Slakh2100; BS-RoFormer outputs on MUSDB18), so the results demonstrate per-backbone training rather than zero-shot or cross-backbone transfer of a single refiner.

Authors: We accept the distinction. Our experiments demonstrate that the proposed refinement pipeline (diffusion + consistency distillation) can be successfully trained and applied to two different separator architectures, but they do not show zero-shot transfer of a single refiner. We will revise the abstract and introduction to replace 'architecture-agnostic' with the more precise phrasing 'applicable to multiple backbone architectures when the refiner is trained on the corresponding separator outputs.'
Revision: partial

Referee: [Experiments] because the training distributions are tied to both the backbone outputs and the dataset (Slakh2100 vs. MUSDB18), the setup confounds backbone generalization with dataset shift. A load-bearing test for the central claim would require training the refiner on one backbone/dataset and evaluating it on the other without retraining.

Authors: The referee correctly identifies the confounding. The current results cannot isolate backbone effects from dataset effects. We will add an explicit limitations paragraph acknowledging that a cross-backbone, cross-dataset transfer experiment was not performed and would be a valuable direction for future work.
Revision: yes

Referee: [Abstract] the claim of 'state-of-the-art results' and 'measurable quality gains' is stated without any numerical metrics, baseline SDR/SI-SDR values, or statistical significance tests, making it impossible to assess the magnitude or reliability of the reported improvements from the provided summary.

Authors: We agree that the abstract should contain concrete numbers. We will insert the key SDR/SI-SDR improvements (with baselines) for both the U-Net/Slakh2100 and BS-RoFormer/MUSDB18 settings, along with a brief statement on the number of runs used for the reported figures.
Revision: yes

Circularity Check

0 steps flagged

Empirical pipeline with no self-referential derivation or fitted predictions

The paper presents an empirical method applying a diffusion model as post-processing refinement to existing separators, followed by consistency distillation for faster inference. Claims of measurable gains and architecture-agnostic behavior rest on separate training runs and evaluations (U-Net on Slakh2100; BS-RoFormer on MUSDB18), with no equations, first-principles derivations, or predictions that reduce by construction to the same fitted parameters or self-citations. No load-bearing self-citation chains, ansatz smuggling, or renaming of known results appear; the work is a standard training-and-evaluation pipeline whose central results are externally falsifiable on held-out audio data.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no explicit free parameters, axioms, or invented entities; typical diffusion training involves many implicit hyperparameters whose values are not reported.

Reference graph

Works this paper leans on

64 extracted references · 64 canonical work pages

[1]

write newline

" write newline "" before.all 'output.state := FUNCTION n.dashify 't := "" t empty not t #1 #1 substring "-" = t #1 #2 substring "--" = not "--" * t #2 global.max substring 't := t #1 #1 substring "-" = "-" * t #2 global.max substring 't := while if t #1 #1 substring * t #2 global.max substring 't := if while FUNCTION format.date year duplicate empty "emp...

work page
[2]

Auditory scene analysis

Albert S Bregman. Auditory scene analysis. In Proceedings of the 7th International Conference on Pattern Recognition, pp.\ 168--175. Citeseer, 1984

work page 1984
[3]

Some experiments on the recognition of speech, with one and with two ears

E Colin Cherry. Some experiments on the recognition of speech, with one and with two ears. The Journal of the acoustical society of America, 25 0 (5): 0 975--979, 1953

work page 1953
[4]

Lasaft: Latent source attentive frequency transformation for conditioned source separation

Woosung Choi, Minseok Kim, Jaehwa Chung, and Soonyoung Jung. Lasaft: Latent source attentive frequency transformation for conditioned source separation. In ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp.\ 171--175. IEEE, 2021

work page 2021
[5]

Hybrid spectrogram and waveform source separation

Alexandre D \'e fossez. Hybrid spectrogram and waveform source separation. In Proceedings of the ISMIR 2021 Workshop on Music Source Separation, 2021

work page 2021
[6]

Music source separation in the waveform domain

Alexandre D \'e fossez, Nicolas Usunier, L \'e on Bottou, and Francis Bach. Music source separation in the waveform domain. arXiv preprint arXiv:1911.13254, 2019

work page arXiv 1911
[7]

Hakan Erdogan, Scott Wisdom, Xuankai Chang, Zal \' a n Borsos, Marco Tagliasacchi, Neil Zeghidour, and John R. Hershey. Tokensplit: Using discrete speech representations for direct, refined, and transcript-conditioned speech separation and recognition. In 24th Annual Conference of the International Speech Communication Association, Interspeech 2023, Dubli...

work page 2023
[8]

Goodfellow, Jean Pouget - Abadie, Mehdi Mirza, Bing Xu, David Warde - Farley, Sherjil Ozair, Aaron C

Ian J. Goodfellow, Jean Pouget - Abadie, Mehdi Mirza, Bing Xu, David Warde - Farley, Sherjil Ozair, Aaron C. Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems 2014, December 8-13 2014, Montreal, Quebec, Canada, pp.\ 2672--2680, 2014

work page 2014
[9]

Grais, Mehmet Umut Sen, and Hakan Erdogan

Emad M. Grais, Mehmet Umut Sen, and Hakan Erdogan. Deep neural networks for single channel source separation. In IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2014, Florence, Italy, May 4-9, 2014 , pp.\ 3734--3738. IEEE , 2014

work page 2014
[10]

On loss functions and evaluation metrics for music source separation

Enric Gus \'o , Jordi Pons, Santiago Pascual, and Joan Serr \`a . On loss functions and evaluation metrics for music source separation. In ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp.\ 306--310. IEEE, 2022

work page 2022
[11]

Spleeter: a fast and efficient music source separation tool with pre-trained models

Romain Hennequin, Anis Khlif, F \' e lix Voituret, and Manuel Moussallam. Spleeter: a fast and efficient music source separation tool with pre-trained models. J. Open Source Softw., 5 0 (56): 0 2154, 2020

work page 2020
[12]

Diffusion-based signal refiner for speech separation

Masato Hirano, Kazuki Shimada, Yuichiro Koyama, Shusuke Takahashi, and Yuki Mitsufuji. Diffusion-based signal refiner for speech separation. arXiv preprint arXiv:2305.05857, 2023

work page arXiv 2023
[13]

Denoising diffusion probabilistic models

Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. In Proceedings of the 34th International Conference on Neural Information Processing Systems, pp.\ 6840--6851, 2020

work page 2020
[14]

Davis: High-quality audio-visual separation with generative diffusion models

Chaorui Huang, Susan Liang, Yapeng Tian, Anurag Kumar, and Chenliang Xu. Davis: High-quality audio-visual separation with generative diffusion models. arXiv:2308.00122, 2023

work page arXiv 2023
[15]

Parallel and flexible sampling from autoregressive models via langevin dynamics

Vivek Jayaram and John Thickstun. Parallel and flexible sampling from autoregressive models via langevin dynamics. In Proc. ICML, pp.\ 4807--4818. PMLR, 2021

work page 2021
[16]

Simultaneous music separation and generation using multi-track latent diffusion models

Tornike Karchkhadze, Mohammad Rasool Izadi, and Shlomo Dubnov. Simultaneous music separation and generation using multi-track latent diffusion models. arXiv preprint arXiv:2409.12346, 2024

work page arXiv 2024
[17]

Elucidating the design space of diffusion-based generative models

Tero Karras, Miika Aittala, Timo Aila, and Samuli Laine. Elucidating the design space of diffusion-based generative models. In Advances in Neural Information Processing Systems, 2022

work page 2022
[18]

Universal sound separation

Ilya Kavalerov, Scott Wisdom, Hakan Erdogan, Brian Patton, Kevin Wilson, Jonathan Le Roux, and John R Hershey. Universal sound separation. In 2019 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pp.\ 175--179. IEEE, 2019

work page 2019
[19]

Consistency trajectory models: Learning probability flow ODE trajectory of diffusion

Dongjun Kim, Chieh - Hsin Lai, Wei - Hsiang Liao, Naoki Murata, Yuhta Takida, Toshimitsu Uesaka, Yutong He, Yuki Mitsufuji, and Stefano Ermon. Consistency trajectory models: Learning probability flow ODE trajectory of diffusion. In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024 , 2024

work page 2024
[20]

Qiuqiang Kong, Yong Xu, Wenwu Wang, Philip J. B. Jackson, and Mark D. Plumbley. Single-channel signal separation and deconvolution with generative adversarial networks. In Proc. IJCAI, pp.\ 2747–2753. AAAI Press, 2019. ISBN 9780999241141

work page 2019
[21]

Storm: A diffusion-based stochastic regeneration model for speech enhancement and dereverberation

Jean - Marie Lemercier, Julius Richter, Simon Welker, and Timo Gerkmann. Storm: A diffusion-based stochastic regeneration model for speech enhancement and dereverberation. IEEE ACM Trans. Audio Speech Lang. Process. , 31: 0 2724--2737, 2023 a

work page 2023
[22]

Wind noise reduction with a diffusion-based stochastic regeneration model

Jean-Marie Lemercier, Joachim Thiemann, Raphael Koning, and Timo Gerkmann. Wind noise reduction with a diffusion-based stochastic regeneration model. In Speech Communication; 15th ITG Conference, pp.\ 116--120, 2023 b

work page 2023
[23]

Denoising auto-encoder with recurrent skip connections and residual regression for music source separation

Jen - Yu Liu and Yi - Hsuan Yang. Denoising auto-encoder with recurrent skip connections and residual regression for music source separation. In M. Arif Wani, Mehmed M. Kantardzic, Moamar Sayed Mouchaweh, Jo \ a o Gama, and Edwin Lughofer (eds.), 17th IEEE International Conference on Machine Learning and Applications, ICMLA 2018, Orlando, FL, USA, Decembe...

work page 2018
[24]

On the variance of the adaptive learning rate and beyond

Liyuan Liu, Haoming Jiang, Pengcheng He, Weizhu Chen, Xiaodong Liu, Jianfeng Gao, and Jiawei Han. On the variance of the adaptive learning rate and beyond. In ICLR, 2020

work page 2020
[25]

End-to-end music source separation: Is it possible in the waveform domain? In INTERSPEECH, pp.\ 4619--4623, 2019

Francesc Lluís, Jordi Pons, and Xavier Serra. End-to-end music source separation: Is it possible in the waveform domain? In INTERSPEECH, pp.\ 4619--4623, 2019

work page 2019
[26]

Dpm-solver: A fast ODE solver for diffusion probabilistic model sampling in around 10 steps

Cheng Lu, Yuhao Zhou, Fan Bao, Jianfei Chen, Chongxuan Li, and Jun Zhu. Dpm-solver: A fast ODE solver for diffusion probabilistic model sampling in around 10 steps. In NeurIPS, 2022

work page 2022
[27]

Conv-tasnet: Surpassing ideal time--frequency magnitude masking for speech separation

Yi Luo and Nima Mesgarani. Conv-tasnet: Surpassing ideal time--frequency magnitude masking for speech separation. IEEE/ACM transactions on audio, speech, and language processing, 27 0 (8): 0 1256--1266, 2019

work page 2019
[28]

Separate and diffuse: Using a pretrained diffusion model for better source separation

Shahar Lutati, Eliya Nachmani, and Lior Wolf. Separate and diffuse: Using a pretrained diffusion model for better source separation. In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024 , 2024

work page 2024
[29]

Cutting music source separation some Slakh : A dataset to study the impact of training data quality and quantity

Ethan Manilow, Gordon Wichern, Prem Seetharaman, and Jonathan Le Roux. Cutting music source separation some Slakh : A dataset to study the impact of training data quality and quantity. In Proc. IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA). IEEE, 2019

work page 2019
[30]

Improving source separation by explicitly modeling dependencies between sources

Ethan Manilow, Curtis Hawthorne, Cheng-Zhi Anna Huang, Bryan Pardo, and Jesse Engel. Improving source separation by explicitly modeling dependencies between sources. In ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp.\ 291--295. IEEE, 2022

work page 2022
[31]

Multi-source diffusion models for simultaneous music generation and separation

Giorgio Mariani, Irene Tallini, Emilian Postolache, Michele Mancusi, Luca Cosmo, and Emanuele Rodol \` a . Multi-source diffusion models for simultaneous music generation and separation. In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024 , 2024

work page 2024
[32]

Hearing musical streams

Stephen McAdams and Albert Bregman. Hearing musical streams. Computer Music Journal, pp.\ 26--60, 1979

work page 1979
[33]

B.C.J. Moore. An Introduction to the Psychology of Hearing. Emerald, 2012. ISBN 9781780520384

work page 2012
[34]

Thiagarajan, Rushil Anirudh, and Andreas Spanias

Vivek Narayanaswamy, Jayaraman J. Thiagarajan, Rushil Anirudh, and Andreas Spanias. Unsupervised audio source separation using generative priors. Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, 2020-October: 0 2657--2661, 2020. doi:10.21437/Interspeech.2020-3115

work page doi:10.21437/interspeech.2020-3115 2020
[35]

Multichannel music separation with deep neural networks

Aditya Arie Nugraha, Antoine Liutkus, and Emmanuel Vincent. Multichannel music separation with deep neural networks. In 24th European Signal Processing Conference, EUSIPCO 2016, Budapest, Hungary, August 29 - September 2, 2016 , pp.\ 1748--1752. IEEE , 2016

work page 2016
[36]

A diffusion-inspired training strategy for singing voice extraction in the waveform domain

Gen \'i s Plaja-Roglans, Miron Marius, and Xavier Serra. A diffusion-inspired training strategy for singing voice extraction in the waveform domain. In Proc. of the 23rd Int. Society for Music Information Retrieval, 2022

work page 2022
[37]

Latent autoregressive source separation

Emilian Postolache, Giorgio Mariani, Michele Mancusi, Andrea Santilli, Luca Cosmo, and Emanuele Rodol\`a. Latent autoregressive source separation. In Proc. AAAI, AAAI Press, 2023 a

work page 2023
[38]

Adversarial permutation invariant training for universal sound separation

Emilian Postolache, Jordi Pons, Santiago Pascual, and Joan Serr \`a . Adversarial permutation invariant training for universal sound separation. In ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2023 b

work page 2023
[39]

The MUSDB18 corpus for music separation, December 2017

Zafar Rafii, Antoine Liutkus, Fabian-Robert St \"o ter, Stylianos Ioannis Mimilakis, and Rachel Bittner. The MUSDB18 corpus for music separation, December 2017

work page 2017
[40]

U-net: Convolutional networks for biomedical image segmentation

Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention--MICCAI 2015: 18th International Conference, Munich, Germany, October 5-9, 2015, Proceedings, Part III 18, pp.\ 234--241. Springer, 2015

work page 2015
[41]

Jonathan Le Roux, Scott Wisdom, Hakan Erdogan, and John R. Hershey. Sdr – half-baked or well done? In ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp.\ 626--630, 2019

work page 2019
[42]

Diffusion-based generative speech source separation

Robin Scheibler, Youna Ji, Soo-Whan Chung, Jaeuk Byun, Soyeon Choe, and Min-Seok Choi. Diffusion-based generative speech source separation. In ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2023

work page 2023
[43]

Mo \^ u sai: Efficient text-to-music diffusion models

Flavio Schneider, Ojasv Kamal, Zhijing Jin, and Bernhard Sch \" o lkopf. Mo \^ u sai: Efficient text-to-music diffusion models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2024, Bangkok, Thailand, August 11-16, 2024 , pp.\ 8050--8068. Association for Computational Linguistics, 2024

work page 2024
[44]

Diffusion-based speech enhancement with joint generative and predictive decoders

Hao Shi, Kazuki Shimada, Masato Hirano, Takashi Shibuya, Yuichiro Koyama, Zhi Zhong, Shusuke Takahashi, Tatsuya Kawahara, and Yuki Mitsufuji. Diffusion-based speech enhancement with joint generative and predictive decoders. In ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp.\ 12951--12955. IEEE, 2024

work page 2024
[45]

Weiss, Niru Maheswaranathan, and Surya Ganguli

Jascha Sohl - Dickstein, Eric A. Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In Francis R. Bach and David M. Blei (eds.), Proceedings ICML 2015, Lille, France, 6-11 July 2015 , volume 37 of JMLR Workshop and Conference Proceedings , pp.\ 2256--2265. JMLR.org, 2015

work page 2015
[46]

Denoising diffusion implicit models

Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. In International Conference on Learning Representations, 2020

work page 2020
[47]

Improved techniques for training consistency models

Yang Song and Prafulla Dhariwal. Improved techniques for training consistency models. In ICLR, 2024

work page 2024
[48]

Generative modeling by estimating gradients of the data distribution

Yang Song and Stefano Ermon. Generative modeling by estimating gradients of the data distribution. In Hanna M. Wallach, Hugo Larochelle, Alina Beygelzimer, Florence d'Alch \' e - Buc, Emily B. Fox, and Roman Garnett (eds.), Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, ...

work page 2019
[49]

Score-based generative modeling through stochastic differential equations

Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. In International Conference on Learning Representations, 2021

work page 2021
[50]

Consistency models

Yang Song, Prafulla Dhariwal, Mark Chen, and Ilya Sutskever. Consistency models. In International Conference on Machine Learning, ICML 2023, 23-29 July 2023, Honolulu, Hawaii, USA , volume 202 of Proceedings of Machine Learning Research, pp.\ 32211--32252. PMLR , 2023

work page 2023
[51]

Wave-u-net: A multi-scale neural network for end-to-end audio source separation

Daniel Stoller, Sebastian Ewert, and Simon Dixon. Wave-u-net: A multi-scale neural network for end-to-end audio source separation. In Proceedings of the 19th International Society for Music Information Retrieval Conference, ISMIR 2018, Paris, France, September 23-27, 2018 , pp.\ 334--340, 2018

work page 2018
[52]

Generative adversarial source separation

Y Cem Subakan and Paris Smaragdis. Generative adversarial source separation. In Proc. ICASSP, pp.\ 26--30. IEEE, 2018

work page 2018
[53]

Multi-scale multi-band densenets for audio source separation

Naoya Takahashi and Yuki Mitsufuji. Multi-scale multi-band densenets for audio source separation. In 2017 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, WASPAA 2017, New Paltz, NY, USA, October 15-18, 2017 , pp.\ 21--25. IEEE , 2017

work page 2017
[54]

D3net: Densely connected multidilated densenet for music source separation

Naoya Takahashi and Yuki Mitsufuji. D3net: Densely connected multidilated densenet for music source separation. arXiv preprint arXiv:2010.01733, 2020

work page arXiv 2010
[55]

Mmdenselstm: An efficient combination of convolutional and recurrent neural networks for audio source separation

Naoya Takahashi, Nabarun Goswami, and Yuki Mitsufuji. Mmdenselstm: An efficient combination of convolutional and recurrent neural networks for audio source separation. In Proc. IWAENC, pp. 106–110, 2018. doi:10.1109/IWAENC.2018.8521383

[56]

Deep neural network based instrument extraction from music

Stefan Uhlich, Franck Giron, and Yuki Mitsufuji. Deep neural network based instrument extraction from music. In 2015 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2015, South Brisbane, Queensland, Australia, April 19-24, 2015, pp. 2135–2139. IEEE, 2015

[57]

Improving music source separation based on deep neural networks through data augmentation and network blending

Stefan Uhlich, Marcello Porcu, Franck Giron, Michael Enenkl, Thomas Kemp, Naoya Takahashi, and Yuki Mitsufuji. Improving music source separation based on deep neural networks through data augmentation and network blending. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2017, New Orleans, LA, USA, March 5-9, 2017, ...

[58]

WaveNet: A Generative Model for Raw Audio

Aäron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu. WaveNet: A Generative Model for Raw Audio. In Proc. 9th ISCA Workshop on Speech Synthesis Workshop (SSW 9), p. 125, 2016

[59]

Unsupervised sound separation using mixture invariant training

Scott Wisdom, Efthymios Tzinis, Hakan Erdogan, Ron Weiss, Kevin Wilson, and John Hershey. Unsupervised sound separation using mixture invariant training. Advances in Neural Information Processing Systems, 33:3846–3857, 2020

[60]

Zero-shot duet singing voices separation with diffusion models

Chin-Yun Yu, Emilian Postolache, Emanuele Rodolà, and György Fazekas. Zero-shot duet singing voices separation with diffusion models. arXiv:2311.07345, 2023

[61]

Music source separation with generative flow

Ge Zhu, Jordan Darefsky, Fei Jiang, Anton Selitskiy, and Zhiyao Duan. Music source separation with generative flow. IEEE Signal Process. Lett., 29:2288–2292, 2022. doi:10.1109/LSP.2022.3219355


