Discrete Token Modeling for Multi-Stem Music Source Separation with Language Models
Pith reviewed 2026-05-10 16:23 UTC · model grok-4.3
The pith
A generative language model separates multi-stem music by autoregressively generating discrete audio tokens from the mixture.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that a framework combining a Conformer-based conditional encoder, the HCodec dual-path neural audio codec, and a decoder-only language model can autoregressively generate discrete tokens for four target tracks from a mixed input. The tokens are decoded to waveforms, and evaluation on MUSDB18-HQ shows perceptual quality approaching state-of-the-art discriminative methods while attaining the highest NISQA score on vocals. Ablation studies confirm the benefit of the learnable Conformer encoder and of sequential cross-track generation order.
What carries the argument
Autoregressive generation of discrete audio tokens by a decoder-only language model, conditioned on a Conformer-encoded mixture and produced via the HCodec neural codec.
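To make the mechanism concrete, sequential cross-track generation can be sketched as a loop in which each stem's tokens are decoded greedily and appended to a shared context, so later stems condition on earlier ones. This is a hypothetical toy sketch (stub scorer, made-up names and sizes), not the paper's implementation:

```python
import numpy as np

# Toy sketch of sequential cross-track token generation. A real system would
# use a trained decoder-only transformer over HCodec tokens; here the scorer
# is a deterministic stub so the control flow is the only point of interest.
STEMS = ["vocals", "drums", "bass", "other"]
VOCAB_SIZE = 8        # hypothetical codec vocabulary size
TOKENS_PER_STEM = 4   # hypothetical sequence length per stem

def next_token_logits(mix_encoding, context):
    """Stub scorer: a deterministic function of the mixture and context length."""
    rng = np.random.default_rng(len(context) + int(mix_encoding.sum()))
    return rng.standard_normal(VOCAB_SIZE)

def generate_stems(mix_encoding):
    context = []  # shared context: earlier stems' tokens condition later stems
    stems = {}
    for stem in STEMS:
        tokens = []
        for _ in range(TOKENS_PER_STEM):
            logits = next_token_logits(mix_encoding, context)
            tok = int(np.argmax(logits))  # greedy decoding for the sketch
            tokens.append(tok)
            context.append(tok)
        stems[stem] = tokens
    return stems

stems = generate_stems(np.ones(16))
```

The single growing `context` is the point: it is what distinguishes sequential cross-track generation from decoding each stem independently.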
Load-bearing premise
That an autoregressive language model given only the mixed input can faithfully recover independent source tracks without introducing significant cross-talk, phase inconsistencies, or perceptual artifacts.
What would settle it
The claim would be falsified by a controlled listening test or metric re-evaluation on MUSDB18-HQ that finds audible instrument bleed, overall perceptual scores below current discriminative baselines, or failure to reproduce the reported best NISQA score on vocals.
Original abstract
We propose a generative framework for multi-track music source separation (MSS) that reformulates the task as conditional discrete token generation. Unlike conventional approaches that directly estimate continuous signals in the time or frequency domain, our method combines a Conformer-based conditional encoder, a dual-path neural audio codec (HCodec), and a decoder-only language model to autoregressively generate audio tokens for four target tracks. The generated tokens are decoded back to waveforms through the codec decoder. Evaluation on the MUSDB18-HQ benchmark shows that our generative approach achieves perceptual quality approaching state-of-the-art discriminative methods, while attaining the highest NISQA score on the vocals track. Ablation studies confirm the effectiveness of the learnable Conformer encoder and the benefit of sequential cross-track generation.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes a generative framework for multi-stem music source separation that reformulates the task as conditional discrete token generation. It combines a Conformer-based conditional encoder, a dual-path neural audio codec (HCodec), and a decoder-only language model to autoregressively generate HCodec tokens for four target stems (vocals, drums, bass, other) in sequence from the mixed input; the tokens are then decoded to waveforms. On the MUSDB18-HQ benchmark the method is reported to achieve perceptual quality approaching state-of-the-art discriminative approaches while attaining the highest NISQA score on the vocals track; ablations are said to confirm the value of the learnable Conformer encoder and of sequential cross-track generation.
Significance. If the benchmark results hold under rigorous verification, the work would demonstrate that decoder-only language models operating on discrete audio tokens can be competitive with established discriminative methods for multi-stem separation. The reformulation as autoregressive token modeling introduces a new modeling paradigm that could enable better long-range dependency capture and perceptual quality, as hinted by the NISQA result on vocals. The approach also supplies a concrete example of how neural audio codecs can be integrated with language-model-style generation for audio tasks.
Major comments (3)
- [Abstract / ablation studies] The claim that sequential cross-track generation is beneficial is not accompanied by quantitative measurements of cross-stem leakage, phase drift, or mixture inconsistency. Because the decoder-only LM generates stems sequentially without any described mixture-consistency loss or bidirectional conditioning, early-track errors can propagate; the absence of such diagnostics is load-bearing for the central claim that the generative outputs faithfully recover independent sources.
- [Evaluation on MUSDB18-HQ] The reported highest NISQA score on vocals and the statement that perceptual quality "approaches" SOTA discriminative methods are presented without standard deviations, statistical significance tests, or per-stem SI-SDR / SDR comparisons against the strongest baselines. Without these, it is impossible to determine whether the generative approach is statistically competitive or merely within the variance of existing methods.
- [Proposed framework] The framework encodes the mixture with a Conformer and then autoregressively emits HCodec tokens for each stem; no post-processing or consistency term is described to enforce that the sum of the four decoded waveforms equals the input mixture. Given that HCodec quantization already discards continuous phase information, the lack of an explicit consistency mechanism risks audible artifacts that standard perceptual metrics such as NISQA may not penalize.
Minor comments (2)
- [Abstract] The abstract states that the method attains the 'highest NISQA score on the vocals track' but does not specify the exact numerical value or the competing systems; adding the concrete numbers would improve readability.
- [Methods] Notation for the four stems and the ordering in which they are generated sequentially should be defined explicitly in the methods section to avoid ambiguity when discussing cross-track conditioning.
Simulated Author's Rebuttal
We thank the referee for the thorough and constructive review. We address each major comment point by point below, providing the strongest honest defense of the manuscript while indicating where revisions will be made to strengthen the work.
Point-by-point responses
- Referee: [Abstract / ablation studies] The claim that sequential cross-track generation is beneficial is not accompanied by quantitative measurements of cross-stem leakage, phase drift, or mixture inconsistency. Because the decoder-only LM generates stems sequentially without any described mixture-consistency loss or bidirectional conditioning, early-track errors can propagate; the absence of such diagnostics is load-bearing for the central claim that the generative outputs faithfully recover independent sources.
Authors: We acknowledge that explicit quantitative diagnostics for cross-stem leakage, phase drift, and mixture inconsistency would provide stronger support for the ablation claim. Our existing ablations demonstrate that sequential cross-track generation improves perceptual metrics (including NISQA) over independent generation, indicating that the decoder-only LM effectively captures inter-stem dependencies. In the revised manuscript, we will add direct measurements such as the L2 residual between the input mixture and the sum of decoded stems, as well as average cross-stem correlation coefficients, to quantify leakage and consistency. This will address the concern about error propagation while preserving the generative modeling paradigm. revision: yes
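The diagnostics the authors promise (residual between the mixture and the stem sum, plus cross-stem correlation) could be computed along these lines; this is a minimal numpy sketch with synthetic signals, not the authors' evaluation code:

```python
import numpy as np

def mixture_residual_l2(mixture, stems):
    # Relative L2 norm of (mixture - sum of decoded stems); 0 means the
    # separated stems are perfectly mixture-consistent.
    residual = mixture - np.sum(stems, axis=0)
    return float(np.linalg.norm(residual) / (np.linalg.norm(mixture) + 1e-12))

def mean_cross_stem_correlation(stems):
    # Mean |Pearson r| over all stem pairs; a crude proxy for cross-stem leakage.
    n = len(stems)
    pair_corrs = [abs(np.corrcoef(stems[i], stems[j])[0, 1])
                  for i in range(n) for j in range(i + 1, n)]
    return float(np.mean(pair_corrs))

# Synthetic check: ideal stems sum exactly to the mixture and, being
# independent noise here, show near-zero pairwise correlation.
rng = np.random.default_rng(0)
true_stems = rng.standard_normal((4, 48000))
mixture = true_stems.sum(axis=0)
```

On real separations, a large residual or a pairwise correlation well above the chance level would flag exactly the leakage and inconsistency the referee asks about.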
- Referee: [Evaluation on MUSDB18-HQ] The reported highest NISQA score on vocals and the statement that perceptual quality "approaches" SOTA discriminative methods are presented without standard deviations, statistical significance tests, or per-stem SI-SDR / SDR comparisons against the strongest baselines. Without these, it is impossible to determine whether the generative approach is statistically competitive or merely within the variance of existing methods.
Authors: We agree that reporting variability and direct statistical comparisons would make the evaluation more rigorous and allow clearer assessment of competitiveness. In the revised version, we will include standard deviations (computed across inference seeds or available data partitions), full per-stem SI-SDR and SDR tables against the strongest baselines (e.g., HTDemucs), and appropriate statistical significance tests such as paired Wilcoxon tests on the MUSDB18-HQ results. These additions will clarify whether the observed NISQA gains and perceptual quality are statistically meaningful. revision: yes
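As an illustration of the kind of paired comparison proposed, here is a pure-Python exact sign test over paired per-track scores; it is a simpler stand-in for the Wilcoxon signed-rank test mentioned above, and the score values in the checks are made up:

```python
from math import comb

def paired_sign_test(scores_a, scores_b):
    """Exact two-sided sign test on paired scores.

    H0: on a randomly chosen item, system A beats system B with probability 0.5.
    Ties are discarded, as is standard for the sign test.
    """
    diffs = [a - b for a, b in zip(scores_a, scores_b) if a != b]
    n = len(diffs)
    wins = sum(d > 0 for d in diffs)
    tail = min(wins, n - wins)
    # Exact binomial tail probability at p = 0.5, doubled for a two-sided test.
    one_sided = sum(comb(n, i) for i in range(tail + 1)) / 2 ** n
    return min(1.0, 2 * one_sided)
```

With only 50 test tracks in MUSDB18-HQ, an exact paired test of this kind (or the Wilcoxon variant, which also uses the magnitudes of the differences) is more defensible than comparing means alone.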
- Referee: [Proposed framework] The framework encodes the mixture with a Conformer and then autoregressively emits HCodec tokens for each stem; no post-processing or consistency term is described to enforce that the sum of the four decoded waveforms equals the input mixture. Given that HCodec quantization already discards continuous phase information, the lack of an explicit consistency mechanism risks audible artifacts that standard perceptual metrics such as NISQA may not penalize.
Authors: The design choice to omit an explicit consistency loss is deliberate: the decoder-only LM is trained end-to-end to generate tokens conditioned on the mixture via the Conformer encoder, allowing it to learn high-fidelity, perceptually natural reconstructions without the over-smoothing often induced by additive constraints in discriminative models. The dual-path HCodec further aids phase preservation. That said, we recognize the referee's valid point regarding potential artifacts from quantization. In the revision, we will add a quantitative analysis of mixture inconsistency error and explore a lightweight post-processing consistency step (e.g., a simple projection) if it improves results without harming perceptual quality. revision: partial
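The "simple projection" floated in the rebuttal could take the form of the standard mixture-consistency projection: spread the residual equally across stems so the corrected estimates sum exactly to the mixture. A hypothetical numpy sketch (toy data, not the authors' code):

```python
import numpy as np

def project_mixture_consistent(mixture, est_stems):
    # Add an equal share of the residual (mixture minus stem sum) to every stem,
    # so the corrected stems sum exactly to the input mixture.
    est_stems = np.asarray(est_stems, dtype=float)
    residual = mixture - est_stems.sum(axis=0)
    return est_stems + residual / est_stems.shape[0]

# Toy check with deliberately imperfect stem estimates.
rng = np.random.default_rng(1)
true_stems = rng.standard_normal((4, 1000))
mixture = true_stems.sum(axis=0)
est = true_stems + 0.1 * rng.standard_normal((4, 1000))  # noisy estimates
fixed = project_mixture_consistent(mixture, est)
```

The projection is cheap and differentiable, so it could be applied either as post-processing or inside training, at the cost of redistributing some estimation error across stems.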
Circularity Check
No circularity: claims rest on external benchmark evaluation and architectural ablations
Full rationale
The paper presents an empirical modeling approach that reformulates multi-stem source separation as autoregressive discrete token generation using a Conformer encoder, HCodec, and decoder-only LM. Performance is assessed via standard metrics (NISQA, perceptual quality) on the held-out MUSDB18-HQ benchmark against external SOTA discriminative baselines. No derivation chain, equation, or first-principles result is claimed; ablations simply test the contribution of the learnable encoder and sequential conditioning without reducing any output to a fitted quantity defined by the same model. The central claims are therefore falsifiable against independent data and do not loop back to the model's own inputs or self-citations.
Reference graph
Works this paper leans on
- [1] "This task is central to applications such as music remixing, transcription, karaoke generation, and enhancing the experience of hearing-impaired individuals."
  Introduction: Music Source Separation (MSS) aims to decompose a mixture of audio signals into individual sources, such as vocals, drums, bass, and other instruments. This task is central to applications such as music remixing, transcription, karaoke generation, and enhancing the experience of hearing-impaired individuals. Current mainstream MSS methods p...
- [2] Method 2.1. Overall Architecture: Given a mixture waveform x_mix ∈ R^T sampled at 48 kHz, the framework predicts discrete token sequences for four target tracks {x_s}, s ∈ S, where S = {vocals, drums, bass, other}, and reconstructs the separated waveforms via a neural codec decoder. As illustrated in Figure 1, the proposed framework consists of three components: (1) a ...
- [3] "We train the proposed model on a large-scale internal music dataset with approximately 23,000 hours of 44.1 kHz audio."
  Experimental Setup 3.1. Dataset and Data Augmentation: We train the proposed model on a large-scale internal music dataset with approximately 23,000 hours of 44.1 kHz audio. The dataset includes songs, audiobooks, and instrumental tracks. Since the original recordings do not provide isolated track annotations, we apply BS-RoFormer [11], a state-of-the-...
- [4] "Table 2: Evaluation of separated vocals track quality"
  Results: We compare our generative approach (denoted "G") against three discriminative baselines (denoted "D"): HTDemucs [10], a hybrid time-frequency model; BS-RoFormer [11] and SCNet [2], both frequency-domain mask estimation methods. Table 2: Evaluation of separated vocals track quality. Model Type SIG BAK OVRL NISQA HTDemucs4 D 2.71 3.22 2.25 2.19 BS-Ro...
- [5] "Limitations: Our approach faces several limitations."
  Discussion 5.1. Limitations: Our approach faces several limitations. First, the autoregressive generation paradigm struggles with percussive sources characterized by sharp transients, as evidenced by the performance gap on the drums track (3.44 vs. 3.77 to 3.88 for discriminative baselines). This may be due to the sequential nature of token-by-token gen...
- [6] Conclusion: We presented a conditional discrete-generation framework for multi-track music source separation that reformulates the task as autoregressive token prediction. Experiments on MUSDB18-HQ validate that this generative paradigm can approach the perceptual quality of established discriminative methods, and even surpass them on the vocals track a...
- [7] Generative AI Use Disclosure: Generative AI tools were used to assist with English language editing and proofreading of this manuscript. All technical content, experimental design, and scientific conclusions are solely the work of the authors.
- [8] S. Uhlich, M. Porcu, F. Giron, M. Enenkl, T. Kemp, N. Takahashi, and Y. Mitsufuji, "Improving music source separation based on deep neural networks through data augmentation and network blending," in 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2017, pp. 261–265.
- [9] W. Tong, J. Zhu, J. Chen, S. Kang, T. Jiang, Y. Li, Z. Wu, and H. Meng, "SCNet: Sparse compression network for music source separation," in ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2024, pp. 1276–1280.
- [10] A. Défossez, "Hybrid spectrogram and waveform source separation," arXiv preprint arXiv:2111.03600, 2021.
- [11] M. Kim, W. Choi, J. Chung, D. Lee, and S. Jung, "KUIELab-MDX-Net: A two-stream neural network for music demixing," arXiv preprint arXiv:2111.12203, 2021.
- [12] Y. Mitsufuji, G. Fabbro, S. Uhlich, F.-R. Stöter, A. Défossez, M. Kim, W. Choi, C.-Y. Yu, and K.-W. Cheuk, "Music demixing challenge 2021," Frontiers in Signal Processing, vol. 1, p. 808395, 2022.
- [13] G. Fabbro, S. Uhlich, C.-H. Lai, W. Choi, M. Martínez-Ramírez, W. Liao, I. Gadelha, G. Ramos, E. Hsu, H. Rodrigues, F.-R. Stöter, A. Défossez, Y. Luo, J. Yu, D. Chakraborty, S. Mohanty, R. Solovyev, A. Stempkovskiy, T. Habruseva, N. Goswami, T. Harada, M. Kim, J. Hyung Lee, Y. Dong, X. Zhang, J. Liu, and Y. Mitsufuji, "The sound demixing challenge 2023 – music demixing track," ...
- [14] P. Chandna, M. Miron, J. Janer, and E. Gómez, "Monoaural audio source separation using deep convolutional neural networks," in International Conference on Latent Variable Analysis and Signal Separation. Springer, 2017, pp. 258–266.
- [15] Y. Luo and J. Yu, "Music source separation with band-split RNN," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 31, pp. 1893–1901, 2023.
- [16] A. Défossez, N. Usunier, L. Bottou, and F. Bach, "Music source separation in the waveform domain," arXiv preprint arXiv:1911.13254, 2019.
- [17] S. Rouard, F. Massa, and A. Défossez, "Hybrid transformers for music source separation," in ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2023, pp. 1–5.
- [18] W.-T. Lu, J.-C. Wang, Q. Kong, and Y.-N. Hung, "Music source separation with band-split RoPE transformer," in ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2024, pp. 481–485.
- [19] N. Zeghidour, A. Luebs, A. Omran, J. Skoglund, and M. Tagliasacchi, "SoundStream: An end-to-end neural audio codec," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 30, pp. 495–507, 2021.
- [20] A. Défossez, J. Copet, G. Synnaeve, and Y. Adi, "High fidelity neural audio compression," arXiv preprint arXiv:2210.13438, 2022.
- [21] Z. Borsos, R. Marinier, D. Vincent, E. Kharitonov, O. Pietquin, M. Sharifi, D. Roblek, O. Teboul, D. Grangier, M. Tagliasacchi et al., "AudioLM: A language modeling approach to audio generation," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 31, pp. 2523–2533, 2023.
- [22] S. Chen, C. Wang, Y. Wu, Z. Zhang, L. Zhou, S. Liu, Z. Chen, Y. Liu, H. Wang, J. Li et al., "Neural codec language models are zero-shot text to speech synthesizers," IEEE Transactions on Audio, Speech and Language Processing, vol. 33, pp. 705–718, 2025.
- [23] H. Erdogan, S. Wisdom, X. Chang, Z. Borsos, M. Tagliasacchi, N. Zeghidour, and J. R. Hershey, "TokenSplit: Using discrete speech representations for direct, refined, and transcript-conditioned speech separation and recognition," arXiv preprint arXiv:2308.10415, 2023.
- [24] B. Tang, B. Zeng, and M. Li, "TSELM: Target speaker extraction using discrete tokens and language models," in National Conference on Man-Machine Speech Communication. Springer, 2025, pp. 459–469.
- [25] Y. Wang, H. Chen, D. Yang, W. Li, D. Luo, G. Li, S. Yang, Z. Wu, H. Meng, and X. Wu, "UniSep: Universal target audio separation with language models at scale," in 2025 IEEE International Conference on Multimedia and Expo (ICME). IEEE, 2025, pp. 1–6.
- [26] C. Liu, H. Yan, S. Xue, X. Liang, Y. Liu, Z. Xue, G. Song, and B. Zhou, "UniTok-Audio: A unified audio generation framework via generative modeling on discrete codec tokens," arXiv preprint arXiv:2510.26372, 2025.
- [27] C. Liu, H. Yan, S. Xue, X. Liang, X. Chen, B. Gong, Z. Xue, and G. Song, "QuarkAudio technical report," arXiv preprint arXiv:2512.20151, 2025.
- [28] Z. Rafii, A. Liutkus, F.-R. Stöter, S. I. Mimilakis, and R. Bittner, "MUSDB18-HQ - an uncompressed version of MUSDB18," Aug. 2019. [Online]. Available: https://doi.org/10.5281/zenodo.3338373
- [29] W.-N. Hsu, B. Bolte, Y.-H. H. Tsai, K. Lakhotia, R. Salakhutdinov, and A. Mohamed, "HuBERT: Self-supervised speech representation learning by masked prediction of hidden units," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 29, pp. 3451–3460, 2021.
- [30] H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar et al., "LLaMA: Open and efficient foundation language models," arXiv preprint arXiv:2302.13971, 2023.
- [31] Silero Team, "Silero VAD: Pre-trained enterprise-grade voice activity detector (VAD), number detector and language classifier," https://github.com/snakers4/silero-vad, 2024.
- [32] A. Hines, J. Skoglund, A. C. Kokaram, and N. Harte, "ViSQOL: An objective speech quality model," EURASIP Journal on Audio, Speech, and Music Processing, vol. 2015, no. 1, p. 13, 2015.
- [33] M. Chinen, F. S. Lim, J. Skoglund, N. Gureev, F. O'Gorman, and A. Hines, "ViSQOL v3: An open source production ready objective speech and audio metric," in 2020 Twelfth International Conference on Quality of Multimedia Experience (QoMEX). IEEE, 2020, pp. 1–6.
- [34] C. K. Reddy, V. Gopal, and R. Cutler, "DNSMOS P.835: A non-intrusive perceptual objective speech quality metric to evaluate noise suppressors," in ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2022, pp. 886–890.
- [35] G. Mittag, B. Naderi, A. Chehadi, and S. Möller, "NISQA: A deep CNN-self-attention model for multidimensional speech quality prediction with crowdsourced datasets," arXiv preprint arXiv:2104.09494, 2021.