Meta Audiobox Aesthetics: Unified Automatic Quality Assessment for Speech, Music, and Sound
Pith reviewed 2026-05-17 00:25 UTC · model grok-4.3
The pith
Decomposing audio aesthetics into four axes lets automatic models predict quality for speech, music, and sound at human-comparable levels.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We introduce annotation guidelines that decompose human listening perspectives into four distinct axes and train no-reference per-item prediction models that assess audio aesthetic quality for speech, music, and sound, achieving performance comparable or superior to existing methods when measured against human mean opinion scores.
What carries the argument
The four-axis annotation guidelines that decompose subjective listening perspectives into separate components for training unified no-reference prediction models.
If this is right
- Enables scalable filtering and curation of large audio datasets without repeated human listening.
- Supports pseudo-labeling for training and improving generative audio models.
- Provides a single evaluation approach that covers speech, music, and general sound in one system.
- Allows consistent benchmarking of new generative models against a fixed automated scorer.
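As a rough illustration of the filtering use case, the sketch below keeps only clips whose predicted scores clear a per-axis threshold. The axis names, score values, thresholds, and the `filter_by_scores` helper are all hypothetical: the summary above does not name the four axes, and this is not the released model's API.

```python
from typing import Dict, List

# Hypothetical axis names; the text above does not name the four axes.
AXES = ("axis_1", "axis_2", "axis_3", "axis_4")


def filter_by_scores(items: List[Dict], thresholds: Dict[str, float]) -> List[Dict]:
    """Keep items whose predicted score meets every per-axis threshold."""
    kept = []
    for item in items:
        scores = item["scores"]
        # An item survives only if it clears the minimum on each listed axis.
        if all(scores[axis] >= minimum for axis, minimum in thresholds.items()):
            kept.append(item)
    return kept


# Made-up predictions standing in for the output of an automatic scorer.
clips = [
    {"path": "a.wav", "scores": {"axis_1": 7.2, "axis_2": 4.1, "axis_3": 6.8, "axis_4": 6.5}},
    {"path": "b.wav", "scores": {"axis_1": 3.9, "axis_2": 5.0, "axis_3": 4.2, "axis_4": 4.0}},
    {"path": "c.wav", "scores": {"axis_1": 6.1, "axis_2": 2.5, "axis_3": 5.9, "axis_4": 6.0}},
]

# Thresholds are illustrative; real values would be tuned per dataset.
kept = filter_by_scores(clips, {"axis_1": 6.0, "axis_3": 5.5})
print([clip["path"] for clip in kept])  # prints ['a.wav', 'c.wav']
```

Thresholding each axis separately, rather than averaging them, reflects the idea that the axes measure different things; whether that is the right policy for a given dataset is a judgment call.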
Where Pith is reading between the lines
- The models could be adapted for real-time quality monitoring in audio streaming or editing software.
- Cultural variation in aesthetics might require adding or weighting axes differently for global applications.
- Combining these predictions with other signals like content safety could improve automated audio moderation pipelines.
Load-bearing premise
The four-axis annotation guidelines sufficiently capture the subjective and culturally influenced nature of audio aesthetics for the tested domains and generalize to new data.
What would settle it
A new test set of audio samples drawn from different cultural contexts where human mean opinion scores show low correlation with the four-axis model predictions would indicate the guidelines fail to generalize.
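The check described above amounts to correlating model predictions with human MOS on the new test set. A minimal sketch, assuming per-item MOS and predictions for one axis are already available as plain lists (the numbers below are invented, and the choice of Pearson and Spearman coefficients is an assumption, not something stated in the text):

```python
from math import sqrt
from typing import List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson linear correlation coefficient."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j across the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Invented example values for a single axis.
human_mos = [4.5, 3.0, 2.0, 4.0, 1.5]
predicted = [4.2, 3.1, 2.4, 3.6, 2.0]

print(f"Pearson:  {pearson(human_mos, predicted):.3f}")
print(f"Spearman: {spearman(human_mos, predicted):.3f}")
```

A low coefficient on a culturally distinct test set, alongside high coefficients on the original data, would be the failure signal the paragraph describes.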
Original abstract
The quantification of audio aesthetics remains a complex challenge in audio processing, primarily due to its subjective nature, which is influenced by human perception and cultural context. Traditional methods often depend on human listeners for evaluation, leading to inconsistencies and high resource demands. This paper addresses the growing need for automated systems capable of predicting audio aesthetics without human intervention. Such systems are crucial for applications like data filtering, pseudo-labeling large datasets, and evaluating generative audio models, especially as these models become more sophisticated. In this work, we introduce a novel approach to audio aesthetic evaluation by proposing new annotation guidelines that decompose human listening perspectives into four distinct axes. We develop and train no-reference, per-item prediction models that offer a more nuanced assessment of audio quality. Our models are evaluated against human mean opinion scores (MOS) and existing methods, demonstrating comparable or superior performance. This research not only advances the field of audio aesthetics but also provides open-source models and datasets to facilitate future work and benchmarking. We release our code and pre-trained model at: https://github.com/facebookresearch/audiobox-aesthetics
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces four-axis annotation guidelines that decompose subjective audio aesthetics for speech, music, and sound into distinct perspectives. It trains no-reference per-item prediction models on these axes and reports that, when evaluated against human mean opinion scores (MOS), the resulting models perform comparably to or better than prior methods. The work emphasizes applications in data filtering, pseudo-labeling, and generative model evaluation, and releases code, models, and datasets.
Significance. If the four-axis labels prove reliable and the performance gains hold under proper validation, the framework offers a practical, unified no-reference tool for audio quality assessment that could reduce reliance on costly human listening tests. The open-source release of models and data is a clear strength for reproducibility and community benchmarking.
major comments (2)
- [Annotation guidelines and data collection] The central performance claim rests on the four-axis guidelines producing consistent training targets. The manuscript does not report inter-annotator reliability statistics (e.g., ICC, Krippendorff’s alpha, or per-axis pairwise agreement) for the collected annotations. Without these metrics, it is impossible to assess label noise or stability across annotator pools, which directly affects whether the reported MOS correlations reflect genuine generalization or annotation artifacts.
- [Experiments and results] The evaluation section compares models to human MOS and baselines but provides no details on train/test splits, cross-domain testing, or statistical significance testing of the claimed improvements. This makes it difficult to determine whether the “comparable or superior” result is robust or sensitive to particular data partitions.
minor comments (1)
- [Abstract] The abstract refers to “four distinct axes” without naming them; explicitly listing the axes (e.g., in the introduction or guidelines section) would improve immediate clarity for readers.
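To make the first major comment concrete, here is a minimal sketch of one of the reliability statistics it asks for: the one-way random-effects intraclass correlation, ICC(1,1), computed for a single axis. It assumes a complete items-by-raters table with no missing ratings; the function name and the toy ratings are illustrative and do not come from the paper.

```python
from typing import Sequence


def icc_one_way(ratings: Sequence[Sequence[float]]) -> float:
    """ICC(1,1): one-way random effects, single rater, items x raters table."""
    n = len(ratings)
    k = len(ratings[0])
    if n < 2 or k < 2 or any(len(row) != k for row in ratings):
        raise ValueError("need at least 2 items and 2 raters, with no missing ratings")

    item_means = [sum(row) / k for row in ratings]
    grand_mean = sum(item_means) / n

    # Mean square between items and mean square within items.
    ms_between = k * sum((m - grand_mean) ** 2 for m in item_means) / (n - 1)
    ms_within = sum(
        (value - mean) ** 2 for row, mean in zip(ratings, item_means) for value in row
    ) / (n * (k - 1))

    denominator = ms_between + (k - 1) * ms_within
    if denominator == 0:
        raise ValueError("ICC is undefined when all ratings are identical")
    return (ms_between - ms_within) / denominator


# Toy ratings for one axis: 4 clips, each scored by 3 annotators.
toy_axis_ratings = [
    [7, 8, 7],
    [3, 4, 3],
    [5, 5, 6],
    [9, 8, 9],
]
print(round(icc_one_way(toy_axis_ratings), 3))
```

Values near 1 indicate that most rating variance comes from differences between clips rather than between annotators. Real annotation pools usually have missing ratings and varying rater sets, where Krippendorff's alpha or a mixed-effects ICC would be a better fit than this simplified form.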
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address each major comment below and have revised the paper to incorporate the requested details on annotation reliability and experimental protocols.
-
Referee: [Annotation guidelines and data collection] The central performance claim rests on the four-axis guidelines producing consistent training targets. The manuscript does not report inter-annotator reliability statistics (e.g., ICC, Krippendorff’s alpha, or per-axis pairwise agreement) for the collected annotations. Without these metrics, it is impossible to assess label noise or stability across annotator pools, which directly affects whether the reported MOS correlations reflect genuine generalization or annotation artifacts.
Authors: We agree that inter-annotator reliability metrics are necessary to substantiate the quality of the four-axis annotations. These statistics (including ICC and Krippendorff’s alpha per axis) were computed as part of our internal validation but omitted from the initial submission. The revised manuscript will include a new subsection under Data Collection that reports these values along with per-axis pairwise agreement rates, allowing readers to evaluate label consistency directly. revision: yes
-
Referee: [Experiments and results] The evaluation section compares models to human MOS and baselines but provides no details on train/test splits, cross-domain testing, or statistical significance testing of the claimed improvements. This makes it difficult to determine whether the “comparable or superior” result is robust or sensitive to particular data partitions.
Authors: We acknowledge the need for greater transparency in the experimental setup. The revised version will expand the Experiments section to explicitly describe the train/test split strategy (including proportions and any stratification by domain), detail the cross-domain testing protocol across speech, music, and general sound, and present statistical significance results (e.g., paired t-tests or Wilcoxon tests with p-values) for all reported improvements over baselines, measured against human MOS. These additions will be supported by updated tables and text. revision: yes
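The kind of paired significance check discussed in this exchange can be sketched as follows. This example swaps in an exact sign-flip permutation test in place of the paired t-test or Wilcoxon test named in the response, because it needs only the standard library; the per-item error values and the function name are invented for illustration.

```python
from itertools import product
from typing import Sequence


def paired_sign_flip_p_value(a: Sequence[float], b: Sequence[float]) -> float:
    """Exact two-sided sign-flip permutation test on paired differences.

    Enumerates all 2**n sign assignments, so it is only practical for small n.
    """
    diffs = [x - y for x, y in zip(a, b)]
    observed = abs(sum(diffs))
    extreme = 0
    total = 0
    for signs in product((1, -1), repeat=len(diffs)):
        total += 1
        flipped = abs(sum(s * d for s, d in zip(signs, diffs)))
        # Small tolerance guards against floating-point rounding.
        if flipped >= observed - 1e-12:
            extreme += 1
    return extreme / total


# Invented per-item absolute errors against human MOS for two systems.
errors_new_model = [0.2, 0.3, 0.1, 0.4]
errors_baseline = [0.5, 0.6, 0.4, 0.9]

p_value = paired_sign_flip_p_value(errors_new_model, errors_baseline)
print(p_value)  # prints 0.125
```

With only four items the smallest achievable two-sided p-value is 2/16, which shows why test-set size matters when judging whether a "comparable or superior" result is robust.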
Circularity Check
No circularity: performance claims grounded in external human MOS comparisons
The paper introduces new four-axis annotation guidelines and trains no-reference models to predict aesthetic scores along those axes. Central claims rest on direct comparison of model outputs to human mean opinion scores (MOS) collected under the guidelines plus benchmarks against prior methods. This is standard external validation against human judgments rather than any self-definitional loop, fitted parameter renamed as prediction, or load-bearing self-citation. No equations, derivations, or citations in the provided text reduce the reported performance to the inputs by construction. The evaluation protocol remains falsifiable against held-out human data and existing baselines.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Human listening perspectives on audio quality can be decomposed into four distinct, consistent axes that generalize across speech, music, and sound.
Forward citations
Cited by 18 Pith papers
-
OmniNFT: Modality-wise Omni Diffusion Reinforcement for Joint Audio-Video Generation
OmniNFT introduces modality-wise advantage routing, layer-wise gradient surgery, and region-wise loss reweighting in an online diffusion RL framework to improve audio-video quality, alignment, and synchronization.
-
TMD-Bench: A Multi-Level Evaluation Paradigm for Music-Dance Co-Generation
TMD-Bench is a multi-level benchmark that measures music-dance co-generation quality including beat-level rhythmic synchronization, supported by a new dataset and Music Captioner, and shows commercial models lag in rh...
-
Hallo-Live: Real-Time Streaming Joint Audio-Video Avatar Generation with Asynchronous Dual-Stream and Human-Centric Preference Distillation
Hallo-Live achieves 20.38 FPS real-time text-to-audio-video avatar generation with 0.94s latency using asynchronous dual-stream diffusion and HP-DMD preference distillation, matching teacher model quality at 16x highe...
-
VidAudio-Bench: Benchmarking V2A and VT2A Generation across Four Audio Categories
VidAudio-Bench benchmarks V2A and VT2A models across four audio categories, revealing poor speech/singing performance and a tension between visual alignment and text instruction following.
-
MIDI-Informed Singing Accompaniment Generation in a Compositional Song Pipeline
MIDI-SAG generates consistent long-form singing accompaniments by feeding symbolic MIDI timing, chords, and structure labels into a compositional pipeline built from pre-trained modules.
-
Omni2Sound: Towards Unified Video-Text-to-Audio Generation
A single DiT-based diffusion model unifies video-to-audio, text-to-audio, and joint video-text-to-audio generation, supported by a new 470k-pair dataset and three-stage progressive training that resolves task competition.
-
AVI-Edit: Audio-sync Video Instance Editing with Granularity-Aware Mask Refiner
AVI-Edit enables precise audio-synchronized instance-level video editing via a granularity-aware mask refiner, a self-feedback audio agent, and a new large-scale annotated dataset.
-
VABench: A Comprehensive Benchmark for Audio-Video Generation
VABench is a new multi-dimensional benchmark for evaluating synchronous audio-video generation across text-to-AV, image-to-AV, and stereo tasks.
-
FSD50K-Solo: Automated Curation of Single-Source Sound Events
The authors present a scalable curation method that combines diffusion-based mixture synthesis with a discriminative classifier to automatically extract single-source sound events from FSD50K and release the cleaned F...
-
AuDirector: A Self-Reflective Closed-Loop Framework for Immersive Audio Storytelling
AuDirector is a self-reflective closed-loop multi-agent framework that generates immersive audio narratives with improved structural coherence, emotional expressiveness, and acoustic fidelity via identity-aware voice ...
-
VocalParse: Towards Unified and Scalable Singing Voice Transcription with Large Audio Language Models
VocalParse applies interleaved and Chain-of-Thought prompting to a Large Audio Language Model to jointly transcribe lyrics, melody and word-note alignments, achieving state-of-the-art results on multiple singing datasets.
-
JASTIN: Aligning LLMs for Zero-Shot Audio and Speech Evaluation via Natural Language Instructions
JASTIN is an instruction-driven audio evaluation system that achieves state-of-the-art correlation with human ratings on speech, sound, music, and out-of-domain tasks without task-specific retraining.
-
APEX: Large-scale Multi-task Aesthetic-Informed Popularity Prediction for AI-Generated Music
APEX jointly predicts engagement-based popularity and five aesthetic quality dimensions for AI-generated music, improving human preference prediction on out-of-distribution generative systems.
-
OmniHuman: A Large-scale Dataset and Benchmark for Human-Centric Video Generation
OmniHuman is a new large-scale multi-scene dataset with video-, frame-, and individual-level annotations for human-centric video generation, accompanied by the OHBench benchmark that adds metrics aligned with human pe...
-
SongBench: A Fine-Grained Multi-Aspect Benchmark for Song Quality Assessment
SongBench is a new fine-grained benchmark for song quality assessment with seven dimensions and an expert-annotated dataset of 11,717 samples showing high correlation with professional ratings.
-
Bridging What the Model Thinks and How It Speaks: Self-Aware Speech Language Models for Expressive Speech Generation
SA-SLM uses variational information bottleneck for intent-aware bridging and self-criticism for realization-aware alignment to close the semantic-acoustic gap, outperforming open-source models and nearing GPT-4o-Audio...
-
The Algorithmic Gaze of Image Quality Assessment: An Audit and Trace Ethnography of the LAION-Aesthetics Predictor
LAION-Aesthetics Predictor reinforces Western and male biases by preferentially selecting images associated with women and realistic Western/Japanese art while excluding men, LGBTQ+ references, and other styles.
-
Scaling Properties of Continuous Diffusion Spoken Language Models
Continuous diffusion spoken language models follow scaling laws for loss and phoneme divergence and generate emotive multi-speaker speech at 16B scale, though long-form coherence stays difficult.