Meta Audiobox Aesthetics: Unified automatic quality assessment for speech, music, and sound. arXiv preprint arXiv:2502.05139

Andros Tjandra, Yi-Chiao Wu, Baishan Guo, John Hoffman, Brian Ellis, Apoorv Vyas, Bowen Shi, Sanyuan Chen, Matt Le, Nick Zacharov, et al. · 2025 · arXiv 2502.05139

12 Pith papers cite this work. Polarity classification is still indexing.


representative citing papers

OmniNFT: Modality-wise Omni Diffusion Reinforcement for Joint Audio-Video Generation

cs.CV · 2026-05-12 · unverdicted · novelty 7.0

OmniNFT introduces modality-wise advantage routing, layer-wise gradient surgery, and region-wise loss reweighting in an online diffusion RL framework to improve audio-video quality, alignment, and synchronization.

TMD-Bench: A Multi-Level Evaluation Paradigm for Music-Dance Co-Generation

cs.SD · 2026-05-03 · unverdicted · novelty 7.0

TMD-Bench is a multi-level benchmark that measures music-dance co-generation quality including beat-level rhythmic synchronization, supported by a new dataset and Music Captioner, and shows commercial models lag in rhythm while a new baseline performs competitively.

Hallo-Live: Real-Time Streaming Joint Audio-Video Avatar Generation with Asynchronous Dual-Stream and Human-Centric Preference Distillation

cs.CV · 2026-04-26 · unverdicted · novelty 7.0

Hallo-Live achieves 20.38 FPS real-time text-to-audio-video avatar generation with 0.94s latency using asynchronous dual-stream diffusion and HP-DMD preference distillation, matching teacher model quality at 16x higher throughput.

VidAudio-Bench: Benchmarking V2A and VT2A Generation across Four Audio Categories

cs.SD · 2026-04-12 · unverdicted · novelty 7.0

VidAudio-Bench benchmarks V2A and VT2A models across four audio categories, revealing poor speech/singing performance and a tension between visual alignment and text instruction following.

AuDirector: A Self-Reflective Closed-Loop Framework for Immersive Audio Storytelling

cs.SD · 2026-05-12 · unverdicted · novelty 6.0

AuDirector is a self-reflective closed-loop multi-agent framework that generates immersive audio narratives with improved structural coherence, emotional expressiveness, and acoustic fidelity via identity-aware voice adaptation and iterative correction.

VocalParse: Towards Unified and Scalable Singing Voice Transcription with Large Audio Language Models

cs.SD · 2026-05-06 · unverdicted · novelty 6.0

VocalParse applies interleaved and Chain-of-Thought prompting to a Large Audio Language Model to jointly transcribe lyrics, melody and word-note alignments, achieving state-of-the-art results on multiple singing datasets.

JASTIN: Aligning LLMs for Zero-Shot Audio and Speech Evaluation via Natural Language Instructions

eess.AS · 2026-05-06 · unverdicted · novelty 6.0

JASTIN is an instruction-driven audio evaluation system that achieves state-of-the-art correlation with human ratings on speech, sound, music, and out-of-domain tasks without task-specific retraining.

APEX: Large-scale Multi-task Aesthetic-Informed Popularity Prediction for AI-Generated Music

cs.SD · 2026-05-05 · unverdicted · novelty 6.0

APEX jointly predicts engagement-based popularity and five aesthetic quality dimensions for AI-generated music, improving human preference prediction on out-of-distribution generative systems.

OmniHuman: A Large-scale Dataset and Benchmark for Human-Centric Video Generation

cs.CV · 2026-04-20 · unverdicted · novelty 6.0

OmniHuman is a new large-scale multi-scene dataset with video-, frame-, and individual-level annotations for human-centric video generation, accompanied by the OHBench benchmark that adds metrics aligned with human perception.

SongBench: A Fine-Grained Multi-Aspect Benchmark for Song Quality Assessment

eess.AS · 2026-04-16 · unverdicted · novelty 6.0

SongBench is a new fine-grained benchmark for song quality assessment with seven dimensions and an expert-annotated dataset of 11,717 samples showing high correlation with professional ratings.

Bridging What the Model Thinks and How It Speaks: Self-Aware Speech Language Models for Expressive Speech Generation

cs.CL · 2026-04-13 · unverdicted · novelty 6.0

SA-SLM uses a variational information bottleneck for intent-aware bridging and self-criticism for realization-aware alignment to close the semantic-acoustic gap, outperforming open-source models and nearing GPT-4o-Audio expressiveness on EchoMind after training on 800 hours of data.

Scaling Properties of Continuous Diffusion Spoken Language Models

cs.CL · 2026-04-27 · unverdicted · novelty 5.0

Continuous diffusion spoken language models follow scaling laws for loss and phoneme divergence and generate emotive multi-speaker speech at 16B scale, though long-form coherence stays difficult.

citing papers explorer

OmniNFT: Modality-wise Omni Diffusion Reinforcement for Joint Audio-Video Generation
cs.CV · 2026-05-12 · unverdicted · none · ref 17
TMD-Bench: A Multi-Level Evaluation Paradigm for Music-Dance Co-Generation
cs.SD · 2026-05-03 · unverdicted · none · ref 15
Hallo-Live: Real-Time Streaming Joint Audio-Video Avatar Generation with Asynchronous Dual-Stream and Human-Centric Preference Distillation
cs.CV · 2026-04-26 · unverdicted · none · ref 40
VidAudio-Bench: Benchmarking V2A and VT2A Generation across Four Audio Categories
cs.SD · 2026-04-12 · unverdicted · none · ref 50
AuDirector: A Self-Reflective Closed-Loop Framework for Immersive Audio Storytelling
cs.SD · 2026-05-12 · unverdicted · none · ref 30
VocalParse: Towards Unified and Scalable Singing Voice Transcription with Large Audio Language Models
cs.SD · 2026-05-06 · unverdicted · none · ref 32
JASTIN: Aligning LLMs for Zero-Shot Audio and Speech Evaluation via Natural Language Instructions
eess.AS · 2026-05-06 · unverdicted · none · ref 13
APEX: Large-scale Multi-task Aesthetic-Informed Popularity Prediction for AI-Generated Music
cs.SD · 2026-05-05 · unverdicted · none · ref 11
OmniHuman: A Large-scale Dataset and Benchmark for Human-Centric Video Generation
cs.CV · 2026-04-20 · unverdicted · none · ref 43
SongBench: A Fine-Grained Multi-Aspect Benchmark for Song Quality Assessment
eess.AS · 2026-04-16 · unverdicted · none · ref 19
Bridging What the Model Thinks and How It Speaks: Self-Aware Speech Language Models for Expressive Speech Generation
cs.CL · 2026-04-13 · unverdicted · none · ref 22
Scaling Properties of Continuous Diffusion Spoken Language Models
cs.CL · 2026-04-27 · unverdicted · none · ref 68
