pith. machine review for the scientific record.

cs.SD

Sound

Covers all aspects of computing with sound, and sound as an information channel. Includes models of sound, analysis and synthesis, audio user interfaces, sonification of data, computer music, and sound signal processing. Includes ACM Subject Class H.5.5, and intersects with H.1.2, H.5.1, H.5.2, I.2.7, I.5.4, I.6.3, J.5, K.4.2.

cs.SD 2026-05-13 Recognition

Hybrid Whisper model detects speaker confidence at 0.75 Macro-F1

A Semi-Supervised Framework for Speech Confidence Detection using Whisper

Adding prosodic features and careful pseudo-labels beats pure self-supervised audio models on scarce labelled data.

Abstract:
Automatic detection of speaker confidence is critical for adaptive computing but remains constrained by limited labelled data and the subjectivity of paralinguistic annotations. This paper proposes a semi-supervised hybrid framework that fuses deep semantic embeddings from the Whisper encoder with an interpretable acoustic feature vector composed of eGeMAPS descriptors and auxiliary probability estimates of vocal stress and disfluency. To mitigate reliance on scarce ground truth data, we introduce an Uncertainty-Aware Pseudo-Labelling strategy where a model generates labels for unlabelled data, retaining only high-quality samples for training. Experimental results demonstrate that the proposed approach achieves a Macro-F1 score of 0.751, outperforming self-supervised baselines, including WavLM, HuBERT, and Wav2Vec 2.0. The hybrid architecture also surpasses the unimodal Whisper baseline, yielding a 3% improvement in the minority class, confirming that explicit prosodic and auxiliary features provide necessary corrective signals which are otherwise lost in deep semantic representations. Ablation studies further show that a curated set of high confidence pseudo-labels outperforms indiscriminate large scale augmentation, confirming that data quality outweighs quantity for perceived confidence detection.
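
To make the pseudo-label selection step concrete, here is a minimal sketch of confidence-thresholded filtering; the 0.9 threshold and the max-probability confidence measure are illustrative assumptions, since the abstract does not spell out the exact uncertainty criterion.

```python
# Minimal sketch of confidence-thresholded pseudo-labelling: keep only unlabelled
# samples whose predicted class probability clears a threshold (the threshold and
# the max-probability confidence measure are illustrative assumptions).
import numpy as np

def select_pseudo_labels(probs: np.ndarray, threshold: float = 0.9):
    """probs: (n_samples, n_classes) softmax outputs from the current model."""
    confidence = probs.max(axis=1)        # per-sample confidence
    labels = probs.argmax(axis=1)         # hard pseudo-labels
    keep = confidence >= threshold        # retain only high-confidence samples
    return np.flatnonzero(keep), labels[keep]

# Example: five unlabelled clips, two classes (confident / not confident)
probs = np.array([[0.97, 0.03], [0.55, 0.45], [0.10, 0.90],
                  [0.80, 0.20], [0.05, 0.95]])
idx, pseudo = select_pseudo_labels(probs)
print(idx, pseudo)   # -> [0 2 4] [0 1 1]
```
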
cs.SD 2026-05-13 Recognition

Singing conversion works on accompanied tracks without vocal cleanup

Poly-SVC: Polyphony-Aware Singing Voice Conversion with Harmonic Modeling

CQT extraction keeps lead melody plus residual harmonies; a flow-matching decoder rebuilds natural polyphonic output in the target voice.

Abstract:
Singing Voice Conversion (SVC) aims to transform a source singing voice into a target singer while preserving lyrics and melody. Most existing SVC methods depend on F0 extractors to capture the lead melody from clean vocals. However, no existing method can reliably extract clean vocals from accompanied recordings without leaving residual harmonies behind. In this paper, we innovatively propose Poly-SVC, a zero-shot, cross-lingual singing voice conversion system designed to process residual harmonies. Poly-SVC is composed of three key components: a Constant-Q Transform (CQT)-based pitch extractor to preserve both the lead melody and residual harmony, a random sampler to reduce interference information from the CQT and a diffusion decoder based on Conditional Flow Matching (CFM) that fuses pitch, content, and timbre features into natural-sounding polyphonic outputs. Experiments demonstrate that Poly-SVC surpasses the baseline models in naturalness, timbre similarity and harmony reconstruction across both harmony-rich and single-melody recordings.
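
As a rough illustration of the CQT front end described above, the following sketch computes a constant-Q representation with librosa; the file name, frequency range, and bin counts are assumptions, not the paper's configuration.

```python
# Sketch of a CQT front end for accompanied vocals; file name and analysis
# parameters are illustrative, not the paper's configuration.
import numpy as np
import librosa

y, sr = librosa.load("accompanied_take.wav", sr=44100, mono=True)  # hypothetical file
C = librosa.cqt(y, sr=sr, hop_length=512,
                fmin=librosa.note_to_hz("C2"),   # ~65 Hz lower bound
                n_bins=72, bins_per_octave=12)   # 6 octaves, semitone resolution
C_db = librosa.amplitude_to_db(np.abs(C), ref=np.max)
# C_db keeps both the lead melody and residual harmony as salient bin tracks,
# which a downstream model can consume instead of a single F0 contour.
print(C_db.shape)  # (72, n_frames)
```
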
cs.SD 2026-05-13 Recognition

STRUM turns raw audio into playable rhythm charts at 0.84 F1 for drums

STRUM: A Spectral Transcription and Rhythm Understanding Model for End-to-End Generation of Playable Rhythm-Game Charts

Hybrid neural pipeline handles drums, guitar, bass, vocals, and keys without metadata or manual timing fixes.

Abstract:
We present STRUM (Spectral Transcription and Rhythm Understanding Model), an audio-to-chart pipeline that converts raw recordings into playable Clone Hero / YARG charts for drums, guitar, bass, vocals, and keys without any oracle metadata. STRUM is a multi-stage hybrid: a two-stage CRNN onset detector and a six-model ensemble classifier for drums; neural onset detectors with monophonic pitch tracking for guitar and bass; word-aligned ASR for vocals; and spectral keyboard detection for keys. We evaluate on a 30-song in-envelope benchmark constructed by screening candidate songs on a single audio-quality criterion -- the median 1-second drum-stem RMS after htdemucs_6s source separation. On this benchmark STRUM achieves drums onset F1 = 0.838, bass F1 = 0.694, guitar F1 = 0.651, and vocals F1 = 0.539 at a +/- 100 ms tolerance with per-song global offset search. We report a complete ablation of seven drum-pipeline components with paired per-song Wilcoxon tests, an analysis of ground-truth-to-audio timing distributions in community Clone Hero charts, and a per-class confusion matrix for the drum classifier. Code, model weights, and the full benchmark manifest are released.
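
The evaluation protocol (onset F1 at +/-100 ms with a per-song global offset search) can be sketched as follows; the greedy matching and the offset grid are assumptions rather than the authors' exact implementation.

```python
# Sketch of onset F1 at +/-100 ms with a per-song global offset search; the greedy
# matching and the offset grid are assumptions, not the authors' exact code.
import numpy as np

def onset_f1(ref, est, tol=0.100):
    ref, est = sorted(ref), sorted(est)
    used = [False] * len(est)
    tp = 0
    for r in ref:
        best, best_d = -1, tol
        for j, e in enumerate(est):
            if not used[j] and abs(e - r) <= best_d:
                best, best_d = j, abs(e - r)
        if best >= 0:
            used[best] = True
            tp += 1
    p = tp / len(est) if est else 0.0
    r = tp / len(ref) if ref else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0

def best_offset_f1(ref, est, search=0.5, step=0.01):
    """Shift the estimated onsets by one global offset per song and keep the best F1."""
    return max(onset_f1(ref, [e + o for e in est])
               for o in np.arange(-search, search + step, step))

print(best_offset_f1([0.50, 1.00, 1.50], [0.62, 1.12, 1.63]))  # ~1.0 after a -0.12 s shift
```
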
cs.SD 2026-05-13 2 theorems

Closed-loop AI boosts quality of long audio stories

AuDirector: A Self-Reflective Closed-Loop Framework for Immersive Audio Storytelling

Identity checks, automatic fixes, and natural language feedback help the system outperform baselines on coherence and expressiveness.

Abstract:
Despite advances in text and visual generation, creating coherent long-form audio narratives remains challenging. Existing frameworks often exhibit limitations such as mismatched character settings with voice performance, insufficient self-correction mechanisms, and limited human interactivity. To address these challenges, we propose AuDirector, a self-reflective closed-loop multi-agent framework. Specifically, it involves an Identity-Aware Pre-production mechanism that transforms narrative texts into character profiles and utterance-level emotional instructions to retrieve suitable voice candidates and guide expressive speech synthesis, thereby promoting context-aligned voice adaptation. To enhance quality, a Collaborative Synthesis and Correction module introduces a closed-loop self-correction mechanism to systematically audit and regenerate defective audio components. Furthermore, a Human-Guided Interactive Refinement module facilitates user control by interpreting natural language feedback to interactively refine the underlying scripts. Experiments demonstrate that AuDirector achieves superior performance compared to state-of-the-art baselines in structural coherence, emotional expressiveness, and acoustic fidelity. Audio samples can be found at https://anonymous-itsh.github.io/.
cs.SD 2026-05-12 2 theorems

Token swaps edit speaker identity in compressed audio

Exploring Token-Space Manipulation in Latent Audio Tokenizers

A tokenizer that keeps only fixed learnable global tokens maintains low-bitrate reconstruction while allowing simple unsupervised changes to global attributes such as speaker identity and background noise.

Abstract:
Neural audio codecs provide compact discrete representations for speech generation and manipulation. However, most codecs organize tokens as frame-level sequences, making it difficult to study or intervene on global factors of variation. In this work, we propose the Latent Audio Tokenizer for Token-space Editing (LATTE) that appends a fixed set of learnable latent tokens to the audio feature sequence and retains only these tokens for quantization and decoding. This design produces a compact, non-temporally aligned bottleneck in which each token can aggregate global information across the full utterance. We show that the resulting tokenizer preserves competitive reconstruction quality in low-bitrate speech coding settings while enabling simple token-space interventions. In particular, we find that swapping selected latent token positions between utterances can modify global attributes, such as speaker identity and background noise, and we evaluate these interventions on voice conversion and denoising tasks. Our results suggest that compact latent audio tokenizers can support controllable audio manipulation without supervision in task-specific editing models.
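
A minimal sketch of the token-swap intervention, assuming the latent tokens form a short fixed-length sequence of discrete indices per utterance; which positions carry speaker identity is hypothetical here and would be found by probing.

```python
# Sketch of a token-space intervention: swap selected global latent token positions
# between two utterances before decoding; the speaker-identity positions are
# hypothetical placeholders.
import numpy as np

def swap_tokens(tokens_a, tokens_b, positions):
    """tokens_*: (n_latent_tokens,) discrete code indices for one utterance."""
    a, b = tokens_a.copy(), tokens_b.copy()
    a[positions], b[positions] = tokens_b[positions], tokens_a[positions]
    return a, b

src = np.array([12, 7, 33, 90, 4, 58, 21, 66])   # e.g. eight global latent tokens
tgt = np.array([48, 7, 11, 90, 73, 2, 21, 19])
converted, _ = swap_tokens(src, tgt, positions=[0, 4, 5])
print(converted)  # [48  7 33 90 73  2 21 66]; decoding it would keep content but shift identity
```
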
cs.SD 2026-05-12 2 theorems

Codec keeps emotional cues in compressed speech

AffectCodec: Emotion-Preserving Neural Speech Codec for Expressive Speech Modeling

Emotion-semantic guidance during quantization improves expressiveness for downstream speech tasks while holding content steady.

Abstract:
Neural speech codecs provide discrete representations for speech language models, but emotional cues are often degraded during quantization. Existing codecs mainly optimize acoustic reconstruction, leaving emotion expressiveness insufficiently modeled at the representation level. We propose an emotion-guided neural speech codec that explicitly preserves emotional information while maintaining semantic fidelity and prosodic naturalness. Our framework combines emotion-semantic guided latent modulation, relation-preserving emotional-semantic distillation, and emotion-weighted semantic alignment to retain emotionally salient cues under compression. Extensive evaluations across speech reconstruction, emotion recognition, and downstream text-to-speech generation demonstrate improved emotion consistency and perceptual quality without sacrificing content accuracy.
cs.SD 2026-05-12 2 theorems

Multi-layer attention probes improve bioacoustic encoder evaluation

Multi-layer attentive probing improves transfer of audio representations for bioacoustics

Standard last-layer linear probes can underestimate how well pretrained audio models perform on nature-sound tasks.

Abstract:
Probing heads map the representations learned from audio by a machine learning model to downstream task labels and are a key component in evaluating representation learning. Most bioacoustic benchmarks use a fixed, low-capacity probe, such as a linear layer on the final encoder layer. While this standardization enables model comparisons, it may bias results by overlooking the interaction between encoder features and probe design. In this work, we systematically study different probing strategies across two bioacoustic benchmarks, BEANs and BirdSet. We evaluate last- and multi-layer probing, across linear and attention probes. We show that larger probe heads that leverage time information have superior performance. Our results suggest that current benchmarks may misrepresent encoder quality when relying on a last-layer probing setup. Multi-layer probing improves downstream task performance across all tested models, while attention probing has superior performance to linear probing for transformer models.
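
A minimal sketch of what a multi-layer attentive probe can look like, assuming a frozen transformer encoder whose per-layer hidden states are stacked; the benchmarks' exact probe heads may differ.

```python
# Minimal multi-layer attentive probe: a learned softmax weighting over encoder
# layers plus attention pooling over time, then a linear classifier on top of a
# frozen encoder's stacked hidden states.
import torch
import torch.nn as nn

class MultiLayerAttentiveProbe(nn.Module):
    def __init__(self, n_layers, dim, n_classes):
        super().__init__()
        self.layer_weights = nn.Parameter(torch.zeros(n_layers))  # softmax-normalised
        self.attn = nn.Linear(dim, 1)                              # frame-level attention
        self.classifier = nn.Linear(dim, n_classes)

    def forward(self, hidden_states):        # (n_layers, batch, time, dim)
        w = torch.softmax(self.layer_weights, dim=0)
        x = (w.view(-1, 1, 1, 1) * hidden_states).sum(dim=0)      # weighted layer mix
        a = torch.softmax(self.attn(x), dim=1)                    # (batch, time, 1)
        pooled = (a * x).sum(dim=1)                                # attention pooling over time
        return self.classifier(pooled)

probe = MultiLayerAttentiveProbe(n_layers=13, dim=768, n_classes=10)
feats = torch.randn(13, 4, 250, 768)      # e.g. 13 transformer layers, 4 clips, 250 frames
print(probe(feats).shape)                  # torch.Size([4, 10])
```
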
cs.SD 2026-05-12 Recognition

Transformer predicts codec tokens to synthesize drums from MIDI grids

Drum Synthesis from Expressive Drum Grids via Neural Audio Codecs

Expressive drum grids are mapped to neural audio code sequences and decoded into waveforms, tested across multiple tokenizers on human data.

Abstract:
Generating realistic drum audio directly from symbolic representations is a challenging task at the intersection of music perception and machine learning. We propose a system that transforms an expressive drum grid, a time-aligned MIDI representation with microtiming and velocity information, into drum audio by predicting discrete codes of a neural audio codec. Our approach uses a Transformer-based model to map the drum grid input to a sequence of codec tokens, which are then converted to waveform audio via a pre-trained codec decoder. We experiment with multiple state-of-the-art neural codecs, namely EnCodec, DAC, and X-Codec, to assess how the choice of audio representation impacts the quality of the generated drums. The system is trained and evaluated on the Expanded Groove MIDI Dataset, E-GMD, a large collection of human drum performances with paired MIDI and audio. We evaluate the fidelity and musical alignment of the generated audio using objective metrics. Overall, our results establish codec-token prediction as an effective route for drum grid-to-audio generation and provide practical insights into selecting audio tokenizers for percussive synthesis.
cs.SD 2026-05-12 1 theorem

Cold diffusion cleans reverb from drum signals

A Cold Diffusion Approach for Percussive Dereverberation

Modeling reverberation as deterministic degradation recovers sharp transients better than baselines on in- and out-of-domain tests.

Abstract:
Most recent advances in audio dereverberation focus almost exclusively on speech, leaving percussive and drum signals largely unexplored despite their importance in music production. Percussive dereverberation poses distinct challenges due to sharp transients and dense temporal structure. In this work, we propose a cold diffusion framework for dereverberating stereo drum stems (downmixes), modeling reverberation as a deterministic degradation process that progressively transforms anechoic signals into reverberant ones. We investigate two reverse-process parameterizations, Direct (next-state) and a Delta-normalized residual (velocity-style) prediction, and implement the framework using both a UNet and a diffusion Transformer backbone. The models are trained and evaluated on curated datasets comprising both acoustic and electronic drum recordings, with reverberation generated using a combination of synthetic and real room impulse responses. Extensive experiments on in-domain and fully out-of-domain test sets demonstrate that the proposed method consistently outperforms strong score-based and conditional diffusion baselines, evaluated using signal-based and perceptual metrics tailored to percussive audio.
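
The cold diffusion mechanism can be sketched generically as below, with reverberation as the deterministic degradation; the linear blending schedule, the direct x0 prediction, and the oracle restorer in the toy check are placeholders, not the paper's models.

```python
# Generic cold diffusion reverse loop with reverberation as the deterministic
# degradation; schedule, parameterization, and the oracle restorer are placeholders.
import numpy as np

def degrade(x0, t, T, rir):
    """Blend clean audio toward its fully reverberant version as t grows."""
    wet = np.convolve(x0, rir)[: len(x0)]
    alpha = t / T
    return (1.0 - alpha) * x0 + alpha * wet

def cold_diffusion_dereverb(x_T, T, restore_net, rir):
    x_t = x_T
    for t in range(T, 0, -1):
        x0_hat = restore_net(x_t, t)      # network predicts the anechoic signal
        x_t = x_t - degrade(x0_hat, t, T, rir) + degrade(x0_hat, t - 1, T, rir)
    return x_t

# Toy check with an oracle restorer (in practice restore_net is a trained UNet / DiT):
rir = np.array([1.0, 0.0, 0.6, 0.0, 0.3])
clean = np.random.randn(16000)
oracle = lambda x_t, t: clean
recovered = cold_diffusion_dereverb(degrade(clean, 8, 8, rir), 8, oracle, rir)
print(np.allclose(recovered, clean))       # True: the loop walks back to the clean signal
```
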
cs.SD 2026-05-12 2 theorems

Acoustic priors sharpen timbre edits in polyphonic music

Polyphonia: Zero-Shot Timbre Transfer in Polyphonic Music with Acoustic-Informed Attention Calibration

Polyphonia raises target stem alignment by 15.5% in zero-shot transfers while keeping background intact.

Abstract:
The advancement of diffusion-based text-to-music generation has opened new avenues for zero-shot music editing. However, existing methods fail to achieve stem-specific timbre transfer, which requires altering specific stems while strictly preserving the background accompaniment. This limitation severely hinders practical application, since real-world production necessitates precise manipulation of components within dense mixtures. Our key finding is that, while vanilla cross-attention captures semantic features of stems, it lacks the spectral resolution to strictly localize targets in dense mixtures, leading to boundary leakage. To resolve this dilemma, we propose Polyphonia, a zero-shot editing framework with Acoustic-Informed Attention Calibration. Rather than relying solely on diffuse semantic attention, Polyphonia leverages a probabilistic acoustic prior to establish coarse boundaries, enabling precise semantic synthesis while preserving non-target stems. For evaluation, we propose PolyEvalPrompts, a standardized prompt set with 1,170 timbre transfer tasks in polyphonic music. Specifically, Polyphonia achieves an increase of 15.5% in target alignment compared to baselines, while maintaining competitive music fidelity and non-target integrity.
cs.SD 2026-05-12 3 theorems

APEX explains audio classifiers with four prototype views

APEX: Audio Prototype EXplanations for Classification Tasks

Post-hoc method creates intuitive example-based explanations for sound models by separating time and frequency aspects without retraining.

Abstract:
Explainable AI (XAI) has achieved remarkable success in image classification, yet the audio domain lacks equally mature solutions. Current methods apply vision-based attribution techniques to spectrograms, overlooking fundamental differences between visual and acoustic signals. While prototype reasoning is promising, acoustic similarity remains multidimensional. We introduce APEX (Audio Prototype EXplanations), a post-hoc framework for interpreting pre-trained audio classifiers. Crucially, APEX requires no fine-tuning of the original backbone and strictly preserves output invariance. APEX disentangles explanations into four perspectives: Square-based prototypes to localize transient events, Time-based for temporal patterns, Frequency-based highlighting spectral bands, and Time-Frequency-based integrating both. This yields intuitive, example-based explanations that respect acoustic properties, providing greater semantic clarity than standard gradient-based methods.
cs.SD 2026-05-12 1 theorem

System turns Chladni patterns into real-time sound with 99% accuracy

ChladniSonify: A Visual-Acoustic Mapping Method for Chladni Patterns in New Media Art Creation

New method classifies vibration visuals and maps them to exact frequencies in under 50 milliseconds for interactive art.

Abstract:
In new media art creation, the mapping between vision and hearing is often subjective. As a classic carrier of sound visualization, Chladni patterns have great potential in building audio-visual mapping mechanisms. However, existing tools face pain points: high technical barriers for simulation, offline computing failing real-time interaction, and uncontrollable mapping rules in general sonification tools. To address these, this paper proposes ChladniSonify, a real-time visual-acoustic mapping method for Chladni patterns. Based on Kirchhoff-Love plate theory, we build a paired dataset via numerical programming and calibrate it using ANSYS finite element simulation. Focusing on the slender nodal lines of Chladni patterns, we adopt a lightweight CNN with CBAM to achieve high-precision, low-latency pattern classification. Finally, we build an end-to-end system in Python and Max/MSP, mapping recognized patterns to corresponding sine wave frequencies. Results show the system has excellent usability: the classification module achieves 99.33% accuracy on the test set with 7.03 ms inference latency; the mapped frequency matches the theoretical value with zero deviation; the average end-to-end latency is under 50 ms, meeting real-time interactive needs. This work provides a reproducible engineering prototype for Chladni audio-visual art creation.
cs.SD 2026-05-11 2 theorems

Joint diffusion remixes timbres across all stems in a mixture at once

Remix the Timbre: Diffusion-Based Style Transfer Across Polyphonic Stems

MixtureTT avoids separation errors, reduces cost by the stem count, and improves coherence by modeling cross-stem harmonics together on SATB choral data.

Abstract:
Timbre transfer aims to modify the timbral identity of a musical recording while preserving the original melody and rhythm. While single-instrument timbre transfer has made substantial progress, existing approaches to multi-instrument settings rely on separate-then-transfer pipelines that propagate source separation artifacts and produce incoherent synthesized timbres across stems. This paper proposes MixtureTT, to the best of our knowledge the first system for flexible per-stem timbre transfer directly from a polyphonic mixture. Given a mixture and a separate timbre reference for each target voice, MixtureTT jointly transfers all stems to the specified instruments through a shared diffusion process. Modeling the dependencies across the per-stem content and cross-stem harmonic, the proposed joint stem diffusion transformer eliminates cascaded separation error, reduces inference cost by a factor equal to the number of stems, and yields more coherent multi-stem outputs. Despite operating under a strictly harder input condition, evaluations on the SATB choral dataset show that MixtureTT outperforms single-instrument baselines on both objective and subjective metrics demonstrating the necessity of dedicated multi-instrument timbre transfer over the naive separate-then-transfer pipelines. As a result, this work confirms that the cross-stem modeling is essential for mixture-level timbre transfer as the proposed joint setting consistently exceeds an equivalent single-stem ablation.
cs.SD 2026-05-11 1 theorem

Per-gender thresholds cut deepfake detector bias by 54-75%

Towards Trustworthy Audio Deepfake Detection: A Systematic Framework for Diagnosing and Mitigating Gender Bias

Diagnosis shows sources are acoustic differences and feature leakage, not data imbalance, so targeted fixes preserve accuracy.

Abstract:
Audio deepfake detection systems are increasingly deployed in high-stakes security applications, yet their fairness across demographic groups remains critically underexamined. Prior work measures gender disparity but does not investigate where it comes from or how to fix it systematically. We present the first diagnosis-first framework that identifies bias source before applying targeted mitigation, evaluated on two models, AASIST and Wav2Vec2+ResNet18, on ASVSpoof5. Our diagnosis shows that bias does not stem from imbalanced training data but from acoustic representation differences, gender leakage in learned features, and structural evaluation asymmetry. We test mitigation strategies across in-processing, post-processing and combined families, including novel methods introduced in this work. Adjusting the decision threshold separately per gender reduces unfairness by 54% to 75% at no cost to detection accuracy, and our new epoch-level fairness regularisation method outperforms existing per-batch approaches. Adversarial debiasing succeeds only when gender leakage is localised, and fails when it is diffuse, an outcome correctly predicted by our diagnosis before training. No single method fully closes the fairness gap, confirming that bias sources must be identified before fixes are applied and that fairer benchmark design is equally important.
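
A minimal sketch of per-group threshold calibration, assuming each gender's threshold is set at its equal-error-rate point; the paper's actual operating criterion and score conventions may differ.

```python
# Sketch of per-gender decision thresholds: for each group, pick the threshold where
# false-acceptance and false-rejection rates cross (the EER point).
import numpy as np

def eer_threshold(scores, labels):
    """scores: higher = more likely spoof; labels: 1 = spoof, 0 = bona fide."""
    best_t, best_gap = scores.min(), 1.0
    for t in np.sort(scores):
        far = np.mean(scores[labels == 0] >= t)   # bona fide flagged as spoof
        frr = np.mean(scores[labels == 1] < t)    # spoof missed
        if abs(far - frr) < best_gap:
            best_gap, best_t = abs(far - frr), t
    return best_t

def per_group_thresholds(scores, labels, groups):
    return {g: eer_threshold(scores[groups == g], labels[groups == g])
            for g in np.unique(groups)}

rng = np.random.default_rng(0)
scores = rng.normal(loc=np.array([0.2, 0.8, 0.4, 1.0]), scale=0.3, size=(500, 4)).ravel()
labels = np.tile([0, 1, 0, 1], 500)
groups = np.tile(["f", "f", "m", "m"], 500)
print(per_group_thresholds(scores, labels, groups))   # roughly {'f': ~0.5, 'm': ~0.7}
```
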
cs.SD 2026-05-11 Recognition

Audio-first search benchmark caps top model at 43%

Omni-DeepSearch: A Benchmark for Audio-Driven Omni-Modal Deep Search

Models must extract clues from sound, then retrieve and reason across text, images, and video to reach verifiable answers.

Abstract:
Current omni-modal benchmarks mainly evaluate models under settings where multiple modalities are provided simultaneously, while the ability to start from audio alone and actively search for cross-modal evidence remains underexplored. In this paper, we introduce Omni-DeepSearch, a benchmark for audio-driven omni-modal deep search. Given one or more audio clips and a related question, models must infer useful clues from audio, invoke text, image, and video search tools, and perform multi-hop reasoning to produce a short, objective, and verifiable answer. Omni-DeepSearch contains 640 samples across 15 fine-grained categories, covering four retrieval target modalities and four audio content types. A multi-stage filtering pipeline ensures audio dependence, retrieval necessity, visual modality necessity, and answer uniqueness. Experiments on recent closed-source and open-source omni-modal models show that this task remains highly challenging: the strongest evaluated model, Gemini-3-Pro, achieves only 43.44% average accuracy. Further analyses illustrate key bottlenecks in audio entity inference, query formulation, tool-use reliability, multi-hop retrieval, and cross-modal verification. These results highlight audio-driven omni-modal deep search as an important and underexplored direction for future multimodal agents.
cs.SD 2026-05-11 2 theorems

DP segments audio to reset beamformer covariance on the fly

Online Segmented Beamforming via Dynamic Programming

Causal minimization of output power lets the algorithm adapt covariance windows to local stationarity instead of smearing across changes.

Abstract:
In dynamic acoustic environments characterized by time-varying interferers and moving sources, effective beamforming requires accurately identifying stationary regions over time. Traditional Capon beamformers rely on the instantaneous ensemble covariance matrix, which is inaccessible in practice. Practical implementations overcome this by estimating the sample covariance matrix (SCM) through averaging over a block of temporal samples. However, in non-stationary settings, a naive batch approach fails. Moving interferers smear the SCM, causing the beamformer to place nulls in outdated locations while failing to track newly active interferers, thereby degrading its nulling capabilities. To address this fundamental limitation, an Online Segmented Beamformer is proposed. This algorithm incorporates data-driven temporal segmentation to causally minimize output power while dynamically adapting the SCM estimation windows to local stationarity. By framing the problem through the lens of dynamic programming, the proposed method tracks abrupt environmental changes and resets covariance estimates in real-time. We validate the performance of this framework in a complex, reverberant simulated acoustic environment and in highly reverberant real world experiments, demonstrating its superiority over fixed-window adaptive methods.
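
The dynamic-programming view can be sketched with the classic optimal-segmentation recursion below; the variance-based segment cost is a stand-in for the beamformer output power the paper minimizes, and the penalty term is an assumption.

```python
# Sketch of dynamic-programming segmentation: choose boundaries that minimize the
# sum of per-segment costs plus a per-segment penalty. The variance cost is a
# placeholder for the Capon output power computed from each segment's SCM.
import numpy as np

def segment_cost(x, i, j):
    seg = x[i:j]
    return float(np.sum((seg - seg.mean()) ** 2))   # placeholder stationarity cost

def dp_segment(x, penalty=5.0):
    n = len(x)
    best = np.full(n + 1, np.inf)
    best[0] = 0.0
    prev = np.zeros(n + 1, dtype=int)
    for j in range(1, n + 1):
        for i in range(j):
            c = best[i] + segment_cost(x, i, j) + penalty
            if c < best[j]:
                best[j], prev[j] = c, i
    cuts, j = [], n
    while j > 0:                                      # backtrack the chosen boundaries
        cuts.append(j)
        j = prev[j]
    return sorted(cuts)

x = np.concatenate([np.random.randn(50), np.random.randn(50) + 6.0])
print(dp_segment(x, penalty=25.0))    # typically [50, 100]: one cut at the change point
```
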
cs.SD 2026-05-11 3 theorems

Unsupervised tokens split bee hives by queen status

BeeVe: Unsupervised Acoustic State Discovery in Honey Bee Buzzing

A label-free autoencoder on hive audio finds repeatable codes that mark queenright versus queenless conditions and stable sub-states within the queenless condition.

Abstract:
Discovering structure in biological signals without supervision is a fundamental problem in computational intelligence, yet existing bioacoustic methods assume vocal production models or predefined semantic units, leaving non-vocal species poorly served. This work introduces BeeVe, an unsupervised framework for acoustic state discovery in collective honey bee buzzing. BeeVe uses the self-supervised Patchout Spectrogram Transformer (PaSST) as a frozen feature extractor, then trains a Vector-Quantized Variational Autoencoder (VQ-VAE) without labels on those embeddings, learning a finite discrete codebook of acoustic tokens directly from unlabelled hive audio. No labels, pretext tasks, or contrastive objectives are used at any stage. Post-hoc evaluation against known queen status reveals that the learned tokens separate queenright and queenless conditions with Jensen-Shannon Divergence values between 0.609 and 0.688, and that the queenless condition further decomposes into three internally coherent sub-states stable across experiments with different codebook sizes and random seeds. Token transition analysis confirms non-random sequential structure (p << 0.001) across all experiments. Generalisation to unseen recordings preserves both token overlap (Jaccard = 0.947) and global manifold topology. These results demonstrate that unsupervised discrete codebook learning can recover repeatable acoustic structure from a non-vocal biological signal without annotation, opening a path toward non-invasive acoustic hive health monitoring.
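
A small sketch of the post-hoc comparison, assuming token usage per condition is summarized as codebook histograms and compared with a base-2 Jensen-Shannon divergence (so values fall in [0, 1], consistent with the reported 0.609-0.688 range); the synthetic histograms are placeholders.

```python
# Sketch: compare discrete-token usage between two hive conditions with a base-2
# Jensen-Shannon divergence over codebook histograms (values in [0, 1]).
import numpy as np

def js_divergence(p, q, eps=1e-12):
    p = p / p.sum()
    q = q / q.sum()
    m = 0.5 * (p + q)
    kl = lambda a, b: np.sum(a * np.log2((a + eps) / (b + eps)))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

codebook_size = 64
queenright = np.bincount(np.random.randint(0, 32, 5000), minlength=codebook_size)
queenless = np.bincount(np.random.randint(16, 64, 5000), minlength=codebook_size)
print(js_divergence(queenright.astype(float), queenless.astype(float)))  # ~0.6 for this overlap
```
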
cs.SD 2026-05-11 2 theorems

Multi-scale dilated encoder improves closed-set speaker ID

TARNet: A Temporal-Aware Multi-Scale Architecture for Closed-Set Speaker Identification

Stage-specific dilations combine short- to long-term voice traits into embeddings that beat prior networks on VoxCeleb1 and LibriSpeech at competitive computational cost.

Abstract:
Closed-Set speaker identification aims to assign a speech utterance to one of a predefined set of enrolled speakers and requires robust modeling of speaker-specific characteristics across multiple temporal scales. While recent deep learning approaches have achieved strong performance, many existing architectures provide limited mechanisms for modeling temporal dependencies across different time scales, which can restrict the effective use of complementary short-, mid-, and long-term speaker characteristics. In this paper, we propose TARNet, a lightweight Temporal-Aware Representation Network for closed-set speaker identification. TARNet explicitly models temporal information at multiple time scales using a multi-stage temporal encoder with stage-specific dilation configurations. The resulting multi-scale representations are fused and aggregated via an Attentive Statistics Pooling (ASP) module to produce a discriminative utterance-level speaker embedding. Experiments on the VoxCeleb1 and LibriSpeech datasets show that TARNet outperforms state-of-the-art methods while maintaining competitive computational complexity, making it suitable for practical speaker identification systems. The code is publicly available at https://github.com/YassinTERRAF/TARNet.
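
A minimal sketch of the Attentive Statistics Pooling step in its standard form; TARNet's surrounding multi-scale dilated encoder and any extensions to the pooling head are not reproduced here.

```python
# Minimal Attentive Statistics Pooling (ASP): attention-weighted mean and standard
# deviation over time, concatenated into an utterance-level embedding.
import torch
import torch.nn as nn

class AttentiveStatsPooling(nn.Module):
    def __init__(self, dim, hidden=128):
        super().__init__()
        self.attention = nn.Sequential(
            nn.Conv1d(dim, hidden, kernel_size=1), nn.Tanh(),
            nn.Conv1d(hidden, dim, kernel_size=1), nn.Softmax(dim=2))

    def forward(self, x):                       # x: (batch, dim, time)
        w = self.attention(x)                   # per-frame, per-channel weights
        mu = torch.sum(w * x, dim=2)
        var = torch.sum(w * x ** 2, dim=2) - mu ** 2
        sigma = torch.sqrt(var.clamp(min=1e-8))
        return torch.cat([mu, sigma], dim=1)    # (batch, 2 * dim)

pool = AttentiveStatsPooling(dim=256)
frames = torch.randn(4, 256, 300)               # four utterances, 300 frames each
print(pool(frames).shape)                        # torch.Size([4, 512])
```
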
cs.SD 2026-05-11 Recognition

Decomposed stages yield better chord variety and rules compliance

A Decomposed Retrieval-Edit-Rerank Framework for Chord Generation

Splitting retrieval, editing and reranking produces chords that are more diverse yet more musically valid than single-model systems.

Abstract:
Chord generation is an inherently constrained creative task that requires balancing stylistic diversity with music-theoretic feasibility. Existing approaches typically entangle candidate generation and constraint enforcement within a single model, making the diversity-feasibility trade-off difficult to control and interpret. In this work, we approach chord generation from a system-level perspective, introducing a Retrieval-Edit-Rerank (RER) framework that decomposes the task into three explicit stages: i) retrieval, which defines a stylistically plausible candidate space; ii) editing, which enforces music-theoretic feasibility through minimal modifications; and iii) reranking, which resolves soft preferences among feasible candidates. This separation provides a controllable pipeline, where each component addresses a distinct aspect of the generation process, thereby enhancing both the interpretability and adjustability of the output chords. Through objective metrics and subjective evaluation, our decomposed system outperforms all end-to-end chord generation baselines in balancing chord diversity and music-theoretic feasibility. Ablation studies further confirm the complementary roles of each stage in creative exploration and constraint satisfaction.
cs.SD 2026-05-11 2 theorems

Audio-video models fail to keep physics consistent in transitions

Do Joint Audio-Video Generation Models Understand Physics?

Performance drops on event and environment changes, and even top systems produce impossible results when prompted to break physical rules.

Abstract:
Joint audio-video generation models are rapidly approaching professional production quality, raising a central question: do they understand audio-visual physics, or merely generate plausible sounds and frames that violate real-world consistency? We introduce AV-Phys Bench, a benchmark for evaluating physical commonsense in joint audio-video generation. AV-Phys Bench tests models across three scene categories: Steady State, Event Transition, and Environment Transition. It covers physics-grounded subcategories drawn from real-world scenes, plus Anti-AV-Physics prompts that deliberately request physically inconsistent audio-video behavior. Each generation is evaluated along five dimensions: visual semantic adherence, audio semantic adherence, visual physical commonsense, audio physical commonsense, and cross-modal physical commonsense. Across three proprietary and four open-source models, we find that Seedance 2.0 performs best overall, but all models remain far from robust physical understanding. Performance drops sharply on event-driven and environment-driven transitions, and even strong proprietary systems collapse on Anti-AV-Physics prompts. We further introduce AV-Phys Agent, a ReAct-style evaluator that combines a multimodal language model with deterministic acoustic measurement tools, producing rankings that closely align with human ratings. Our results identify cross-modal physical consistency and transition-driven scene dynamics as key open challenges for joint audio-video generation.
cs.SD 2026-05-08

PianoCoRe combines sources into 157k aligned piano performances

PianoCoRe: Combined and Refined Piano MIDI Dataset

Refined MIDI dataset with quality classifier and alignment fixes yields models more robust to new pieces than raw or smaller alternatives.

Abstract:
Symbolic music datasets with matched scores and performances are essential for many music information retrieval (MIR) tasks. Yet, existing resources often cover a narrow range of composers, lack performance variety, omit note-level alignments, or use inconsistent naming formats. This work presents PianoCoRe, a large-scale piano MIDI dataset that unifies and refines major open-source piano corpora. The dataset contains 250,046 performances of 5,625 pieces written by 483 composers, totaling 21,763 h of performed music. PianoCoRe is released in tiered subsets to support different applications: from large-scale analysis and pre-training (PianoCoRe-C and deduplicated PianoCoRe-B) to expressive performance modeling with note-level score alignment (PianoCoRe-A/A*). The note-aligned subset, PianoCoRe-A, provides the largest open-source collection of 157,207 performances aligned to 1,591 scores to date. In addition to the dataset, the contributions are: (1) a MIDI quality classifier for detecting corrupted and score-like transcriptions and (2) RAScoP, an alignment refinement pipeline that cleans temporal alignment errors and interpolates missing notes. The analysis shows that the refinement reduces temporal noise and eliminates tempo outliers. Moreover, an expressive performance rendering model trained on PianoCoRe demonstrates improved robustness to unseen pieces compared to models trained on raw or smaller datasets. PianoCoRe provides a ready-to-use foundation for the next generation of expressive piano performance research.
cs.SD 2026-05-08

Quantum spectrogram patches reach 0.87 AUROC in audio deepfake tests

Quantum Kernels for Audio Deepfake Detection Using Spectrogram Patch Features

Shallow four-qubit circuits on time-frequency patches outperform a classical SVM using the same features by five points on balanced data.

Abstract:
Quantum machine learning has emerged as a promising tool for pattern recognition, yet many audio-focused approaches still treat spectrograms as generic images and do not explicitly exploit their time-frequency structure. We propose Q-Patch, a quantum feature map tailored to audio that encodes local time-frequency patches from mel-spectrograms into quantum states using shallow, hardware-efficient circuits with adjacency-aware entanglement. Each selected patch is summarized by a compact four-dimensional acoustic descriptor and mapped to a four-qubit circuit with depth at most three, enabling practical quantum kernel construction under near-term constraints. We evaluate Q-Patch on an audio spoofing detection task using a controlled, balanced protocol and compare it with size-matched classical baselines. Q-Patch improves discrimination between bona fide and spoofed samples, achieving an area under the receiver operating characteristic curve (AUROC) of 0.87, compared with 0.82 for a radial basis function support vector machine (RBF-SVM) trained on the same patch-level features. Kernel-space analysis further reveals a clear class structure, with cross-class similarity around 0.615 and within-class self-similarity of 1.00. Overall, Q-Patch provides a practical framework for incorporating time-frequency-aware representations into quantum kernel learning for audio authenticity assessment in low-resource settings.
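
A self-contained statevector sketch of a four-qubit kernel in plain NumPy, assuming RY angle encoding of the four patch features and nearest-neighbour CZ entanglement; this is an illustrative feature map and kernel, not the paper's exact Q-Patch circuit.

```python
# Statevector sketch of a 4-qubit quantum kernel: RY angle encoding of four patch
# features, nearest-neighbour CZ entanglement, kernel = squared state overlap.
from functools import reduce
import numpy as np

def ry(theta):
    c, s = np.cos(theta / 2), np.sin(theta / 2)
    return np.array([[c, -s], [s, c]])

def cz(n_qubits, a, b):
    """Diagonal CZ gate between qubits a and b."""
    d = np.ones(2 ** n_qubits)
    for idx in range(2 ** n_qubits):
        bits = format(idx, f"0{n_qubits}b")
        if bits[a] == "1" and bits[b] == "1":
            d[idx] = -1.0
    return np.diag(d)

def feature_state(x):                    # x: four patch features scaled to [0, pi]
    state = reduce(np.kron, [ry(xi) for xi in x]) @ np.eye(16)[:, 0]   # RY layer on |0000>
    for a, b in [(0, 1), (1, 2), (2, 3)]:                              # adjacency-aware entanglement
        state = cz(4, a, b) @ state
    return state

def quantum_kernel(x, y):
    return abs(np.vdot(feature_state(x), feature_state(y))) ** 2

print(quantum_kernel([0.3, 1.1, 2.0, 0.7], [0.3, 1.1, 2.0, 0.7]))  # ~1.0 for identical patches
print(quantum_kernel([0.3, 1.1, 2.0, 0.7], [2.5, 0.2, 1.4, 3.0]))  # < 1.0 for different patches
```
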
cs.SD 2026-05-08

Melody and rhythm show no diversity correlation across cultures

Do Melody and Rhythm Coevolve?

Analysis of 27,000 songs finds rhythmic variety tracks ethnic mixing while melodic variety does not, pointing to separate pressures on each.

Abstract:
Music comprises two core structural components, melody and rhythm, that vary widely across cultures. Whether these components coevolve in a coupled way or follow independent trajectories remains unclear. We introduce a novel computational pipeline to extract vocal melodic pitch-interval and percussive inter-onset timing distributions from 27,628 popular songs across 59 countries, enabling large-scale cross-cultural comparison that bypasses traditional music annotations. Musical similarities between countries aligned with geographic and linguistic relationships, validating our approach. Substantial variation emerged in both melodic and rhythmic structures across countries, yet the diversity of the two components was not significantly correlated, challenging assumptions of coupled evolution. Only rhythmic diversity was significantly associated with ethnic and linguistic heterogeneity, while melodic diversity showed no such association. These findings suggest that melody and rhythm constitute partially independent systems shaped by distinct cultural and evolutionary pressures, rather than components of a single monolithic musical style.
cs.SD 2026-05-08

0.4B model clones voices across 30 languages without transcripts

X-Voice: Enabling Everyone to Speak 30 Languages via Zero-Shot Cross-Lingual Voice Cloning

Two-stage training removes prompt-text requirements while matching larger systems in cross-lingual quality.

Abstract:
In this paper, we present X-Voice, a 0.4B multilingual zero-shot voice cloning model that clones arbitrary voices and enables everyone to speak 30 languages. X-Voice is trained on a 420K-hour multilingual corpus using the International Phonetic Alphabet (IPA) as a unified representation. To eliminate the reliance on prompt text without complex preprocessing like forced alignment, we design a two-stage training paradigm. In Stage 1, we establish X-Voice-s1 through standard conditional flow-matching training and use it to synthesize 10K hours of speaker-consistent segments as audio prompts. In Stage 2, we fine-tune on these audio pairs with prompt text masked to derive X-Voice-s2, which enables zero-shot voice cloning without requiring transcripts of audio prompts. Architecturally, we extend F5-TTS by implementing a dual-level injection of language identifiers and decoupling and scheduling of Classifier-Free Guidance to facilitate multilingual speech synthesis. Subjective and objective evaluation results demonstrate that X-Voice outperforms existing flow-matching based multilingual systems like LEMAS-TTS and achieves zero-shot cross-lingual cloning capabilities comparable to billion-scale models such as Qwen3-TTS. To facilitate research transparency and community advancement, we open-source all related resources.
cs.SD 2026-05-07

2.5K pop samples restore accuracy in jazz-tuned chord model

Empirical Study of Pop and Jazz Mix Ratios for Genre-Adaptive Chord Generation

Jazz fine-tuning gains 7-9 points while pop performance drops 2.14 points without rehearsal but recovers at 1.65 times the jazz data volume.

Abstract:
Chord progression generation is practically important but understudied. Most large-scale symbolic music systems target melody, multi-track arrangement, or audio synthesis, and chord-only models tend to be relegated to conditioning components inside larger pipelines. This paper treats chord generation as a standalone task and addresses a question that arises whenever such a model is adapted across genres: how much old-domain data must be retained during fine-tuning to acquire a new domain without forgetting the old? I study jazz fine-tuning starting from a pop-pretrained 25M-parameter Music Transformer (84.24% top-1 chord accuracy on a held-out pop test set). The available jazz corpus is an order of magnitude smaller than the pop corpus, so every fine-tune run uses all 1,513 jazz training sequences. The swept variable is the volume of pop "rehearsal" data mixed alongside, taking values in {0, 1K, 2.5K, 5K, 10K}. Every fine-tuned model gains 7 to 9 points of jazz top-1. Pop accuracy collapses by 2.14 points under jazz-only fine-tuning, recovers to baseline at approximately 2.5K rehearsal samples (1.65x the jazz volume), and saturates beyond that point. A complementary observation: the metric-best run (F3, 2.5K mix) is not always the perceptually preferred one. The pop-leaning (10K) and jazz-leaning (1K) endpoints carry more committed stylistic identities that the author more often selects as finished output in informal listening. I discuss what this suggests for music co-creation tools but make no perceptual claim, since no formal listening study has been conducted. All six checkpoints are released on the HuggingFace Hub at https://huggingface.co/PearlLeeStudio.
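
A minimal sketch of how one rehearsal-mix fine-tuning set might be assembled, with placeholder sequence lists (the pop corpus size here is arbitrary); the paper's actual data pipeline and tokenization are not shown.

```python
# Sketch of building one rehearsal-mix fine-tuning set: all jazz sequences plus a
# fixed-size random sample of pop sequences (the swept variable).
import random

def build_finetune_set(jazz_sequences, pop_sequences, rehearsal_size, seed=0):
    rng = random.Random(seed)
    rehearsal = rng.sample(pop_sequences, min(rehearsal_size, len(pop_sequences)))
    mixed = list(jazz_sequences) + rehearsal
    rng.shuffle(mixed)
    return mixed

jazz = [f"jazz_{i}" for i in range(1513)]       # stand-ins for tokenized chord sequences
pop = [f"pop_{i}" for i in range(25000)]        # placeholder pop corpus
for k in [0, 1000, 2500, 5000, 10000]:          # the paper's rehearsal sweep
    print(k, len(build_finetune_set(jazz, pop, k)))    # 1513, 2513, 4013, 6513, 11513
```
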
cs.SD 2026-05-07

Gammatone-CNN hits 98.41% on underwater target sounds

Hearing the Ocean: Bio-inspired Gammatone-CNN framework for Robust Underwater Acoustic Target Classification

Cochlea-mimicking filters capture vessel harmonics in noise better than wavelet or cepstral methods.

Abstract:
This study presents a bio-inspired signal processing framework for robust Underwater Acoustic Target Recognition (UATR). The latest state-of-the-art methods often fail to resolve dense low-frequency harmonic structures in vessel propulsion signals under high-noise conditions, which the proposed framework addresses using a biologically inspired Gammatone filter bank that emulates the cochlea's nonlinear frequency selectivity. By distributing filters according to the Equivalent Rectangular Bandwidth (ERB) scale, the framework achieves a high-fidelity representation of engine-radiated tonals while effectively suppressing isotropic ambient interference. The resulting Cochleagram features are processed by a lightweight, custom-designed Convolutional Neural Network (CNN) that leverages large receptive fields to integrate spectral-temporal continuities. Experimental results on the VTUAD dataset demonstrate a state-of-the-art classification accuracy of 98.41%, outperforming Continuous Wavelet Transform and Mel-Frequency Cepstral Coefficient baselines by 3.5% and 7.7%, respectively. Furthermore, the framework achieves an inference latency of only 0.77 ms and a 0.971 Cohen's kappa score, validating its efficacy for real-time deployment on autonomous, low-power sonar hardware.
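
The ERB-scale filter placement can be sketched directly from the standard Glasberg and Moore ERB-rate formula; the filter count and frequency range below are illustrative, not the paper's settings.

```python
# ERB-spaced filter centre frequencies from the standard Glasberg & Moore (1990)
# ERB-rate scale; filter count and frequency range are illustrative choices.
import numpy as np

def hz_to_erb_rate(f):
    return 21.4 * np.log10(1.0 + 0.00437 * f)

def erb_rate_to_hz(e):
    return (10.0 ** (e / 21.4) - 1.0) / 0.00437

def erb_center_frequencies(f_lo=50.0, f_hi=8000.0, n_filters=64):
    erb_points = np.linspace(hz_to_erb_rate(f_lo), hz_to_erb_rate(f_hi), n_filters)
    return erb_rate_to_hz(erb_points)

cfs = erb_center_frequencies()
print(cfs[:4].round(1), cfs[-1].round(1))
# Spacing is dense at low frequencies, which is what lets the filter bank resolve
# closely spaced propulsion tonals while keeping high-frequency bands coarse.
```
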
cs.SD 2026-05-07 Recognition

Bangla ASR hits 0.2441 WER after Whisper fine-tuning

Bangla-WhisperDiar: Fine-Tuning Whisper and PyAnnote for Bangla Long-Form Speech Recognition and Speaker Diarization

Custom data and augmentations also lower diarization error to 0.2392 over pretrained baselines on test recordings.

Abstract:
Automatic Speech Recognition (ASR) and speaker diarization in Bangla remain challenging due to long-form recordings, diverse acoustic conditions, and significant speaker variability. This work addresses these two core tasks in Bangla spoken language understanding by developing robust systems for long-form ASR and speaker diarization. For ASR (Problem 1), we fine-tune the tugstugi bengaliai regional asr whisper medium model on a custom-curated dataset of approximately 15,000 chunked and aligned Bangla audio segments, employing full-weight training with extensive data augmentation including noise injection, reverb simulation, echo, clipping distortion, and pitch/time perturbation. For speaker diarization (Problem 2), we fine-tune the pyannote/segmentation-3.0 model using PyTorch Lightning on the competition-annotated diarization dataset, swapping the fine-tuned segmentation backbone into the pyannote/speaker-diarization-community-1 pipeline while retaining the pretrained speaker embedding and clustering components. Our ASR system achieves a Word Error Rate (WER) of 0.2441, while our diarization system achieves a Diarization Error Rate (DER) of 0.2392, both evaluated on the test set, demonstrating notable improvements over the respective pretrained baselines. We describe our complete pipeline, including data preprocessing, text normalization, audio augmentation, training strategies, inference optimization, and post-processing for both tasks.
cs.SD 2026-05-07

One audio model transcribes singing to aligned lyrics and melody

VocalParse: Towards Unified and Scalable Singing Voice Transcription with Large Audio Language Models

Interleaved prompts plus lyrics-first chain-of-thought yield structured scores without separate alignment stages.

Abstract:
High-quality singing annotations are fundamental to modern Singing Voice Synthesis (SVS) systems. However, obtaining these annotations at scale through manual labeling is unrealistic due to the substantial labor and musical expertise required, making automatic annotation highly necessary. Despite their utility, current automatic transcription systems face significant challenges: they often rely on complex multi-stage pipelines, struggle to recover text-note alignments, and exhibit poor generalization to out-of-distribution (OOD) singing data. To alleviate these issues, we present VocalParse, a unified singing voice transcription (SVT) model built upon a Large Audio Language Model (LALM). Specifically, our novel contribution is to introduce an interleaved prompting formulation that jointly models lyrics, melody, and word-note correspondence, yielding a generated sequence that directly maps to a structured musical score. Furthermore, we propose a Chain-of-Thought (CoT) style prompting strategy, which decodes lyrics first as a semantic scaffold, significantly mitigating the context disruption problem while preserving the structural benefits of interleaved generation. Experiments demonstrate that VocalParse achieves state-of-the-art SVT performance on multiple singing datasets. The source code and checkpoint are available at https://github.com/pymaster17/VocalParse.
cs.SD 2026-05-07

LLMs trail on audio tasks but no architecture wins

Benchmarking LLMs on the Massive Sound Embedding Benchmark (MSEB)

Benchmark of Gemini and GPT families across eight sound capabilities shows persistent gap yet leaves design choice open to use-case needs.

Abstract:
The Massive Sound Embedding Benchmark (MSEB) has emerged as a standard for evaluating the functional breadth of audio models. While initial baselines focused on specialized encoders, the shift toward "audio-native" Large Language Models (LLMs) suggests a new paradigm where a single multimodal backbone may replace complex, task-specific pipelines. This paper provides a rigorous empirical evaluation of leading LLMs - including members from the Gemini and GPT families - across the eight core MSEB capabilities to assess their efficacy and audio-text parity. Our results indicate that while a significant modality gap persists regarding performance and robustness, the empirical evidence for an "optimal" modeling approach remains inconclusive. Ultimately, the choice between audio-native and cascaded architectures depends heavily on specific use-case requirements and the underlying assumptions regarding latency, cost, and reasoning depth.
cs.SD 2026-05-07

Stage detection improves audio diffusion training

Stage-adaptive audio diffusion modeling

Tracking the slope of semantic discrepancy allows adaptive adjustments that speed convergence and raise metric scores in generation and restoration tasks.

Abstract:
Recent progress in diffusion-based audio generation and restoration has substantially improved performance across heterogeneous conditioning regimes, including text-conditioned audio generation and audio-conditioned super-resolution. However, training audio diffusion models remains computationally expensive, and most existing pipelines still rely on static optimization recipes that treat the relative importance of training signals as fixed throughout learning. In this work, we argue that a major source of inefficiency lies in the evolving balance between semantic acquisition and generation-oriented refinement. Early training places stronger emphasis on acquiring condition-aligned semantic structure and coarse global organization, whereas later training increasingly emphasizes temporal consistency, perceptual fidelity, and fine-detail refinement. To characterize this evolving balance, we introduce a progress-based regime variable derived from the training-time slope of an SSL-space discrepancy, which measures semantic progress during training. Based on this signal, we develop three complementary stage-aware mechanisms: decayed SSL guidance for early semantic bootstrapping, self-adaptive timestep sampling driven by the regime variable, and structure-aware regularization activated from convergent grouped organization in parameter space. We evaluate these mechanisms on text-conditioned audio generation and audio-conditioned super-resolution. Across both settings, the proposed stage-aware strategies improve convergence behavior and yield gains on the primary generation and spectral reconstruction metrics over standard static baselines. These results support the view that efficient audio diffusion training can benefit from treating external guidance, internal organization, and optimization emphasis as stage-dependent components rather than fixed ingredients.
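
A minimal sketch of the regime variable, assuming it is the least-squares slope of a logged SSL-space discrepancy over a recent window, rescaled by an early-training reference slope; the paper's exact normalisation and window size are not specified in the abstract.

```python
# Sketch of a progress/regime variable: the least-squares slope of the logged
# SSL-space discrepancy over a recent window, rescaled by an early reference slope.
import numpy as np

def discrepancy_slope(discrepancy_log, window=200):
    """Least-squares slope of the recent SSL-space discrepancy curve."""
    if len(discrepancy_log) < window:
        return float("nan")
    y = np.asarray(discrepancy_log[-window:])
    return float(np.polyfit(np.arange(window), y, deg=1)[0])

def regime_variable(slope, slope_ref):
    """~1 while semantic acquisition dominates, ~0 once refinement takes over."""
    return float(np.clip(slope / slope_ref, 0.0, 1.0))

steps = np.arange(2000)
trace = np.exp(-steps / 400) + 0.005 * np.random.randn(2000)   # toy discrepancy trace
early = discrepancy_slope(list(trace[:400]))
print(regime_variable(early, early), regime_variable(discrepancy_slope(list(trace)), early))
# ~1.0 early in training, close to 0 once the discrepancy plateaus
```
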
cs.SD 2026-05-06 2 theorems

Web tool maps ship underwater noise worldwide in near real time

ShipEcho -- An Interactive Tool for Global Mapping of Underwater Radiated Noise from Vessels

Public AIS data and standard models produce actionable noise maps for marine planning without specialist sensors.

Abstract:
Underwater radiated noise from vessels (V-URN) is a recognized environmental stressor that negatively impacts marine ecosystems. Significant resources are invested in the development of V-URN monitoring indicators, regulatory frameworks, and management-oriented assessments. One approach with high potential for impact is V-URN mapping, which can provide actionable spatiotemporal information for environmental assessment and mitigation planning. Producing management-scale maps remains challenging as passive acoustic measurements are spatially sparse and many operational systems depend on specialist workflows and costly access to wide-area vessel activity data. To address these constraints, we introduce ShipEcho, a freely accessible web-based Geographic Information System (GIS) that provides near-real-time V-URN mapping using vessel data acquired through a community-based AIS exchange. Using established vessel source level (SL) models and propagation modeling informed by bathymetric data, ShipEcho produces near-real-time and cumulative noise maps across regions worldwide. These include sound pressure levels and sound exposure levels using standard indicators, including the 63 Hz and 125 Hz one-third octave bands and a 20-2000 Hz broadband level. We describe the system architecture, data pipeline, modeling workflow, and key assumptions, and evaluate map accuracy through comparison with acoustic recordings. We then demonstrate how ShipEcho can support management-level assessment, decision-making, and policy initiatives through practical use cases.
cs.SD 2026-05-06

0.1B omni model reaches 0.09 CER in speech-text consistency

MiniMind-O Technical Report: An Open Small-Scale Speech-Native Omni Model

Open system uses frozen encoders and middle-layer bridging for text, speech and image inputs, with full code and datasets released.

Abstract:
MiniMind-O is an open 0.1B-scale omni model built on the MiniMind language model. It accepts text, speech, and image inputs, and returns both text and streaming speech. The release includes model code, checkpoints, and the main Parquet training datasets for text-to-audio, image-to-text, and audio-to-audio training, making the complete interaction loop directly inspectable. The model uses a full MiniMind backbone as the Thinker and an independent four-layer Talker made from MiniMind blocks. Frozen SenseVoice-Small and SigLIP2 encoders provide speech and image features, which are mapped by lightweight MLP projectors and injected at modality-placeholder positions. The Talker reads a middle-layer Thinker state together with an autoregressive eight-layer Mimi-code buffer. Speaker control is handled by a dedicated speaker token, right-aligned reference codec prompts, and precomputed CAM++ speaker embeddings, so voice conditioning remains part of the audio-code context rather than a separate TTS module. With a 768-dimensional Talker, the dense and MoE variants reach average CERs of 0.0897 and 0.0900 in Thinker--Talker consistency evaluation, with overall voice-cloning similarities of 0.5995 and 0.5937. Beyond reporting a working system, the paper identifies three scale-critical design choices for small omni models: middle-layer semantic bridging, a released multimodal sequence format, and a parameter-efficient eight-codebook interface.
cs.SD 2026-05-06

Models spot unknown sounds and learn them incrementally

Towards Open World Sound Event Detection

WOOT uses deformable attention and feature separation to flag novel events while matching closed-world leaders on known ones.

Abstract:
Sound Event Detection (SED) plays a vital role in audio understanding, with applications in surveillance, smart cities, healthcare, and multimedia indexing. However, conventional SED systems operate under a closed-world assumption, limiting their effectiveness in real-world environments where novel acoustic events frequently emerge. Inspired by the success of open-world learning in computer vision, we introduce the Open-World Sound Event Detection (OW-SED) paradigm, where models must detect known events, identify unseen ones, and incrementally learn from them. To tackle the unique challenges of OW-SED, such as overlapping and ambiguous events, we propose a 1D Deformable architecture that leverages deformable attention to adaptively focus on salient temporal regions. Furthermore, we design a novel Open-World Deformable Sound Event Detection Transformer (WOOT) framework incorporating feature disentanglement to separate class-specific and class-agnostic representations, together with a one-to-many matching strategy and a diversity loss to enhance representation diversity. Experimental results demonstrate that our method achieves marginally superior performance compared to existing leading techniques in closed-world settings and significantly improves over existing baselines in open-world scenarios.
0
2
cs.SD 2026-05-06

PHALAR improves audio stem retrieval accuracy up to 70% with half the parameters

PHALAR: Phasors for Learned Musical Audio Representations

Contrastive model using learned spectral pooling and complex head sets new benchmarks on stem retrieval while capturing beat and chord info.

Figure from the paper full image
abstract click to expand
Stem retrieval, the task of matching missing stems to a given audio submix, is a key challenge currently limited by models that discard temporal information. We introduce PHALAR, a contrastive framework achieving a relative accuracy increase of up to $\approx 70\%$ over the state-of-the-art while requiring $<50\%$ of the parameters and a 7$\times$ training speedup. By utilizing a Learned Spectral Pooling layer and a complex-valued head, PHALAR enforces pitch-equivariant and phase-equivariant biases. PHALAR establishes new retrieval state-of-the-art across MoisesDB, Slakh, and ChocoChorales, correlating significantly higher with human coherence judgment than semantic baselines. Finally, zero-shot beat tracking and linear chord probing confirm that PHALAR captures robust musical structures beyond the retrieval task.
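The retrieval setup here is contrastive: a submix embedding should score highest against its matching missing stem. A minimal InfoNCE-style sketch with random placeholder embeddings (the learned spectral pooling and complex-valued head from the paper are not reproduced):

```python
import numpy as np

rng = np.random.default_rng(0)
dim, batch = 128, 16

# Placeholder embeddings standing in for the submix and stem encoders;
# stems[i] is assumed to be the true match for submix[i]
submix = rng.normal(size=(batch, dim))
stems = rng.normal(size=(batch, dim))

def l2norm(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

sims = l2norm(submix) @ l2norm(stems).T / 0.07   # cosine similarities / temperature

# InfoNCE loss: each submix should retrieve its own stem (the diagonal)
logprobs = sims - np.log(np.exp(sims).sum(axis=1, keepdims=True))
loss = -np.mean(np.diag(logprobs))

# Retrieval accuracy: fraction of submixes whose top-scoring stem is correct
accuracy = np.mean(sims.argmax(axis=1) == np.arange(batch))
print(f"loss={loss:.3f}  top-1 retrieval={accuracy:.2f}")
```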
0
2
cs.SD 2026-05-06

Task vectors merge bioacoustic models into 661-species classifier

Ecologically-Constrained Task Arithmetic for Multi-Taxa Bioacoustic Classifiers Without Shared Data

Independently trained encoders on separate taxa combine arithmetically without data sharing, preserving privacy and boosting accuracy for underrepresented species.

Figure from the paper full image
abstract click to expand
Training data for bioacoustics is scattered across taxa, regions, and institutions. Centralizing it all is often infeasible. We show that independently fine-tuned BEATs encoders can be composed into a unified 661-species classifier via task vector arithmetic without sharing data. We find that bioacoustic task vectors are near-orthogonal (cosine 0.01-0.09). Their separation aligns closely with spectral distribution distance, a gradient consistent with the acoustic niche hypothesis. This geometry makes simple averaging optimal while sign-conflict methods reduce accuracy by one to six percentage points. Composition also creates an asymmetric gap: species-rich groups lose accuracy relative to joint training while underrepresented taxa gain, a redistribution useful for equitable biodiversity monitoring. We verify linear mode connectivity across all taxonomic pairs, demonstrate zero-shot transfer to new regions, and identify domain negation as a boundary condition where composition fails. These results enable a collaborative paradigm for bioacoustics where institutions share only task vectors to assemble multi-taxa classifiers, preserving data privacy.
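Task arithmetic as used here is parameter-space addition: each institution shares only the difference between its fine-tuned encoder and the shared pretrained weights, and the merged model averages those differences. A minimal sketch with flat weight vectors (real BEATs checkpoints are dictionaries of tensors, and each taxon keeps its own classification head):

```python
import numpy as np

rng = np.random.default_rng(0)
theta_0 = rng.normal(size=10_000)   # shared pretrained encoder weights

# Each institution fine-tunes privately and shares only its task vector
fine_tuned = [theta_0 + 0.01 * rng.normal(size=theta_0.shape) for _ in range(3)]
task_vectors = [theta - theta_0 for theta in fine_tuned]

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# The paper reports bioacoustic task vectors are near-orthogonal (cosine 0.01-0.09),
# which is why plain averaging works well
print(cosine(task_vectors[0], task_vectors[1]))

# Merge by simple averaging of task vectors (found optimal in the paper)
theta_merged = theta_0 + np.mean(task_vectors, axis=0)
```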
1 0
0
cs.SD 2026-05-06

Python package unifies music performance feature pipelines

Cosmodoit: A Python Package for Adaptive, Efficient Pipelining of Feature Extraction from Performed Music

Cosmodoit combines performance-to-score alignment, symbolic and audio feature extraction, and dependency-aware updates to cut redundant work on performed music.

Figure from the paper full image
abstract click to expand
Computational analysis of performed music is a key component of music information research, as performance shapes much of the music we hear. Music performance analysis studies the acoustic variations introduced by performers and how these variations reflect musical interpretation and structure. Although many algorithms and tools exist for tasks such as performance-to-score alignment and symbolic or audio feature extraction, they are spread across different programming languages and data formats, making them difficult to combine efficiently. To address this problem, we present Cosmodoit, a novel Python package designed to streamline feature extraction from performed music. Cosmodoit integrates performance-to-score alignment with symbolic and audio feature extraction in a modular, flexible pipeline that supports selective processing, dependency-aware computation, and incremental updates. Its extensible design reduces duplicated work, minimizes errors, and enables efficient large-scale processing. By accommodating algorithms implemented in multiple languages and allowing parameter tuning for consistent feature extraction, Cosmodoit provides a versatile and practical tool for both research and development in music performance analysis.
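The dependency-aware, incremental behaviour described here amounts to recomputing a feature only when one of its inputs has changed. A minimal content-hash sketch of that idea (the file name and the `extract_tempo` step are hypothetical; Cosmodoit itself builds on richer task definitions):

```python
import hashlib, json, pathlib

CACHE = pathlib.Path("feature_cache.json")

def file_hash(path):
    return hashlib.sha256(pathlib.Path(path).read_bytes()).hexdigest()

def run_if_changed(task_name, inputs, action):
    """Run `action` only when some input file's content hash has changed."""
    cache = json.loads(CACHE.read_text()) if CACHE.exists() else {}
    state = {p: file_hash(p) for p in inputs}
    if cache.get(task_name) == state:
        print(f"{task_name}: up to date, skipping")
        return
    action()
    cache[task_name] = state
    CACHE.write_text(json.dumps(cache))

# Hypothetical feature-extraction step depending on a performance MIDI file
def extract_tempo():
    print("extracting tempo curve ...")

run_if_changed("tempo", ["performance.mid"], extract_tempo)
```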
0
0
cs.SD 2026-05-06

Dual fusion model detects speech and environment deepfakes separately

Deepfake Audio Detection Using Self-supervised Fusion Representations

XLS-R and BEATs branches with Matching Head reach 70.2% F1 and 16.54% environmental EER by isolating manipulated audio components.

Figure from the paper full image
abstract click to expand
This paper describes a submission to the Environment-Aware Speech and Sound Deepfake Detection Challenge (ESDD2) 2026, which addresses component-level deepfake detection using the CompSpoofV2 dataset, where speech and environmental sounds may be independently manipulated. To address this challenge, a dual-branch deepfake detection framework is proposed to jointly model speech and environmental contextual representations from input audio. Two pretrained models, XLS-R for speech and BEATs for environmental sound, are used to extract complementary contextual representations. A Matching Head is introduced to model representation differences through statistical normalization and representation interaction, enabling estimation of the original class. In parallel, multi-head cross-attention enables effective information exchange between speech and environmental components. The refined representations are processed with residual connections and layer normalization, and passed to an AASIST classifier to predict speech-based and environment-based spoofing probabilities. The model outputs original, speech, and environment predictions. On the test set, the proposed system achieves an F1-score of 70.20% and an environmental EER of 16.54%, outperforming the baseline system.
0
0
cs.SD 2026-05-06

Tiny neural net spots shearwater calls on battery-powered recorder

Smart Passive Acoustic Monitoring: Embedding a Classifier on AudioMoth Microcontroller

An optimized 1D-CNN runs in 20 milliseconds on AudioMoth, enabling selective recording to save power and storage for long-term monitoring.

abstract click to expand
Passive Acoustic Monitoring (PAM) is an efficient and non-invasive method for surveying ecosystems at a reduced cost. Typically, autonomous recorders allow the acquisition of vast bioacoustic datasets which are then analyzed. However, power consumption and data storage are both scarce and limit the duration of acquisition campaigns. To address this issue, we propose a smart PAM system which allows the in-situ analysis of the soundscape by embedding a classifier directly onto an AudioMoth microcontroller. Specifically, we propose an optimized yet simple 1D Convolutional Neural Network (1D-CNN) to classify the raw audio. The model focuses on the specific call of Scopoli Shearwater seabirds (endangered species) and is trained on a real-world dataset with a classification accuracy of 91\% (balanced accuracy of 89\%). We also propose a process to optimize the model to fit the severe resource constraints of the AudioMoth, achieving a \~10kB RAM memory footprint and 20ms inference time. Finally, we present an open-source tutorial of our model optimization and export strategy which can be used for embedding models beyond the scope of our study. Our modified version of the AudioMoth firmware adds two functions: (F1) which selectively records data when the target species has been detected and (F2) which logs the continuous classification results in real time. This work intends to facilitate the conception of intelligent sensors, enhancing the efficiency and scalability of bioacoustic monitoring campaigns.
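The classifier described here is a small 1D-CNN operating on raw audio windows. A rough PyTorch sketch of a comparably tiny model (layer sizes and the 16 kHz window are guesses, not the paper's architecture; the actual deployment uses an optimized, quantized export for the AudioMoth):

```python
import torch
import torch.nn as nn

class TinyCallDetector(nn.Module):
    """Small 1D-CNN for binary call detection on raw audio windows (hypothetical sizes)."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(1, 8, kernel_size=64, stride=8), nn.ReLU(),
            nn.MaxPool1d(4),
            nn.Conv1d(8, 16, kernel_size=16, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),
        )
        self.classifier = nn.Linear(16, 2)   # call / no-call

    def forward(self, x):                    # x: (batch, 1, samples)
        h = self.features(x).squeeze(-1)
        return self.classifier(h)

model = TinyCallDetector()
dummy = torch.randn(1, 1, 16_000)            # ~1 s of audio at an assumed 16 kHz rate
print(model(dummy).shape, sum(p.numel() for p in model.parameters()), "params")
```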
0
0
cs.SD 2026-05-06

Aesthetic features lift AI music preference prediction on unseen generators

APEX: Large-scale Multi-task Aesthetic-Informed Popularity Prediction for AI-Generated Music

Joint training on popularity and quality dimensions from 211k tracks improves human battle accuracy across eleven new systems.

abstract click to expand
Music popularity prediction has attracted growing research interest, with relevance to artists, platforms, and recommendation systems. However, the explosive rise of AI-generated music platforms has created an entirely new and largely unexplored landscape, where a surge of songs is produced and consumed daily without the traditional markers of artist reputation or label backing. Key, yet unexplored in this pursuit is aesthetic quality. We propose APEX, the first large-scale multi-task learning framework for AI-generated music, trained on over 211k songs (10k hours of audio) from Suno and Udio, that jointly predicts engagement-based popularity signals - streams and likes scores - alongside five perceptual aesthetic quality dimensions from frozen audio embeddings extracted from MERT, a self-supervised music understanding model. Aesthetic quality and popularity capture complementary aspects of music that together prove valuable: in an out-of-distribution evaluation on the Music Arena dataset, comprising pairwise human preference battles across eleven generative music systems unseen during training, including aesthetic features consistently improves preference prediction, demonstrating strong generalisation of the learned representations across generative architectures.
1 0
0
cs.SD 2026-05-06

Contrastive loss cuts unseen-accent ASR errors by 25-29%

Contrastive Regularization for Accent-Robust ASR

Utterance-level regularization improves encoder stability during CTC fine-tuning without accent labels or architecture changes.

Figure from the paper full image
abstract click to expand
ASR systems based on self-supervised acoustic pretraining and CTC fine-tuning achieve strong performance on native speech but remain sensitive to accent variability. We investigate supervised contrastive learning (SupCon) as a lightweight, accent-invariant auxiliary objective for CTC fine-tuning. An utterance-level contrastive loss regularizes encoder representations without architectural modification or explicit accent supervision. Experiments on the L2-ARCTIC benchmark show consistent WER reductions across multiple pretrained encoders, with up to 25 -- 29\% relative reduction under unseen-accent evaluation. Analysis using within-transcript cosine dispersion indicates that SupCon promotes more compact and stable representation geometry under accent variability. Overall, SupCon provides an effective and model-agnostic regularization strategy for improving accent robustness.
1 0
0
cs.SD 2026-05-05

Phoneme checks detect emotional deepfakes

Phoneme-Level Deepfake Detection Across Emotional Conditions Using Self-Supervised Embeddings

Certain sounds diverge more in manipulated speech, allowing consistent detection across emotions and synthesis systems.

Figure from the paper full image
abstract click to expand
Recent advances in emotional voice conversion (EVC) have enabled the generation of expressive synthetic speech, raising new concerns in audio deepfake detection. Existing approaches treat speech as a homogeneous signal and largely overlook its internal phonetic structure, limiting their interpretability in emotionally conditioned settings. In this work, we propose a phoneme-level framework to analyze emotionally manipulated synthetic speech using real and EVC-generated speech under matched emotional conditions with shared transcripts, phoneme-aligned TextGrids, and WavLM-based embeddings. Our results show that phoneme behavior varies across categories, with complex vowels and fricatives exhibiting higher divergence while simpler phonemes remain more stable. Phonemes with larger distributional differences are also found to be more easily detected, consistently across multiple emotions and synthesis systems. These findings demonstrate that phoneme-level analysis is an effective and interpretable approach for detecting emotionally manipulated synthetic speech.
0
0
cs.SD 2026-05-05

Offline distillation from DP teacher prevents collapse in private speech classifiers

Private Speech Classification without Collapse: Stabilized DP Training and Offline Distillation

A multimodal teacher is trained privately on D_priv; its outputs are then distilled once to an audio-only student on a recording-disjoint D_aux.

Figure from the paper full image
abstract click to expand
We study example-level private supervised speech classification under a practical release constraint: training may access privileged side information, but the released model must be audio-only. This setting is important because speech systems can often exploit richer side information during development, whereas deployment and release require a lightweight unimodal model with auditable privacy guarantees. Using DP-SGD on the private dataset $D_{\text{priv}}$, we identify a strong-privacy failure mode ($\epsilon \le 1$) on imbalanced tasks, where training may collapse to a near single-class predictor, a phenomenon that overall accuracy can obscure. We therefore emphasize Macro-F1, balanced accuracy, and a simple collapse diagnostic. This failure is especially problematic in our release setting because a collapsed private teacher cannot provide useful supervision for the downstream audio-only student. To address this setting under strong privacy, we propose a two-stage protocol: (i) train a (possibly multimodal) DP teacher on $D_{\text{priv}}$, and (ii) distill an audio-only student on a fixed, recording-disjoint auxiliary dataset $D_{\text{aux}}$ using one-shot offline teacher probability outputs, releasing only the student. The DP guarantee applies only to $D_{\text{priv}}$; we make no DP claim for $D_{\text{aux}}$, and privacy of the released student with respect to $D_{\text{priv}}$ follows by post-processing. We frame this setting as involving four coupled bottlenecks: speech-induced optimization instability under DP-SGD, minority-class erosion under clipping and noise, teacher over-reliance on privileged modalities unavailable at deployment, and train--deploy modality mismatch. We address them with a DP-stabilizing acoustic front-end (DSAF), minibatch-adaptive bounded loss reweighting (AW-DP), privileged-modality dropout, and offline teacher-to-student distillation.
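The release-relevant step is stage (ii): the student never touches D_priv and only sees the teacher's stored probability outputs on D_aux, so privacy of the released model follows by post-processing. A minimal sketch of offline distillation with cross-entropy against soft teacher targets (shapes and the linear student head are illustrative, not the paper's model):

```python
import numpy as np

rng = np.random.default_rng(0)
n, classes, dim = 256, 4, 32

# Precomputed, one-shot teacher probabilities on the auxiliary set D_aux
teacher_probs = rng.dirichlet(np.ones(classes), size=n)
features = rng.normal(size=(n, dim))   # audio-only student features on D_aux

W = np.zeros((dim, classes))           # linear student head (sketch)

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

for _ in range(200):                   # plain gradient descent on soft-target cross-entropy
    probs = softmax(features @ W)
    grad = features.T @ (probs - teacher_probs) / n
    W -= 0.5 * grad

loss = -np.mean(np.sum(teacher_probs * np.log(softmax(features @ W) + 1e-12), axis=1))
print(f"distillation cross-entropy: {loss:.3f}")
```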
0
0
cs.SD 2026-05-05

Large-model adaptation yields usable Tibetan speech synthesis

Tibetan-TTS: Low-Resource Tibetan Speech Synthesis with Large Model Adaptation

Data enhancement, text tokenizer changes and cross-lingual training produce natural output despite scarce native recordings.

Figure from the paper full image
abstract click to expand
Tibetan text-to-speech (TTS) has long been challenged by scarce speech resources, significant dialectal variation, and the complex mapping between written text and spoken pronunciation. To address these issues, this work presents, to the best of our knowledge, the first large-model-based Tibetan TTS system in the industry, built upon a large speech synthesis model developed by Xingchen AGI Lab. The proposed system integrates data quality enhancement, Tibetan-oriented text representation and tokenizer adaptation, and cross-lingual adaptive training for low-resource Tibetan speech synthesis. Experimental results show that the system can generate stable, natural, and intelligible Tibetan speech under low-resource conditions. In subjective evaluation, the MOS scores of the syllable-level and BPE-based systems reach 4.28 and 4.35, while their pronunciation accuracies reach 97.6% and 96.6%, respectively, outperforming an external commercial Tibetan TTS interface. These results demonstrate that combining a large-model backbone with Tibetan-oriented text representation adaptation and cross-lingual adaptive training enables highly usable low-resource Tibetan speech synthesis, and also provides a technical foundation for future unified multi-dialect Tibetan speech synthesis.
0
0
cs.SD 2026-05-05

Word-level speech inpainting evades standard deepfake detectors

Toward Fine-Grained Speech Inpainting Forensics: A Dataset, Method, and Metric for Multi-Region Tampering Localization

MIST dataset and ISA method locate 1-3 small tampered segments that comprise just 2-7% of each utterance.

Figure from the paper full image
abstract click to expand
Recent advances in voice cloning and text-to-speech synthesis have made partial speech manipulation - where an adversary replaces a few words within an utterance to alter its meaning while preserving the speaker's identity - an increasingly realistic threat. Existing audio deepfake detection benchmarks focus on utterance-level binary classification or single-region tampering, leaving a critical gap in detecting and localizing multiple inpainted segments whose count is unknown a priori. We address this gap with three contributions. First, we introduce MIST (Multiregion Inpainting Speech Tampering), a large-scale multilingual dataset spanning 6 languages with 1-3 independently inpainted word-level segments per utterance, generated via LLM-guided semantic replacement and neural voice cloning, with fake content constituting only 2-7% of each utterance. Second, we propose ISA (Iterative Segment Analysis), a backbone-agnostic framework that performs coarse-to-fine sliding-window classification with gap-tolerant region proposal and boundary refinement to recover all tampered regions without prior knowledge of their count. Third, we define SF1@tau, a segment-level F1 metric based on temporal IoU matching that jointly evaluates region count accuracy and localization precision. Zero-shot evaluation reveals that partial inpainting at word granularity remains unsolved by existing deepfake detectors: utterance-level classifiers trained on fully synthesized speech assign near zero fake probability to MIST utterances where only 2-7% of content is manipulated. ISA consistently outperforms non-iterative baselines in this challenging setting, and the dataset, code, and evaluation toolkit are publicly released.
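The SF1@tau metric pairs predicted and reference tampered segments by temporal IoU and scores the pairing as an F1. A minimal sketch under the assumption of greedy one-to-one matching (the paper's exact matching rule may differ):

```python
def iou(a, b):
    """Temporal IoU of two (start, end) segments in seconds."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def sf1_at_tau(pred, ref, tau=0.5):
    """Segment-level F1: a prediction counts as a true positive if it matches an
    unused reference segment with IoU >= tau (greedy one-to-one matching)."""
    used, tp = set(), 0
    for p in pred:
        best_j, best_iou = None, 0.0
        for j, r in enumerate(ref):
            if j not in used and iou(p, r) > best_iou:
                best_j, best_iou = j, iou(p, r)
        if best_j is not None and best_iou >= tau:
            used.add(best_j)
            tp += 1
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(ref) if ref else 0.0
    return 2 * precision * recall / (precision + recall) if tp else 0.0

print(sf1_at_tau(pred=[(1.0, 1.4), (3.0, 3.2)], ref=[(1.05, 1.45), (5.0, 5.3)]))
```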
0
0
cs.SD 2026-05-04 2 theorems

Margin loss on pre-trained features lifts language ID accuracy

Spoken Language Identification with Pre-trained Models and Margin Loss

The method achieves 85.95% macro accuracy and 17.08% EER on Tidy-X by separating language from speaker traits.

abstract click to expand
For the speaker-controlled spoken language identification task proposed in the TidyLang Challenge 2026, this paper proposes a language identification method based on pre-trained models and margin-based losses. The proposed method adopts a pre-trained ECAPA-TDNN as the feature encoder and incorporates margin-based losses to enhance the discriminative ability of language representations, thereby improving inter-class separability and reducing the interference of non-linguistic factors such as speaker characteristics. Experimental results on the Tidy-X dataset show that the proposed method achieves 85.95% macro accuracy and 90.96% micro accuracy on the language identification task and 17.08% equal error rate (EER) on the verification task. Compared with the official baseline, the macro accuracy improves by 45.7%, the micro accuracy improves by 15.2%, and the EER is reduced by approximately 50.8%, demonstrating the effectiveness of the proposed method. The code will be released at https://github.com/PunkMale/TidyLang2026.
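The margin-based loss referenced here, in the additive angular margin (AAM-softmax) style commonly paired with ECAPA-TDNN, adds an angular margin to the target-class cosine before the softmax. A minimal numpy sketch (the margin and scale values are typical defaults, not necessarily the paper's settings):

```python
import numpy as np

def aam_softmax_loss(embeddings, weights, labels, s=30.0, m=0.2):
    """Additive angular margin softmax: penalize the target-class cosine by margin m."""
    e = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    w = weights / np.linalg.norm(weights, axis=0, keepdims=True)
    cos = np.clip(e @ w, -1.0, 1.0)                        # (batch, classes)
    theta = np.arccos(cos)
    target = cos.copy()
    rows = np.arange(len(labels))
    target[rows, labels] = np.cos(theta[rows, labels] + m)  # margin on the target class
    logits = s * target
    logits -= logits.max(axis=1, keepdims=True)
    logprobs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(logprobs[rows, labels])

rng = np.random.default_rng(0)
emb = rng.normal(size=(8, 192))     # language embeddings from the encoder (placeholder)
W = rng.normal(size=(192, 6))       # one weight column per language class
y = rng.integers(0, 6, size=8)
print(aam_softmax_loss(emb, W, y))
```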
1 0
0
cs.SD 2026-05-04

Benchmark shows AI music-dance pairs often miss the beat

TMD-Bench: A Multi-Level Evaluation Paradigm for Music-Dance Co-Generation

TMD-Bench finds leading generators produce quality music and video but deliver inconsistent beat-level alignment between them.

Figure from the paper full image
abstract click to expand
Unified audio-visual generation is rapidly gaining industrial and creative relevance, enabling applications in virtual production and interactive media. However, when moving from general audio-video synthesis to music-dance co-generation, the task becomes substantially harder: musical rhythm, phrasing, and accents must drive choreographic motion at fine temporal resolution, and such rhythmic coupling is not captured by unimodal metrics or generic audiovisual consistency scores used in current evaluation practice. We introduce TMD-Bench, a benchmark for text-driven music-dance co-generation that assesses systems across unimodal generation quality, instruction adherence, and cross-modal rhythmic alignment. The benchmark integrates computable physical metrics with perceptual multimodal judgments, and is supported by a curated rhythm-aligned music-dance dataset and a fine-grained Music Captioner for structured music semantics. TMD-Bench further reveals that (i) modern commercial audio-visual models, such as Veo 3 and Sora 2, produce high-quality music and video, while rhythmic coupling remains less consistently optimized and leaves room for improvement, and (ii) our unified baseline RhyJAM trained on rhythm-aligned data achieves competitive beat-level synchronization while maintaining competitive unimodal fidelity. This presents prospects for building next-generation music-dance models that explicitly optimize rhythmic and kinetic coherence.
0
0
cs.SD 2026-05-04

Music generation works with one acoustic token hierarchy

Khala: Scaling Acoustic Token Language Models Toward High-Fidelity Music Generation

A 64-layer RVQ model produces both structure and fine details in a single space, with lyric alignment emerging without a separate semantic token stage.

Figure from the paper full image
abstract click to expand
A common design pattern in high-quality music generation is to handle structure and fidelity in different representation spaces: a generator first models high-level structure, followed by diffusion-based or neural decoding stages that reconstruct fine details. In this work, we explore an alternative view: both may be progressively modeled within a single deep acoustic-token hierarchy. To study this, we build a 64-layer residual vector quantization (RVQ) acoustic representation and propose a two-stage coarse-to-fine generation framework. A backbone model first generates coarse acoustic tokens for the full track, and a super-resolution model then completes finer tokens within the same acoustic token space. The super-resolution stage works at full-track scale and refines tokens layer by layer while running in parallel over time, leading to a fixed 62-step inference process. To jointly improve lyric alignment and fine-detail reconstruction, we further introduce hybrid-attention training: the alignment objective uses causal attention, while layer-wise refinement uses full attention. A key finding is that text--vocal alignment can emerge within pure acoustic-token language modeling, without requiring a separate semantic token stage. Moreover, initializing the super-resolution model from the trained backbone significantly improves convergence and final quality. Taken together, our results suggest that high-quality music generation can be effectively pursued without separating structure and fidelity into heterogeneous representation spaces. Instead, both can be progressively modeled within a unified acoustic-token hierarchy, pointing toward a simpler and more unified path to high-quality music generation.
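Residual vector quantization, the backbone representation here, quantizes a frame with the first codebook, then quantizes the remaining residual with the next, and so on, with deeper layers adding finer detail. A minimal sketch with random codebooks (the paper's tokenizer uses 64 layers and learned codebooks):

```python
import numpy as np

rng = np.random.default_rng(0)
dim, codebook_size, n_layers = 16, 256, 4
codebooks = rng.normal(size=(n_layers, codebook_size, dim))

def rvq_encode(x, codebooks):
    """Quantize x layer by layer: each codebook encodes the residual left over
    by the previous layers."""
    residual, codes = x.copy(), []
    for cb in codebooks:
        idx = np.argmin(np.linalg.norm(cb - residual, axis=1))
        codes.append(int(idx))
        residual = residual - cb[idx]
    return codes

def rvq_decode(codes, codebooks):
    return sum(cb[i] for cb, i in zip(codebooks, codes))

x = rng.normal(size=dim)             # one acoustic frame (placeholder)
codes = rvq_encode(x, codebooks)
recon = rvq_decode(codes, codebooks)
print(codes, float(np.linalg.norm(x - recon)))
```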
0
0
cs.SD 2026-05-04

Correcting fusion bottlenecks lifts AV task performance

Delayed Commitment for Representation Readiness in Stage-wise Audio-Visual Learning

Estimating readiness deficiency and applying support-aware fixes in intermediate layers improves separation, localization, and recognition.

Figure from the paper full image
abstract click to expand
Stage-wise audio-visual encoders propagate fused intermediate states across layers, making the formation of later representations depend on the readiness of earlier fusion states. Strong local audio-visual agreement provides useful correspondence evidence, yet a fused state also needs sufficient cross-layer and cross-modal support before it can reliably guide later fusion. This paper studies this issue through propagation-aware representation readiness and formulates premature perceptual commitment as a readiness-deficiency problem, where local plausibility, propagation influence, and support insufficiency jointly appear at an intermediate stage. We propose the Delayed Perceptual Commitment Network (DPC-Net), an encoder-level framework that estimates an observable readiness-deficiency surrogate, localizes the intervention-sensitive bottleneck, and applies support-aware correction with cross-layer and cross-modal evidence. DPC-Net preserves task-specific heads, losses, decoding modules, and evaluation protocols, making it applicable to different audio-visual tasks through encoder-side intervention. Experiments on audio-visual speech separation, audio-visual event localization, and audio-visual speech recognition show consistent improvements across reconstruction, localization, and recognition regimes. Further analyses on component contribution, selection criteria, counterfactual intervention, and readiness trajectories support the effectiveness of readiness-guided bottleneck correction.
0
0
cs.SD 2026-05-04

Mel-domain watermarking attributes AI speech after distortions

MelShield: Robust Mel-Domain Audio Watermarking for Provenance Attribution of AI Generated Synthesized Speech

Binary payloads added to the Mel-spectrogram before synthesis yield near-100 percent extraction accuracy under compression and noise.

Figure from the paper full image
abstract click to expand
In this paper, we propose MelShield, a robust, in-generation, keyed audio watermarking framework that embeds identifiable signals into AI-generated audio for copyright protection and reliable attribution. Specifically, MelShield operates in the Mel-spectrogram domain during the generation process, targeting intermediate acoustic representations in Mel-conditioned pipelines for text-to-speech (TTS) generation. The core idea is to treat the intermediate Mel-spectrogram as the host signal and embed a short binary payload via low-energy, keyed spread-spectrum perturbations distributed across carefully selected time-frequency regions prior to waveform synthesis. By performing watermarking before vocoder inference, MelShield remains plug-and-play for Mel-conditioned TTS architectures and does not require modification or retraining of the underlying TTS generation vocoder, such as DiffWave and HiFi-GAN. Moreover, the multi-user keyed construction enables scalable user-specific attribution, while the keyed verification mechanism limits unauthorized decoding, thereby reducing the risk of large-scale extractor probing and adversarial analysis. Extensive experiments on DiffWave and HiFi-GAN demonstrate that MelShield achieves reliable watermark extraction, approaching 100\% bit accuracy, even under signal distortions, e.g., compression and additive noise, while preserving high perceptual audio quality.
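The embedding idea is classic keyed spread spectrum applied to the mel-spectrogram: each payload bit modulates a pseudo-random pattern derived from a secret key, added at low energy, and extraction correlates against the same patterns. A minimal sketch on a random spectrogram (the paper's region selection, payload coding, and vocoder effects are omitted):

```python
import numpy as np

def keyed_patterns(key, n_bits, shape):
    rng = np.random.default_rng(key)       # the secret key seeds the spreading codes
    return rng.choice([-1.0, 1.0], size=(n_bits, *shape))

def embed(mel, bits, key, alpha=0.05):
    """Add one low-energy +/- pattern per payload bit (bit 0 -> negative pattern)."""
    patterns = keyed_patterns(key, len(bits), mel.shape)
    signs = 2 * np.asarray(bits) - 1
    return mel + alpha * np.tensordot(signs, patterns, axes=1)

def extract(mel, n_bits, key):
    """Correlate with each keyed pattern; positive correlation decodes as bit 1."""
    patterns = keyed_patterns(key, n_bits, mel.shape)
    corr = np.tensordot(patterns, mel, axes=([1, 2], [0, 1]))
    return (corr > 0).astype(int)

rng = np.random.default_rng(0)
mel = rng.normal(size=(80, 400))            # placeholder mel-spectrogram
payload = [1, 0, 1, 1, 0, 0, 1, 0]
marked = embed(mel, payload, key=1234)
noisy = marked + 0.01 * rng.normal(size=mel.shape)   # mild distortion
print(extract(noisy, len(payload), key=1234), payload)
```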
0
0
cs.SD 2026-05-04

Closed-loop EEG adapts music to real-time emotions

MindMelody: A Closed-Loop EEG-Driven System for Personalized Music Intervention

Brain signals are decoded into valence and arousal states that guide music synthesis with continuous feedback, yielding better emotional alignment and higher perceived helpfulness.

Figure from the paper full image
abstract click to expand
Driven by the escalating global burden of mental health conditions, music-based interventions have attracted significant attention as a non-invasive, cost-effective modality for emotion regulation and psychological stress relief. However, current digital music services rely on static preferences and fail to adapt to users' instantaneous psychological states. Furthermore, directly mapping electroencephalography (EEG) to music generation remains challenging due to severe paired-data scarcity and a lack of interpretability. To address these limitations, we propose MindMelody, a fully functional, closed-loop real-time system for EEG-driven personalized music intervention. MindMelody introduces an emotion-mediated semantic bridge. Specifically, a hybrid Transformer-GNN first decodes real-time EEG signals into global Valence-Arousal states and local temporal affect trajectories. These states are then fed into a Retrieval-Augmented Generation (RAG)-equipped Large Language Model (LLM) to formulate structured intervention plans. Subsequently, a novel Hierarchical EEG Controller injects global affect prefixes and local temporal guidance into a pretrained music backbone, enabling fine-grained controllable audio synthesis. Crucially, the system incorporates a continuous feedback loop that updates generation parameters on the fly based on the user's evolving EEG dynamics. Extensive experiments show that MindMelody improves control adherence and emotional alignment, and receives higher perceived helpfulness in a short-term listening setting, suggesting its promise as an adaptive affect-aware music generation framework.
0
0
cs.SD 2026-05-04

Transformer turns music into 3D conducting gestures

MG-Former: A Transformer-Based Framework for Music-Driven 3D Conducting Gesture Generation

TransConductor predicts detailed body poses from audio to match beats and structure, beating prior methods in alignment tests.

Figure from the paper full image
abstract click to expand
Generating expressive conducting gestures from music is a challenging cross-modal motion synthesis problem: the output must follow long-range musical structure, preserve beat-level synchronization, and remain plausible as a fine-grained 3D human performance. Existing conducting-motion studies are often limited by sparse pose representations, small-scale data, or evaluation protocols that do not directly measure whether music and gesture are mutually aligned. This paper presents TransConductor, a Transformer-based framework for music-driven conducting gesture generation. We introduce ConductorMotion, a SMPL-parameter data construction pipeline that recovers detailed body motion from conducting videos and forms a dataset targeted at professional conducting gestures. Given acoustic descriptors extracted from audio and an initial pose, TransConductor uses a Trans-Temporal Music Encoder and a Trans-Temporal Conducting Gesture Decoder to autoregressively predict SMPL pose parameters. To better assess artistic correspondence, we further build a retrieval-based evaluation model that embeds music and gestures into a shared space and yields FID, modality distance, multi-modality distance, and diversity metrics. Experiments show that TransConductor outperforms dance-generation and conducting-generation baselines, while ablations verify the benefits of the Transformer backbone and the proposed alignment loss.
0
0
cs.SD 2026-05-04

Adversarial head erases script leakage from speaker embeddings

LASE: Language-Adversarial Speaker Encoding for Indic Cross-Script Identity Preservation

Trained on synthetic pairs, LASE drives cross-script cosine gap to zero and matches baselines in diarization with far less data.

Figure from the paper full image
abstract click to expand
A speaker encoder used in multilingual voice cloning should treat the same speaker identically regardless of which script the audio was uttered in. Off-the-shelf encoders do not, and the failure is accent-conditional. On a 1043-pair Western-accented voice corpus across English, Hindi, Telugu, and Tamil, WavLM-base-plus-sv loses 0.082 absolute cosine similarity when the same voice changes script and ECAPA-TDNN loses 0.105. On a 1369-pair Indian-accented voice corpus, the gap shrinks to 0.006 (WavLM-SV) and 0.044 (ECAPA-TDNN). The leak is largest where it matters most for cross-script TTS: when a system projects a non-Indic-trained voice into Indic scripts. We present LASE (Language-Adversarial Speaker Encoder), a small projection head over frozen WavLM-base-plus trained with two losses: a supervised contrastive loss over voice identity, and a gradient-reversal cross-entropy against a 4-language classifier that pushes the embedding to be language-uninformative while remaining speaker-informative. Trained on 1118 quality-gated cross-script pairs synthesised from 8 commercial multilingual voices, LASE's residual gap is consistent with zero on both corpora (Delta = 0.013 Western, Delta = 0.026 Indian; both bootstrap 95% CIs include zero) and amplifies the cross-script-vs-floor margin 2.4-2.7x over both baselines. An ECAPA+GRL ablation shows the GRL objective improves either backbone but the WavLM choice contributes too. In synthetic multi-speaker diarisation, LASE matches ECAPA-TDNN on cross-script speaker recall (0.788 vs 0.789) with ~100x less training data. We release the r1 checkpoint, both corpora, and the bootstrap recipe.
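The gradient-reversal objective mentioned here is straightforward to express in PyTorch: the forward pass is the identity, while the backward pass flips (and scales) the gradient, so the encoder is pushed to make the language classifier fail while the speaker objective is optimized normally. A minimal sketch (the surrounding encoder, contrastive loss, and classifier are not shown):

```python
import torch

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; multiplies the gradient by -lambda in the backward pass."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None

def grad_reverse(x, lam=1.0):
    return GradReverse.apply(x, lam)

# Toy check: the gradient through the reversal layer comes out negated and scaled
x = torch.randn(4, 8, requires_grad=True)
y = grad_reverse(x, lam=0.5).sum()
y.backward()
print(x.grad[0, 0])        # -0.5 instead of the usual 1.0
```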
0
0
cs.SD 2026-05-04

Top models reach only 68% on medical audio questions

MedMosaic: A Challenging Large Scale Benchmark of Diverse Medical Audio

MedMosaic dataset of 46k questions tests AI on real sounds, synthetic voices, and conversations, revealing persistent reasoning limits.

Figure from the paper full image
abstract click to expand
We present MedMosaic, a medical audio question-answering dataset designed to benchmark language and audio reasoning models under realistic clinical constraints. Medical audio data is difficult to collect due to privacy regulations and high annotation costs arising from domain expertise. Thus, existing benchmarks tend to underrepresent complex medical audio scenarios. To address these challenges, MedMosaic features a diverse range of medical audio types, including condition-related physiological sounds, carefully constructed synthetic voices to mimic speech with artifacts as well as real short and long length clinical conversations to model varying context lengths. The dataset also features a total of 46,701 question-answer pairs, spanning categories such as multiple-choice, sequential multi-turn, and open-ended question-answers, enabling systematic evaluation of multi-hop reasoning and answer generation capabilities. Benchmarking 13 audio and multimodal reasoning models reveals that reasoning remains challenging for all evaluated systems, with substantial performance variation across question types. In particular, even state-of-the-art model like Gemini-2.5-pro can only achieve 68.1% accuracy approximately. These findings underscore persistent limitations in medical reasoning and highlight the need for more robust, domain-specific multimodal reasoning models.
0
0
cs.SD 2026-05-04

Filtered generative RIRs halve speaker distance errors

Towards Improving Speaker Distance Estimation through Generative Impulse Response Augmentation

Augmenting sparse datasets with quality-checked impulse responses improves model accuracy on medium and long distances in simulated rooms.

abstract click to expand
The Room Acoustics and Speaker Distance Estimation (SDE) Challenge at ICASSP 2025 explores the effectiveness of augmented room impulse response (RIR) data for improving SDE model performance. This challenge at GenDARA involves generating RIRs to supplement sparse datasets and fine-tuning SDE models with the augmented data. We employ the open-source fast diffuse room impulse response generator (FastRIR) conditioned only on speaker and listener locations. We design a quality filter to ensure generated RIR alignment with challenge RIRs, and hyperparameter optimization is employed for model fine-tuning. Our approach reduces the mean absolute error (MAE) of the five positions from 1.66m to 0.6m for GWA rooms and from 2.18m to 0.69m for Treble rooms, with results demonstrating that the augmentation approach significantly improves estimation accuracy, particularly at medium to long distances.
1 0
0
cs.SD 2026-05-04

Joint audio and label generation from video reaches 75% onset accuracy

MMAudio-LABEL: Audio Event Labeling via Audio Generation for Silent Video

The model learns sound and its timing together to avoid error buildup from separate detection steps.

Figure from the paper full image
abstract click to expand
Recent advances in multimodal generation have enabled high-quality audio generation from silent videos. Practical applications, such as sound production, demand not only the generated audio but also explicit sound event labels detailing the type and timing of sounds. One straightforward approach involves applying a standard sound event detection to the generated audio. However, this post-hoc pipeline is inherently limited, as it is prone to error accumulation. To address this limitation, we propose MMAudio-LABEL (LAtent-Based Event Labeling), an event-aware audio generation framework built on a foundational audio generation model as its backbone that jointly generates audio and frame-aligned sound event predictions from silent videos. We evaluate our method on the Greatest Hits dataset for onset detection and 17-class material classification. Our approach improves onset-detection accuracy from 46.7% to 75.0% and material-classification accuracy from 40.6% to 61.0% over baselines. These results suggest that jointly learning audio generation and event prediction enables a more interpretable and practical video-to-audio synthesis.
0
0
cs.SD 2026-05-04

Pretrained video-to-audio model estimates room acoustics

MMAudioReverbs: Video-Guided Acoustic Modeling for Dereverberation and Room Impulse Response Estimation

Fine-tuning without architecture changes turns one model into a tool for both dereverberation and impulse response prediction.

Figure from the paper full image
abstract click to expand
Although recent video-to-audio (V2A) models excelled at synthesizing semantically plausible sounds from visual inputs, they do not explicitly model room-acoustic effects such as reverberation or room impulse responses (RIRs), and thus offer limited controllability over these effects. However, we hypothesize that such V2A models implicitly have semantic knowledge of the relationship between spatial audio and the corresponding vision cues. In this paper, we revisit a V2A model for the sake of the above, and propose the way to utilize the pretrained model as prior for physically grounded room-acoustic processing. Based on one of the state-of-the-art V2A models, MMAudio, we propose MMAudioReverbs that is a unified framework dealing with i) dereverberation and ii) room impulse response (RIR) estimation without network architectural modification, and fine-tuned on a small dataset. Experimental results showed that audio and visual cues respectively have advantage depending on the type of physical room acoustics. It implies that foundation V2A models can be used for physically grounded room-acoustic analysis.
0
0
cs.SD 2026-05-04

Single model masters both timing and structure in music

GaMMA: Towards Joint Global-Temporal Music Understanding in Large Multimodal Models

GaMMA routes music signals through expert encoders and trains in stages to reach new accuracy levels on benchmarks that test global and temporal understanding.

abstract click to expand
In this paper, we propose GaMMA, a state-of-the-art (SoTA) large multimodal model (LMM) designed to achieve comprehensive musical content understanding. GaMMA inherits the streamlined encoder-decoder design of LLaVA, enabling effective cross-modal learning between music and language. By incorporating audio encoders in a mixture-of-experts manner, GaMMA effectively unifies both time-series and non-time-series music understanding tasks within one set of parameters. Our approach combines carefully curated datasets at scale with a progressive training pipeline, effectively pushing the boundaries of music understanding via pretraining, supervised fine-tuning (SFT), and reinforcement learning (RL). To comprehensively assess both temporal and non-temporal capability of music LMMs, we introduce MusicBench, the largest music-oriented benchmark, comprising 3,739 human-curated multiple-choice questions covering diverse aspects of musical understanding. Extensive experiments demonstrate that GaMMA establishes new SoTA in the music domain, achieving 79.1% accuracy on MuchoMusic, 79.3% on MusicBench-Temporal, and 81.3% on MusicBench-Global, consistently outperforming previous methods.
0
0
cs.SD 2026-05-04

One-step sampling matches multi-step audio quality at 8.5x speed

Fast Text-to-Audio Generation with One-Step Sampling via Energy-Scoring and Auxiliary Contextual Representation Distillation

An energy-scoring head and distillation from a masked autoregressive model beat prior fast baselines while closing the gap to slower top systems on AudioCaps.

Figure from the paper full image
abstract click to expand
Autoregressive (AR) models with diffusion heads have recently achieved strong text-to-audio performance, yet their iterative decoding and multi-step sampling process introduce high-latency issues. To address this bottleneck, we propose a one-step sampling framework that combines an energy-distance training objective with representation-level distillation. An energy-scoring head maps Gaussian noise directly to audio latents in one step, eliminating the need for a costly recursive diffusion sampling process, while distillation from a masked autoregressive (MAR) text-to-audio model preserves the strong conditioning learned during diffusion training. On the AudioCaps benchmark, our method consistently outperforms prior one-step baselines such as ConsistencyTTA, SoundCTM, AudioLCM and AudioTurbo, on both objective and subjective metrics, while substantially narrowing the quality gap to AR diffusion systems with multi-step sampling. Compared to the state-of-the-art AR diffusion system, IMPACT, our approach achieves up to $8.5$x faster batch inference with highly competitive audio quality. These results demonstrate that combining energy-distance training with representation-level distillation provides an effective recipe for fast, high-quality text-to-audio synthesis.
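An energy-distance objective of the kind referenced here compares two sample sets through pairwise distances: 2·E||x−y|| − E||x−x'|| − E||y−y'||. A minimal numpy sketch on placeholder latents (the paper's energy-scoring head and text conditioning are not reproduced, and the estimator below keeps the diagonal terms for simplicity):

```python
import numpy as np

def pairwise_mean_dist(a, b):
    """Mean Euclidean distance between all pairs drawn from a and b."""
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    return d.mean()

def energy_distance(x, y):
    """Energy distance: 2 E||x - y|| - E||x - x'|| - E||y - y'||."""
    return 2 * pairwise_mean_dist(x, y) - pairwise_mean_dist(x, x) - pairwise_mean_dist(y, y)

rng = np.random.default_rng(0)
real_latents = rng.normal(size=(64, 16))            # stand-in for target audio latents
gen_latents = rng.normal(loc=0.5, size=(64, 16))    # stand-in for one-step generator output
print(energy_distance(gen_latents, real_latents))   # smaller is better
print(energy_distance(real_latents, real_latents))  # 0 when the two sets are identical
```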
0
0
cs.SD 2026-05-01

New pretraining creates encoder that spots voice deepfakes more reliably

Alethia: A Foundational Encoder for Voice Deepfakes

Bottleneck masked embeddings and flow-matching spectrogram reconstruction deliver better robustness and zero-shot generalization across 56 benchmark datasets spanning five tasks.

Figure from the paper full image
abstract click to expand
Existing voice deepfake detection and localization models rely heavily on representations extracted from speech foundation models (SFMs). However, downstream finetuning has now reached a state of diminishing returns. In this paper, we shift the focus to pretraining and propose a novel recipe that combines bottleneck masked embedding prediction with flow-matching based spectrogram reconstruction. The outcome, Alethia, is the first foundational audio encoder for various voice deepfake detection and localization tasks. We evaluate on $5$ different tasks with $56$ benchmark datasets, and note Alethia significantly outperforms state-of-the-art SFMs with superior robustness to real-world perturbations and zero-shot generalization to unseen domains (e.g., singing deepfakes). We also demonstrate the limitation of discrete targets in masked token prediction, and show the importance of continuous embedding prediction and generative pretraining for capturing deepfake artifacts.
0
0
cs.SD 2026-05-01

Accent conversion moves from rule-based DSP to neural flexibility

Accent Conversion: A Problem-Driven Survey of Sociolinguistic and Technical Constraints

The field's evolution addresses data alignment, representation disentanglement, and resource scarcity, with applications varying in how much accent is modified versus how much speaker identity is preserved.

abstract click to expand
Accent conversion has rapidly progressed alongside growing interest in improving global cross-cultural communication. This survey presents an overview of the evolution of accent conversion methodologies, analyzing how the field has developed in response to fundamental challenges related to data alignment, representation disentanglement, and resource scarcity. We trace the progression from early rule-based digital signal processing approaches such as spectral manipulation and formant-based analysis to modern neural architectures capable of flexible and reference-free accent transformation. In addition, the survey situates accent conversion within its linguistic foundations and examines how different application requirements impose varying constraints on the balance between accent modification and speaker identity preservation. Finally, it reviews commonly used speech datasets and evaluation methodologies, identifies persistent challenges, and outlines directions for future research aimed at achieving more controllable and perceptually consistent accent conversion.
0
0
cs.SD 2026-05-01

Model predicts severe stuttering events from prior three seconds

Predicting Upcoming Stuttering Events from Three-Second Audio: Stratified Evaluation Reveals Severity-Selective Precursors, and the Model Deploys Fully On-Device

Overall accuracy modest at 0.58 AUC, but blocks and sound repetitions reach 0.60-0.62 while fillers stay at chance; runs locally on phones.

Figure from the paper full image
abstract click to expand
Audio-based stuttering systems to date have been trained for detection -- what disfluency is present now -- leaving prediction, the capability needed for closed-loop intervention, unstudied at deployable scale. We train a 616K-parameter CNN on SEP-28k (Apple, 20,131 three-second clips) to predict whether the next contiguous clip contains any disfluency. (1) Severity-selective precursor signal: on the episode-grouped test set, aggregate preblock AUC is modest (0.581 [0.542, 0.619]), but stratifying by upcoming event type reveals concentration on clinically severe events -- blocks 0.601 [0.554, 0.651] and sound repetitions 0.617 [0.567, 0.667] both exclude chance, while fillers (0.45) and word repetitions (0.49) are at chance. The aggregate objective converges to a severity-selective predictor because severe events carry prosodic precursors; fillers do not. (2) Cross-population transfer: without fine-tuning, the same checkpoint applied to 1,024 pediatric Children-Who-Stutter utterances (FluencyBank Teaching) attains AUC 0.674 detection and 0.655 prediction; DisfluencySpeech and LibriStutter reach 0.58-0.60 AUC. (3) Deployable on-device: lossless export to CoreML (1.19 MB), ONNX (40 KB), TFLite. Neural-Engine latency per 3 s window: 0.25 ms (iPhone 17 Pro Max, A19 Pro) to 0.55 ms (iPhone SE 3rd-gen and M1 Max). A 4 Hz streaming simulation uses 0.54% of the real-time budget. Platt-calibrated outputs (test ECE 0.010, from 0.177 raw). Five negative ablations -- output-level Future-Guided Learning, multi-clip GRU, time-axis concatenation, asymmetric focal loss, direct block-targeted training -- none improved over the vanilla baseline.
0
0
cs.SD 2026-05-01

LLM phoneme edits from under 10 utterances cut accented ASR errors

Few-Shot Accent Synthesis for ASR with LLM-Guided Phoneme Editing

Synthetic accented speech generated this way lowers word error rates on real tests more than random phoneme changes.

Figure from the paper full image
abstract click to expand
Accented automatic speech recognition (ASR) often degrades due to the limited availability of accented training data. Prior work has explored accent modeling in low-resource settings, but existing approaches typically require minutes to hours of labeled speech, which may still be impractical for truly scarce accent scenarios. We propose a pipeline that adapts a text-to-speech (TTS) decoder to a target-accent speaker using fewer than ten reference utterances and employs large language model (LLM)-based phoneme editing to generate accent-conditioned pronunciations. The resulting synthetic speech is used to fine-tune a self-supervised ASR model. Experiments demonstrate consistent word error rate (WER) reductions on real accented speech, including cross-speaker evaluation and ultra-low data regimes. A matched-rate random phoneme baseline shows that phoneme-space perturbation itself is a strong form of augmentation, while LLM-guided edits provide additional gains through accent-conditioned structure.
1 0
0
cs.SD 2026-04-30

Non-speech audio reveals spurious correlations in speech data

A Toolkit for Detecting Spurious Correlations in Speech Datasets

A classifier trained only on silence and noise flags when target labels leak through recording artifacts.

Figure from the paper full image
abstract click to expand
We introduce a toolkit for uncovering spurious correlations between recording characteristics and target class in speech datasets. Spurious correlations may arise due to heterogeneous recording conditions, a common scenario for health-related datasets. When present both in the training and test data, these correlations result in an overestimation of the system performance -- a dangerous situation, specially in high-stakes application where systems are required to satisfy minimum performance requirements. Our toolkit implements a diagnostic method based on the detection of the target class using only the non-speech regions in the audio. Better than chance performance at this task indicates that information about the target class can be extracted from the non-speech regions, flagging the presence of spurious correlations. The toolkit is publicly available for research use.
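The diagnostic reduces to a simple check: train a classifier on features from non-speech regions only, and any better-than-chance AUC flags leakage of the target label through recording conditions. A minimal sketch with synthetic features standing in for non-speech acoustic descriptors (the released toolkit handles voice activity detection, feature extraction, and a proper evaluation protocol):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n = 400
labels = rng.integers(0, 2, size=n)       # target class (e.g., patient / control)

# Placeholder non-speech features; a small label-correlated channel mimics a
# recording-condition confound (e.g., different microphones per site)
features = rng.normal(size=(n, 10))
features[:, 0] += 0.8 * labels

probs = cross_val_predict(LogisticRegression(max_iter=1000), features, labels,
                          cv=5, method="predict_proba")[:, 1]
auc = roc_auc_score(labels, probs)
print(f"non-speech probe AUC = {auc:.2f}  (chance = 0.50; higher flags leakage)")
```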
1 0
0
cs.SD 2026-04-30

Dictionary learning cleans low-frequency room impulse responses

Full band denoising of room impulse response in the wavelet domain with dictionary learning

Adapting error tolerance with an exponential decay model yields better acoustic parameter estimates than standard wavelet thresholding.

abstract click to expand
Conventional wavelet-domain methods for room impulse response denoising rely on thresholding detail coefficients, which is unsuited for low frequencies. In this work, we introduce a wavelet-based post-processing algorithm that extends denoising to approximation coefficients by means of sparse dictionary learning with a time-varying error tolerance. The proposed method leverages an exponential decay envelope model to adapt reconstruction accuracy according to the local signal-to-noise ratio. This approach significantly improves low-frequency denoising of synthetic and measured room impulse responses compared to the baseline method, leading to more accurate estimation of acoustic parameters such as decay time.
1 0
0
cs.SD 2026-04-30

Diffusion reconstruction boosts audio deepfake detection on new attacks

Diffusion Reconstruction towards Generalizable Audio Deepfake Detection

Training on reconstructed hard samples equips models to identify deepfakes from generators not encountered before.

Figure from the paper full image
abstract click to expand
Achieving robust generalization against unseen attacks remains a challenge in Audio Deepfake Detection (ADD), driven by the rapid evolution of generative models. To address this, we propose a framework centered on hard sample classification. The core idea is that a model capable of distinguishing challenging hard samples is inherently equipped to handle simpler cases effectively. We investigate multiple reconstruction paradigms, identifying the diffusion-based method as optimal for generating hard samples. Furthermore, we leverage multi-layer feature aggregation and introduce a Regularization-Assisted Contrastive Learning (RACL) objective to enhance generalizability. Experiments demonstrate the superior generalization of our approach, with our best model achieving a significant reduction in the average Equal Error Rate (EER) compared to the baseline.
1 0
0
cs.SD 2026-04-30

Recurrence patterns in speech detect depression with AUC 0.689

Recurrence-Based Nonlinear Vocal Dynamics as Digital Biomarkers for Depression Detection from Conversational Speech

Nonlinear analysis of vocal trajectories yields biomarkers that outperform static acoustic baselines with p=0.004 significance.

Figure from the paper full image
abstract click to expand
Digital biomarkers for depression have largely relied on static acoustic descriptors, pooled summary statistics, or conventional machine learning representations. Such approaches may miss nonlinear temporal organization embedded in conversational vocal dynamics. We hypothesized that depression is associated with altered recurrence structure in vocal state trajectories, reflecting changes in how the vocal system revisits acoustic states over time. Using the depression subset of the DAIC-WOZ corpus with 142 labeled participants, we modeled frame-level COVAREP trajectories as nonlinear dynamical systems and derived recurrence-based biomarkers from 74 vocal channels. Logistic regression with feature selection and stratified cross-validation evaluated classification performance. Recurrence-based biomarkers achieved a mean cross-validated AUC of 0.689, exceeding static acoustic baselines, entropy-dynamics features, Hurst exponent features, determinism features, and Lyapunov-like instability proxies. Permutation testing indicated statistical significance with $p=0.004$. Pooled cross-validated predictions yielded AUC 0.665 with a 95\% bootstrap confidence interval of [0.568, 0.758]. These findings suggest that depression may be characterized by altered recurrence structure in conversational vocal dynamics and support nonlinear state-space analysis as a promising direction for digital psychiatric biomarkers.
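Recurrence analysis of this kind compares all pairs of points along the vocal feature trajectory against a radius: the recurrence rate is the fraction of pairs that fall within it. A minimal sketch on a placeholder frame-level trajectory (the paper additionally uses delay embedding and further recurrence quantification measures):

```python
import numpy as np

def recurrence_rate(trajectory, radius):
    """Fraction of frame pairs whose feature vectors lie within `radius` of each other."""
    dists = np.linalg.norm(trajectory[:, None, :] - trajectory[None, :, :], axis=-1)
    recur = dists < radius
    np.fill_diagonal(recur, False)      # ignore trivial self-recurrences
    n = len(trajectory)
    return recur.sum() / (n * (n - 1))

rng = np.random.default_rng(0)
frames = rng.normal(size=(300, 5))      # placeholder frame-level vocal features
frames[:, 0] += 0.5 * np.sin(np.linspace(0, 20, 300))   # mild periodic structure
print(f"recurrence rate: {recurrence_rate(frames, radius=2.0):.3f}")
```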
0
0
cs.SD 2026-04-29

3D hierarchy generates orchestral music with harmony control

SymphonyGen: 3D Hierarchical Orchestral Generation with Controllable Harmony Skeleton

A cascading decoder plus beat-quantized skeleton yields cleaner harmony and higher listener preference than prior models.

abstract click to expand
Generating symphonic music requires simultaneously managing high-level structural form and dense, multi-track orchestration. Existing symbolic models often struggle with a "complexity-control imbalance", in which scaling bottlenecks limit long-term granular steerability. We present SymphonyGen, a 3D hierarchical framework for contemporary cinematic orchestration. SymphonyGen employs a cascading decoder architecture that decomposes the Bar, Track, and Event axes, improving computational efficiency and scalability over conventional 1D or 2D models. We introduce "short-score" conditioning via a beat-quantized multi-voice harmony skeleton, enabling outline control while preserving textural diversity. The model is further refined using Group Relative Policy Optimization (GRPO) with a cross-modal audio-perceptual reward, aligning symbolic output with modern acoustic expectations. Additionally, we implement a dissonance-averse sampling algorithm to suppress unintended tonal clashes during inference. Objective evaluations show that both reinforcement learning and dissonance-averse sampling effectively enhance harmonic cleanliness while maintaining melodic expression. Subjective evaluations demonstrate that SymphonyGen outperforms baselines in musicality and preference for orchestral music generation. Demo page: https://symphonygen.github.io/
1 0
0
cs.SD 2026-04-29

PSP shows TTS accent rankings diverge from WER on Indic languages

PSP: An Interpretable Per-Dimension Accent Benchmark for Indic Text-to-Speech

Per-dimension scores on retroflex, aspiration and prosody reveal no system leads across all six accent features for Hindi, Telugu and Tamil.

Standard text-to-speech (TTS) evaluation measures intelligibility (WER, CER) and overall naturalness (MOS, UTMOS) but does not quantify accent. A synthesiser may score well on all four yet sound non-native on features that are phonemic in the target language. For Indic languages, these features include retroflex articulation, aspiration, vowel length, and the Tamil retroflex approximant (letter zha). We present PSP, the Phoneme Substitution Profile, an interpretable, per-phonological-dimension accent benchmark for Indic TTS. PSP decomposes accent into six complementary dimensions: retroflex collapse rate (RR), aspiration fidelity (AF), vowel-length fidelity (LF), Tamil-zha fidelity (ZF), Frechet Audio Distance (FAD), and prosodic signature divergence (PSD). The first four are measured via forced alignment plus native-speaker-centroid acoustic probes over Wav2Vec2-XLS-R layer-9 embeddings; the latter two are corpus-level distributional distances. In this v1 we benchmark four commercial and open-source systems (ElevenLabs v3, Cartesia Sonic-3, Sarvam Bulbul, Indic Parler-TTS) on Hindi, Telugu, and Tamil pilot sets, with a fifth system (Praxy Voice) included on all three languages, plus an R5->R6 case study on Telugu. Three findings: (i) retroflex collapse grows monotonically with phonological difficulty Hindi < Telugu < Tamil (~1%, ~40%, ~68%); (ii) PSP ordering diverges from WER ordering -- commercial WER-leaders do not uniformly lead on retroflex or prosodic fidelity; (iii) no single system is Pareto-optimal across all six dimensions. We release native reference centroids (500 clips per language), 1000-clip embeddings for FAD, 500-clip prosodic feature matrices for PSD, 300-utterance golden sets per language, scoring code under MIT, and centroids under CC-BY. Formal MOS-correlation is deferred to v2; v1 reports five internal-consistency signals plus a native-audio sanity check.
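One of the per-dimension scores, retroflex collapse rate, can be illustrated as: embed each segment the text marks as retroflex and check whether it sits closer to a native retroflex centroid or to the corresponding dental centroid. The cosine-distance rule below is a simplified stand-in for the paper's native-speaker-centroid probes; forced alignment and per-phone pooling are assumed to happen upstream.

```python
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def retroflex_collapse_rate(segment_embeddings, retroflex_centroid, dental_centroid):
    """Fraction of retroflex-target segments that land closer to the dental centroid.

    segment_embeddings: list of (D,) vectors, e.g. layer-9 Wav2Vec2-XLS-R features
                        pooled over each force-aligned retroflex phone (assumption).
    """
    if not segment_embeddings:
        return 0.0
    collapsed = sum(
        cosine(e, dental_centroid) > cosine(e, retroflex_centroid)
        for e in segment_embeddings
    )
    return collapsed / len(segment_embeddings)

# Toy example with random placeholder embeddings.
rng = np.random.default_rng(0)
retro_c, dental_c = rng.normal(size=64), rng.normal(size=64)
segs = [retro_c + 0.3 * rng.normal(size=64) for _ in range(8)] + \
       [dental_c + 0.3 * rng.normal(size=64) for _ in range(2)]
print(f"collapse rate: {retroflex_collapse_rate(segs, retro_c, dental_c):.2f}")
```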
1 0
0
cs.SD 2026-04-29

Frozen base TTS matches commercial Indic output via prompt recovery

Praxy Voice: Voice-Prompt Recovery + BUPS for Commercial-Class Indic TTS from a Frozen Non-Indic Base at Zero Commercial-Training-Data Cost

BUPS romanization, LoRA on the text predictor, and Config B sampling deliver low phoneme collapse on Telugu, Tamil, and Hindi with licensed data.

Commercial TTS systems produce near-native Indic audio, but the best open-source bases (Chatterbox, Indic Parler-TTS, IndicF5) trail them on measured phonological dimensions, and the most widely adopted multilingual base (Chatterbox, 23 languages) does not even tokenise Telugu or Tamil. We ask: what is the minimum intervention that brings such a non-Indic-native base to commercial-class output on Telugu, Tamil, and Hindi, without training a new acoustic decoder and without any commercial TTS training data? We combine three pieces: (1) BUPS, a Brahmic Unified Phoneme Space that deterministically romanises seven Indic scripts to ISO-15919 so Chatterbox's Latin tokeniser can process them; (2) a LoRA adapter on only the text-token predictor (Chatterbox's t3), trained on ~1,220h of licensed Indic audio with a Hindi-proxy language_id; (3) a voice-prompt recovery recipe -- an 8-11s same-language reference clip plus three sampling overrides (exaggeration 0.7, temperature 0.6, min_p 0.1; "Config B") -- that recovers commercial-class acoustic output with no acoustic-decoder training. On Hindi, the LoRA regresses accuracy and we instead use vanilla Chatterbox + Config B, giving a two-branch deployment. Evaluated on 10-utterance pilot sets with the companion PSP benchmark, Praxy Voice matches or slightly leads commercial baselines: 26.7% retroflex collapse on Telugu (vs Sarvam Bulbul 33.3%), 71% Tamil-zha collapse (vs commercial trio's 86%), 0.025 LLM-WER on Hindi (tied with Cartesia Sonic-3). For intra-sentential code-mix we add a third branch (IndicF5 + native-script transliteration) that drops code-mix LLM-WER from 0.80-0.85 to 0.14-0.27 across Hi/Te/Ta. We release R6 LoRA weights (Apache-2.0), inference code and router (MIT), and a Gradio demo.
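The "Config B" recipe is concrete enough to write down as a small configuration plus a wrapper around whatever synthesis call the base model exposes. Only the three sampling overrides and the reference-clip length come from the abstract; the `synthesize`/`generate` call below is hypothetical and the real Chatterbox API may differ.

```python
from dataclasses import dataclass

@dataclass
class ConfigB:
    """Sampling overrides reported in the abstract for voice-prompt recovery."""
    exaggeration: float = 0.7
    temperature: float = 0.6
    min_p: float = 0.1
    ref_clip_seconds: tuple = (8, 11)   # same-language reference prompt length

def synthesize_with_recovery(tts_model, text_iso15919, reference_wav, cfg=ConfigB()):
    """Hypothetical wrapper: pass BUPS-romanised (ISO-15919) text plus a short
    same-language reference clip, overriding the base model's sampling settings.
    `tts_model.generate` is a placeholder, not a documented Chatterbox method."""
    return tts_model.generate(
        text=text_iso15919,
        audio_prompt=reference_wav,
        exaggeration=cfg.exaggeration,
        temperature=cfg.temperature,
        min_p=cfg.min_p,
    )
```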
1 0
0
cs.SD 2026-04-29

Speaker-adaptive network lifts conversation emotion accuracy

ML-SAN: Multi-Level Speaker-Adaptive Network for Emotion Recognition in Conversations

Three-stage process calibrates audio-visual features to a neutral space, gates modalities by identity, and regularizes outputs, improving results on MELD and IEMOCAP.

To establish empathy with machines, it is essential to understand how human emotions change. However, research in multimodal emotion recognition often overlooks one problem: individual expressive traits vary significantly, so different people may express the same emotion differently. We see this in daily life: some people express "happiness" through their facial expressions and words, while others hide it or express it through their actions. Both are expressions of "happiness", but such differences in emotional expression remain difficult for machines to distinguish. Current emotion recognition is largely "static", using a single recognition model for every expressive style, and this simplification often hurts recognition results, especially in multi-turn dialogues. To address this problem, this paper introduces a novel Multi-Level Speaker-Adaptive Network (ML-SAN) that tackles the confusion caused by speaker identity information. ML-SAN does not simply attach a speaker's ID after recognition; instead, it employs a three-stage adaptive process. First, Input-level Calibration uses Feature-Level Linear Modulation (FiLM) to map the raw audio and visual features into a neutral, speaker-independent space. Then, Interaction-level Gating re-weights the trust placed in each modality (e.g., voice or facial features) based on the speaker's identity. Finally, Output-level Regularization maintains the consistency of speaker features in the latent space. Tests on the MELD and IEMOCAP datasets show that ML-SAN achieves better results, performs especially well on challenging tail sentiment categories, and better handles speaker diversity in real-world scenarios.
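The input-level calibration stage uses FiLM-style modulation: a small network maps the speaker embedding to per-channel scale and shift vectors applied to the audio or visual features. The sketch below shows that mechanism in isolation; the dimensions are illustrative and the rest of ML-SAN's three-stage pipeline is not reproduced here.

```python
import torch
import torch.nn as nn

class SpeakerFiLM(nn.Module):
    """FiLM-style calibration: speaker embedding -> (gamma, beta) -> modulated features."""
    def __init__(self, speaker_dim=128, feature_dim=256):
        super().__init__()
        self.to_gamma_beta = nn.Linear(speaker_dim, 2 * feature_dim)

    def forward(self, features, speaker_emb):
        # features:    (batch, time, feature_dim) audio or visual features
        # speaker_emb: (batch, speaker_dim)
        gamma, beta = self.to_gamma_beta(speaker_emb).chunk(2, dim=-1)
        return gamma.unsqueeze(1) * features + beta.unsqueeze(1)

film = SpeakerFiLM()
x = torch.randn(4, 50, 256)       # e.g. frame-level audio features
spk = torch.randn(4, 128)         # speaker identity embedding
print(film(x, spk).shape)         # torch.Size([4, 50, 256])
```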
0
0
cs.SD 2026-04-29

Feedback loops let AI instruments co-create music with humans

Huí Sù: Co-constructing a Dual Feedback Apparatus

Sù and Agentier recursively influence their own audio and control signals, turning performance into a negotiation of agency.

This performance presents a duet between two intelligent musical instruments, Sù (to trace back; to go upstream) and Agentier (playing on agentic clavier), and their human performers, connected through feedback loops. Rather than treating AI as a tool that responds predictably to input, both systems operate recursively, where past actions continuously influence future behaviour. The Sù operates in the audio space through latent representation. Its performer uses Make Noise 0-series synthesisers and MIDI controllers to work with a neural feedback synthesis system based on a RAVE model, with a latent feedback loop embedded within the model's internal structure. This allows the instrument to remember and reuse its own internal states, influencing ongoing sound generation through its recent sonic history. The Agentier functions in the control space. Its performer interacts with the system using a Roland S-1 synthesiser and Keith McMillen QuNeo touchpad, where control gestures are routed into a recurrent neural network that feeds back into the synthesis process. Through this feedback loop, the system actively shapes the evolution of control signals over time. Contrasting feedback in the audio and control domains, the performance explores shared agency, resistance, and negotiation between humans and intelligent musical systems. Musical phenomena are co-produced through the entangled states of interaction, rather than through pre-existing system configuration or fixed mappings.
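The latent feedback idea — the instrument re-ingesting its own recent internal states — can be pictured as mixing each newly encoded latent with a stored copy of the previous one before decoding. The encoder and decoder below are generic placeholders, not the actual RAVE model or the performers' patch.

```python
import numpy as np

class LatentFeedbackLoop:
    """Toy latent feedback: z_t = (1 - a) * encode(x_t) + a * z_{t-1}."""
    def __init__(self, encode, decode, feedback=0.6, latent_dim=16):
        self.encode, self.decode = encode, decode
        self.feedback = feedback
        self.z_prev = np.zeros(latent_dim)

    def step(self, audio_frame):
        z = (1.0 - self.feedback) * self.encode(audio_frame) + self.feedback * self.z_prev
        self.z_prev = z                      # the instrument "remembers" its own state
        return self.decode(z)

# Placeholder linear encoder/decoder standing in for a RAVE-like autoencoder.
W = np.random.randn(16, 512) * 0.01
loop = LatentFeedbackLoop(encode=lambda x: W @ x, decode=lambda z: W.T @ z)
out = loop.step(np.random.randn(512))        # one block of audio samples
print(out.shape)                             # (512,)
```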
0
0
cs.SD 2026-04-28

Models keep 60-72% of audio scores with no sound input

All That Glitters Is Not Audio: Rethinking Text Priors and Audio Reliance in Audio-Language Evaluation

Diagnostic tests show most benchmark questions can be answered from text alone or short audio fragments.

Large Audio-Language Models show consistent performance gains across speech and audio benchmarks, yet high scores may not reflect true auditory perception. If a model can answer questions without processing the acoustic signal, the benchmark fails as a measure of auditory understanding. We present a diagnostic framework using two axes: text prior, which measures answerability from text and general knowledge alone, and audio reliance, which assesses actual dependency on the acoustic signal. Evaluating eight LALMs across three benchmarks, we find that models retain 60-72% of their full audio scores even without any audio input. Moreover, among items that require audio, only 3.0-4.2% need the complete audio clip; the majority can be resolved using localized fragments. These findings challenge the assumption that benchmark performance equals robust audio understanding, and we conclude with practical guidelines for improving evaluation reliability and benchmark design.
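The two diagnostic axes reduce to simple score comparisons: run the model on each question with and without the audio and see how much of the full-audio score survives. The loop below is a schematic of that bookkeeping; `model.answer` is a placeholder for whatever inference call an LALM exposes, and the paper's fragment-level ablations are not reproduced here.

```python
def text_prior_and_audio_reliance(model, items):
    """items: list of dicts with 'question', 'audio', 'answer'.
    Returns (score_without_audio, score_with_audio, fraction_retained)."""
    correct_text_only, correct_full = 0, 0
    for item in items:
        # Text prior: answer from the question alone, no acoustic signal.
        pred_text = model.answer(question=item["question"], audio=None)
        # Full condition: the same question with the audio attached.
        pred_full = model.answer(question=item["question"], audio=item["audio"])
        correct_text_only += pred_text == item["answer"]
        correct_full += pred_full == item["answer"]
    n = len(items)
    text_score, full_score = correct_text_only / n, correct_full / n
    retained = text_score / full_score if full_score else 0.0
    return text_score, full_score, retained
```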
1 0
0
cs.SD 2026-04-28

Segment-level prediction cuts oversegmentation in chord recognition

An event-based sequence modeling approach to recognizing non-triad chords with oversegmentation minimization

Auto-regressive modeling at learned boundaries boosts accuracy on complex non-triad chords that current systems miss.

Automatic chord recognition (ACR) extracts time-aligned chord labels from music audio recordings. Despite recent advances, ACR still struggles with oversegmentation, data scarcity, and imbalance, especially in recognizing complex chords such as non-triads, which are underrepresented in existing datasets. To address these challenges, we reformulate ACR as a segment-level sequence-to-sequence prediction task, where chord sequences are predicted auto-regressively rather than frame by frame. This design mitigates excessive segmentation by detecting chord changes only at segment boundaries. We further introduce two types of token representations and an encoder pre-training method, both specifically designed for time-aligned chord modeling. Experimental results show that our model improves performance in both chord recognition and segmentation, with notable gains for complex and infrequent chord types. These findings demonstrate the effectiveness of segment-level sequence modeling, structured tokenization, and representation learning for advancing chord recognition systems.
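Moving from frame-level to segment-level prediction starts by collapsing runs of identical frame labels into (chord, start, duration) events, which a sequence model can then predict auto-regressively. The run-length encoding below illustrates that first step; the paper's token vocabularies and encoder pre-training are not reproduced.

```python
def frames_to_segments(frame_labels, hop_seconds=0.1):
    """Collapse per-frame chord labels into (chord, start, duration) events —
    the segment-level view a seq2seq model would predict auto-regressively."""
    segments = []
    start = 0
    for i in range(1, len(frame_labels) + 1):
        if i == len(frame_labels) or frame_labels[i] != frame_labels[start]:
            segments.append((frame_labels[start], start * hop_seconds,
                             (i - start) * hop_seconds))
            start = i
    return segments

frames = ["C:maj"] * 8 + ["A:min7"] * 6 + ["D:7"] * 4
print(frames_to_segments(frames))
# [('C:maj', 0.0, 0.8), ('A:min7', 0.8, 0.6), ('D:7', 1.4, 0.4)]
```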
1 0
0
cs.SD 2026-04-28

ASR models learn to skip uncertain words for higher reliability

RAS: a Reliability Oriented Metric for Automatic Speech Recognition

RAS metric, tuned to human judgment, lets systems abstain on doubtful segments while keeping word error rates competitive.

Automatic speech recognition systems often produce confident yet incorrect transcriptions under noisy or ambiguous conditions, which can be misleading for both users and downstream applications. Standard evaluation based on Word Error Rate focuses solely on accuracy and fails to capture transcription reliability. We introduce an abstention-aware transcription framework that enables ASR models to explicitly abstain from uncertain segments. To evaluate reliability under abstention, we propose RAS, a reliability-oriented metric that balances transcription informativeness and error aversion, with its trade-off parameter calibrated by human preference. We then train an abstention-aware ASR model through supervised bootstrapping followed by reinforcement learning. Our experiments demonstrate substantial improvements in transcription reliability while maintaining competitive accuracy.
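The abstract characterises RAS only as a balance between informativeness and error aversion with a human-calibrated trade-off parameter; the exact formula is not given. The function below is therefore one plausible instantiation under those constraints, stated as an assumption: correctly transcribed reference words count positively, errors are penalised by a factor lambda, and abstained spans contribute nothing.

```python
def ras_like_score(n_correct, n_errors, n_abstained, n_reference, lam=2.0):
    """A plausible reliability-oriented score (assumption, not the paper's formula).

    n_correct:   reference words transcribed correctly
    n_errors:    substitutions, insertions, and deletions outside abstained spans
    n_abstained: reference words the system declined to transcribe (scored neutrally)
    lam:         error-aversion weight, calibrated against human preference
    """
    return (n_correct - lam * n_errors) / max(n_reference, 1)

# An over-confident transcript vs. one that abstains on the doubtful segment.
print(ras_like_score(n_correct=8, n_errors=2, n_abstained=0, n_reference=10))  # 0.4
print(ras_like_score(n_correct=8, n_errors=0, n_abstained=2, n_reference=10))  # 0.8
```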
1 0
0
cs.SD 2026-04-28 2 theorems

One-step drifting field matches noisy speech to clean distributions

Speech Enhancement Based on Drifting Models

DriftSE evolves pushforward distributions in a single pass, outperforming multi-step diffusion baselines on VoiceBank-DEMAND without paired data.

We propose Speech Enhancement based on Drifting Models (DriftSE), a novel generative framework that formulates denoising as an equilibrium problem. Rather than relying on iterative sampling, DriftSE natively achieves one-step inference by evolving the pushforward distribution of a mapping function to directly match the clean speech distribution. This evolution is driven by a Drifting Field, a learned correction vector that guides samples toward the high-density regions of the clean distribution, which naturally facilitates training on unpaired data by matching distributions rather than paired samples. We investigate the framework under two formulations: a direct mapping from the noisy observation, and a stochastic conditional generative model from a Gaussian prior. Experiments on the VoiceBank-DEMAND benchmark demonstrate that DriftSE achieves high-fidelity enhancement in a single step, outperforming multi-step diffusion baselines and establishing a new paradigm for speech enhancement.
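The abstract describes one-step inference driven by a learned Drifting Field but does not give the training objective. The sketch below only illustrates the unpaired, distribution-matching setup it implies: a single-pass enhancer whose outputs are pushed toward the clean-speech distribution, with an MMD loss standing in for the Drifting Field update. This is a conceptual stand-in, not the paper's method.

```python
import torch
import torch.nn as nn

enhancer = nn.Sequential(              # one-step mapping from noisy to enhanced frames
    nn.Linear(257, 512), nn.ReLU(), nn.Linear(512, 257)
)

def rbf_mmd(x, y, sigma=1.0):
    """Maximum mean discrepancy between two unpaired batches (stand-in objective)."""
    def k(a, b):
        d = torch.cdist(a, b) ** 2
        return torch.exp(-d / (2 * sigma ** 2))
    return k(x, x).mean() + k(y, y).mean() - 2 * k(x, y).mean()

opt = torch.optim.Adam(enhancer.parameters(), lr=1e-4)
for _ in range(3):                               # toy loop with random spectra
    noisy = torch.rand(32, 257)                  # unpaired noisy magnitude frames
    clean = torch.rand(32, 257)                  # unpaired clean magnitude frames
    enhanced = enhancer(noisy)                   # single forward pass: one-step inference
    loss = rbf_mmd(enhanced, clean)              # match distributions, not paired samples
    opt.zero_grad()
    loss.backward()
    opt.step()
print("final toy loss:", float(loss))
```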
1 0
0
cs.SD 2026-04-27

CNN detects Hindi keywords at 91.79 percent accuracy

Keyword spotting using convolutional neural network for speech recognition in Hindi

MFCC features fed to a neural network enable efficient on-device spotting in spoken Hindi.

In this study, we investigate the application of keyword spotting (KWS) in the domain of Hindi speech recognition, utilizing a dataset comprising 40,000 audio samples. With a sampling rate of 44 kHz and an average duration of 1.9 seconds per sample, we focus on developing an efficient on-device KWS system tailored for user-specific queries. Leveraging Convolutional Neural Networks (CNNs) for classification, we employ feature engineering techniques to convert raw audio recordings into Mel Frequency Cepstral Coefficients (MFCCs) as an input for our network. Our experiments encompass various CNN architectures, exploring their efficacy in identifying predefined keywords within the continuous speech stream. Our CNN-based approach achieves a commendable accuracy rate of 91.79% through rigorous evaluation, demonstrating promising performance while ensuring computational efficiency and user-specific customization in Hindi speech recognition.
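The pipeline the abstract describes — fixed-length clips, MFCC features, a small CNN classifier — maps onto a few lines of standard tooling. The shapes and hyperparameters below (number of MFCCs, keyword count, layer sizes) are illustrative defaults, not the paper's configuration.

```python
import librosa
import torch
import torch.nn as nn

def clip_to_mfcc(path, sr=44100, duration=1.9, n_mfcc=13):
    """Load a clip, pad or trim it to a fixed length, and return an MFCC matrix."""
    y, _ = librosa.load(path, sr=sr, duration=duration)
    y = librosa.util.fix_length(y, size=int(sr * duration))
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)     # (n_mfcc, frames)

class KeywordCNN(nn.Module):
    """Small CNN over the MFCC 'image', one output per keyword class."""
    def __init__(self, n_keywords=10):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(1),
        )
        self.head = nn.Linear(32, n_keywords)

    def forward(self, mfcc):                  # mfcc: (batch, 1, n_mfcc, frames)
        return self.head(self.conv(mfcc).flatten(1))

model = KeywordCNN()
dummy = torch.randn(4, 1, 13, 164)            # ~1.9 s at 44.1 kHz with the default hop
print(model(dummy).shape)                     # torch.Size([4, 10])
```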
0

browse all of cs.SD → full archive · search · sub-categories