archive
Every paper Pith has read. Search by title, abstract, or pith.
375 papers in cs.SD · page 1
-
SpeakerLLM turns speaker verification into natural-language reasoning
SpeakerLLM: A Speaker-Specialized Audio-LLM for Speaker Understanding and Verification Reasoning
-
Drum MIDI becomes audio matching any reference timbre
Break-the-Beat! Controllable MIDI-to-Drum Audio Synthesis
-
Sonification lifts eye surgery event detection from 61 to 83 percent
Physics-Based iOCT Sonification for Real-time Interaction Awareness in Subretinal Injection
-
Calculus finds optimal vocabulary size for ASR
A Calculus-Based Framework for Determining Vocabulary Size in End-to-End ASR
-
Speech-text alignment refines pseudo-audio prompts for text-only ASR adaptation
Refining Pseudo-Audio Prompts with Speech-Text Alignment for Text-Only Domain Adaptation in LLM-Based ASR
-
Masked contrastive pairs beat audio reconstruction
AudioMosaic: Contrastive Masked Audio Representation Learning
-
Benchmark standardizes early Parkinson's speech detection
A Benchmark for Early-stage Parkinson's Disease Detection from Speech
-
General audio pretraining beats domain MAE for bioacoustics
Masked Autoencoders with Limited Data: Does It Work? A Fine-Grained Bioacoustics Case Study
-
AI agents speed creation of digital music instruments
Case Studies and Reflections on Agentic Software Engineering for Rapid Development of Digital Music Instruments
-
No voice agent tops 0.5 on both accuracy and experience
EVA-Bench: A New End-to-end Framework for Evaluating Voice Agents
-
Oscillatory memory gates audio AI to salient events
NAACA: Training-Free NeuroAuditory Attentive Cognitive Architecture with Oscillatory Working Memory for Salience-Driven Attention Gating
-
Two-stage system turns text into playable sheet music
Text2Score: Generating Sheet Music From Textual Prompts
-
PCA diffusion beats regression for symbolic drum audio
Seconds-Aligned PCA-DAC Latent Diffusion for Symbolic-to-Audio Drum Rendering
-
Hybrid Whisper model detects speaker confidence at 0.75 Macro-F1
A Semi-Supervised Framework for Speech Confidence Detection using Whisper
-
Singing conversion works on accompanied tracks without vocal cleanup
Poly-SVC: Polyphony-Aware Singing Voice Conversion with Harmonic Modeling
-
SMC dataset exposes tempo bias in state-of-the-art beat tracking models
The SMC Blind Spot: A Failure Mode Analysis of State-of-the-Art Beat Tracking
-
STRUM turns raw audio into playable rhythm charts at 0.84 F1 for drums
STRUM: A Spectral Transcription and Rhythm Understanding Model for End-to-End Generation of Playable Rhythm-Game Charts
-
Closed-loop AI boosts quality of long audio stories
AuDirector: A Self-Reflective Closed-Loop Framework for Immersive Audio Storytelling
-
Lanczos Krylov method matches exact EVD for adaptive diagonal loading
Adaptive Diagonal Loading using Krylov Subspaces for Robust Beamforming
-
Token swaps edit speaker identity in compressed audio
Exploring Token-Space Manipulation in Latent Audio Tokenizers
-
Codec keeps emotional cues in compressed speech
AffectCodec: Emotion-Preserving Neural Speech Codec for Expressive Speech Modeling
-
Multi-layer attention probes improve bioacoustic encoder evaluation
Multi-layer attentive probing improves transfer of audio representations for bioacoustics
-
Transformer predicts codec tokens to synthesize drums from MIDI grids
Drum Synthesis from Expressive Drum Grids via Neural Audio Codecs
-
Cold diffusion cleans reverb from drum signals
A Cold Diffusion Approach for Percussive Dereverberation
-
Acoustic priors sharpen timbre edits in polyphonic music
Polyphonia: Zero-Shot Timbre Transfer in Polyphonic Music with Acoustic-Informed Attention Calibration
-
APEX explains audio classifiers with four prototype views
APEX: Audio Prototype EXplanations for Classification Tasks
-
Disentangling power doubles audio generation convergence speed
PoDAR: Power-Disentangled Audio Representation for Generative Modeling
-
Deep models extract voice biomarkers for depression at 71% sensitivity
Voice Biomarkers for Depression and Anxiety
-
Separate reasoning cuts interference in audio-visual LLMs
Separate First, Fuse Later: Mitigating Cross-Modal Interference in Audio-Visual LLMs Reasoning with Modality-Specific Chain-of-Thought
-
System turns Chladni patterns into real-time sound with 99% accuracy
ChladniSonify: A Visual-Acoustic Mapping Method for Chladni Patterns in New Media Art Creation
-
Joint diffusion remixes timbres across all stems in a mixture at once
Remix the Timbre: Diffusion-Based Style Transfer Across Polyphonic Stems
-
Reddit music chats become 190k Deezer-grounded dialogues
Reddit2Deezer: A Scalable Dataset for Real-World Grounded Conversational Music Recommendation
-
Per-gender thresholds cut deepfake detector bias by 54-75%
Towards Trustworthy Audio Deepfake Detection: A Systematic Framework for Diagnosing and Mitigating Gender Bias
-
Audio-first search benchmark caps top model at 43%
Omni-DeepSearch: A Benchmark for Audio-Driven Omni-Modal Deep Search
-
Unison aligns motion, speech and sound in video generation
Unison: Harmonizing Motion, Speech, and Sound for Human-Centric Audio-Video Generation
-
DP segments audio to reset beamformer covariance on the fly
Online Segmented Beamforming via Dynamic Programming
-
Unsupervised tokens split bee hives by queen status
BeeVe: Unsupervised Acoustic State Discovery in Honey Bee Buzzing
-
Multi-scale dilated encoder improves closed-set speaker ID
TARNet: A Temporal-Aware Multi-Scale Architecture for Closed-Set Speaker Identification
-
Distance model switches from reverberation to delay with calibration
Dependence on Early and Late Reverberation of Single-Channel Speaker Distance Estimation
-
Decomposed stages yield better chord variety and rules compliance
A Decomposed Retrieval-Edit-Rerank Framework for Chord Generation
-
Audio-video models fail to keep physics consistent in transitions
Do Joint Audio-Video Generation Models Understand Physics?
-
MIST benchmark shows LLMs lag on voice IoT tasks
MIST: Multimodal Interactive Speech-based Tool-calling Conversational Assistants for Smart Homes
-
PianoCoRe combines sources into 157k aligned piano performances
PianoCoRe: Combined and Refined Piano MIDI Dataset
-
Self-alignment cuts audio token count 55% while keeping edit-distance search
PairAlign: A Framework for Sequence Tokenization via Self-Alignment with Applications to Audio Tokenization
-
Quantum spectrogram patches reach 0.87 AUROC in audio deepfake tests
Quantum Kernels for Audio Deepfake Detection Using Spectrogram Patch Features
-
Melody and rhythm show no diversity correlation across cultures
Do Melody and Rhythm Coevolve?
-
Input prosody alignment shrinks speech LLM modality gap
Minimizing Modality Gap from the Input Side: Your Speech LLM Can Be a Prosody-Aware Text LLM
-
0.4B model clones voices across 30 languages without transcripts
X-Voice: Enabling Everyone to Speak 30 Languages via Zero-Shot Cross-Lingual Voice Cloning