archive
Every paper Pith has read. Search by title, abstract, or pith.
375 papers in cs.SD · page 1
-
SpeakerLLM turns speaker verification into natural-language reasoning
SpeakerLLM: A Speaker-Specialized Audio-LLM for Speaker Understanding and Verification Reasoning
-
Drum MIDI becomes audio matching any reference timbre
Break-the-Beat! Controllable MIDI-to-Drum Audio Synthesis
-
Sonification lifts eye surgery event detection from 61 to 83 percent
Physics-Based iOCT Sonification for Real-time Interaction Awareness in Subretinal Injection
-
Calculus finds optimal vocabulary size for ASR
A Calculus-Based Framework for Determining Vocabulary Size in End-to-End ASR
-
Speech-text alignment refines pseudo-audio prompts for text-only ASR adaptation
Refining Pseudo-Audio Prompts with Speech-Text Alignment for Text-Only Domain Adaptation in LLM-Based ASR
-
Masked contrastive pairs beat audio reconstruction
AudioMosaic: Contrastive Masked Audio Representation Learning
-
Benchmark standardizes early Parkinson's speech detection
A Benchmark for Early-stage Parkinson's Disease Detection from Speech
-
General audio pretraining beats domain MAE for bioacoustics
Masked Autoencoders with Limited Data: Does It Work? A Fine-Grained Bioacoustics Case Study
-
AI agents speed creation of digital music instruments
Case Studies and Reflections on Agentic Software Engineering for Rapid Development of Digital Music Instruments
-
No voice agent tops 0.5 on both accuracy and experience
EVA-Bench: A New End-to-end Framework for Evaluating Voice Agents
-
Oscillatory memory gates audio AI to salient events
NAACA: Training-Free NeuroAuditory Attentive Cognitive Architecture with Oscillatory Working Memory for Salience-Driven Attention Gating
-
Two-stage system turns text into playable sheet music
Text2Score: Generating Sheet Music From Textual Prompts
-
PCA diffusion beats regression for symbolic drum audio
Seconds-Aligned PCA-DAC Latent Diffusion for Symbolic-to-Audio Drum Rendering
-
Hybrid Whisper model detects speaker confidence at 0.75 Macro-F1
A Semi-Supervised Framework for Speech Confidence Detection using Whisper
-
Singing conversion works on accompanied tracks without vocal cleanup
Poly-SVC: Polyphony-Aware Singing Voice Conversion with Harmonic Modeling
-
SMC dataset exposes tempo bias in state-of-the-art beat tracking models
The SMC Blind Spot: A Failure Mode Analysis of State-of-the-Art Beat Tracking
-
STRUM turns raw audio into playable rhythm charts at 0.84 F1 for drums
STRUM: A Spectral Transcription and Rhythm Understanding Model for End-to-End Generation of Playable Rhythm-Game Charts
-
Closed-loop AI boosts quality of long audio stories
AuDirector: A Self-Reflective Closed-Loop Framework for Immersive Audio Storytelling
-
Lanczos Krylov method matches exact EVD for adaptive diagonal loading
Adaptive Diagonal Loading using Krylov Subspaces for Robust Beamforming
-
Token swaps edit speaker identity in compressed audio
Exploring Token-Space Manipulation in Latent Audio Tokenizers
-
Codec keeps emotional cues in compressed speech
AffectCodec: Emotion-Preserving Neural Speech Codec for Expressive Speech Modeling
-
Multi-layer attention probes improve bioacoustic encoder evaluation
Multi-layer attentive probing improves transfer of audio representations for bioacoustics
-
Transformer predicts codec tokens to synthesize drums from MIDI grids
Drum Synthesis from Expressive Drum Grids via Neural Audio Codecs
-
Cold diffusion cleans reverb from drum signals
A Cold Diffusion Approach for Percussive Dereverberation
-
Acoustic priors sharpen timbre edits in polyphonic music
Polyphonia: Zero-Shot Timbre Transfer in Polyphonic Music with Acoustic-Informed Attention Calibration
-
APEX explains audio classifiers with four prototype views
APEX: Audio Prototype EXplanations for Classification Tasks
-
Disentangling power doubles audio generation convergence speed
PoDAR: Power-Disentangled Audio Representation for Generative Modeling
-
Deep models extract voice biomarkers for depression at 71% sensitivity
Voice Biomarkers for Depression and Anxiety
-
Separate reasoning cuts interference in audio-visual LLMs
Separate First, Fuse Later: Mitigating Cross-Modal Interference in Audio-Visual LLMs Reasoning with Modality-Specific Chain-of-Thought
-
System turns Chladni patterns into real-time sound with 99% accuracy
ChladniSonify: A Visual-Acoustic Mapping Method for Chladni Patterns in New Media Art Creation
-
Joint diffusion remixes timbres across all stems in a mixture at once
Remix the Timbre: Diffusion-Based Style Transfer Across Polyphonic Stems
-
Reddit music chats become 190k Deezer-grounded dialogues
Reddit2Deezer: A Scalable Dataset for Real-World Grounded Conversational Music Recommendation
-
Per-gender thresholds cut deepfake detector bias by 54-75%
Towards Trustworthy Audio Deepfake Detection: A Systematic Framework for Diagnosing and Mitigating Gender Bias
-
Audio-first search benchmark caps top model at 43%
Omni-DeepSearch: A Benchmark for Audio-Driven Omni-Modal Deep Search
-
Unison aligns motion, speech and sound in video generation
Unison: Harmonizing Motion, Speech, and Sound for Human-Centric Audio-Video Generation
-
DP segments audio to reset beamformer covariance on the fly
Online Segmented Beamforming via Dynamic Programming
-
Unsupervised tokens split bee hives by queen status
BeeVe: Unsupervised Acoustic State Discovery in Honey Bee Buzzing
-
Multi-scale dilated encoder improves closed-set speaker ID
TARNet: A Temporal-Aware Multi-Scale Architecture for Closed-Set Speaker Identification
-
Distance model switches from reverberation to delay with calibration
Dependence on Early and Late Reverberation of Single-Channel Speaker Distance Estimation
-
Decomposed stages yield better chord variety and rules compliance
A Decomposed Retrieval-Edit-Rerank Framework for Chord Generation
-
Audio-video models fail to keep physics consistent in transitions
Do Joint Audio-Video Generation Models Understand Physics?
-
MIST benchmark shows LLMs lag on voice IoT tasks
MIST: Multimodal Interactive Speech-based Tool-calling Conversational Assistants for Smart Homes
-
PianoCoRe combines sources into 157k aligned piano performances
PianoCoRe: Combined and Refined Piano MIDI Dataset
-
Self-alignment cuts audio token count 55% while keeping edit-distance search
PairAlign: A Framework for Sequence Tokenization via Self-Alignment with Applications to Audio Tokenization
-
Quantum spectrogram patches reach 0.87 AUROC in audio deepfake tests
Quantum Kernels for Audio Deepfake Detection Using Spectrogram Patch Features
-
Melody and rhythm show no diversity correlation across cultures
Do Melody and Rhythm Coevolve?
-
Input prosody alignment shrinks speech LLM modality gap
Minimizing Modality Gap from the Input Side: Your Speech LLM Can Be a Prosody-Aware Text LLM
-
0.4B model clones voices across 30 languages without transcripts
X-Voice: Enabling Everyone to Speak 30 Languages via Zero-Shot Cross-Lingual Voice Cloning