pith. machine review for the scientific record.

archive

Every paper Pith has read. Search by title, abstract, or pith.

375 papers in cs.SD · page 1

  1. cs.SD 2026-05-14 reviewed
    SpeakerLLM turns speaker verification into natural-language reasoning

    SpeakerLLM: A Speaker-Specialized Audio-LLM for Speaker Understanding and Verification Reasoning

    Ha-Jin Yu +4

  2. cs.SD 2026-05-14 reviewed
    Drum MIDI becomes audio matching any reference timbre

    Break-the-Beat! Controllable MIDI-to-Drum Audio Synthesis

    Chihiro Nagashima +11

  3. cs.SD 2026-05-14 reviewed
    Sonification lifts eye surgery event detection from 61 to 83 percent

    Physics-Based iOCT Sonification for Real-time Interaction Awareness in Subretinal Injection

    Andrea K. M. Ross +9

  4. cs.CL 2026-05-14 reviewed
    Calculus finds optimal vocabulary size for ASR

    A Calculus-Based Framework for Determining Vocabulary Size in End-to-End ASR

    Sunil Kumar Kopparapu

  5. cs.SD 2026-05-14 reviewed
    The paper introduces a framework that refines pseudo-audio prompts by explicitly modeling…

    Refining Pseudo-Audio Prompts with Speech-Text Alignment for Text-Only Domain Adaptation in LLM-Based ASR

    Ryo Magoshi +2

  6. cs.LG 2026-05-14 reviewed
    Masked contrastive pairs beat audio reconstruction

    AudioMosaic: Contrastive Masked Audio Representation Learning

    Christopher Leckie +5

  7. eess.AS 2026-05-13 reviewed
    Benchmark standardizes early Parkinson's speech detection

    A Benchmark for Early-stage Parkinson's Disease Detection from Speech

    Bastiaan R. Bloem +5

  8. cs.SD 2026-05-13 reviewed
    General audio pretraining beats domain MAE for bioacoustics

    Masked Autoencoders with Limited Data: Does It Work? A Fine-Grained Bioacoustics Case Study

    Grant Van Horn +3

  9. cs.SE 2026-05-13 reviewed
    AI agents speed creation of digital music instruments

    Case Studies and Reflections on Agentic Software Engineering for Rapid Development of Digital Music Instruments

    Matthew John Yee-King

  10. cs.SD 2026-05-13 reviewed
    No voice agent tops 0.5 on both accuracy and experience

    EVA-Bench: A New End-to-end Framework for Evaluating Voice Agents

    Anil Madamala +12

  11. cs.SD 2026-05-13 reviewed
    Oscillatory memory gates audio AI to salient events

    NAACA: Training-Free NeuroAuditory Attentive Cognitive Architecture with Oscillatory Working Memory for Salience-Driven Attention Gating

    Dick Botteldooren +2

  12. cs.SD 2026-05-13 reviewed
    Two-stage system turns text into playable sheet music

    Text2Score: Generating Sheet Music From Textual Prompts

    Abhinaba Roy +6

  13. cs.SD 2026-05-13 reviewed
    PCA diffusion beats regression for symbolic drum audio

    Seconds-Aligned PCA-DAC Latent Diffusion for Symbolic-to-Audio Drum Rendering

    Dimos Makris +3

  14. cs.SD 2026-05-12 reviewed
    Hybrid Whisper model detects speaker confidence at 0.75 Macro-F1

    A Semi-Supervised Framework for Speech Confidence Detection using Whisper

    Adam Wynn +1

  15. cs.SD 2026-05-12 reviewed
    Singing conversion works on accompanied tracks without vocal cleanup

    Poly-SVC: Polyphony-Aware Singing Voice Conversion with Harmonic Modeling

    Chen Geng +4

  16. eess.AS 2026-05-12 reviewed
    SMC dataset exposes tempo bias in state-of-the-art beat tracking models

    The SMC Blind Spot: A Failure Mode Analysis of State-of-the-Art Beat Tracking

    Jaehoon Ahn +2

  17. cs.SD 2026-05-12 reviewed
    STRUM turns raw audio into playable rhythm charts at 0.84 F1 for drums

    STRUM: A Spectral Transcription and Rhythm Understanding Model for End-to-End Generation of Playable Rhythm-Game Charts

    Joshua Opria

  18. cs.SD 2026-05-12 reviewed
    Closed-loop AI boosts quality of long audio stories

    AuDirector: A Self-Reflective Closed-Loop Framework for Immersive Audio Storytelling

    Baoxiang Li +5

  19. eess.SP 2026-05-11 reviewed
    Lanczos Krylov method matches exact EVD for adaptive diagonal loading

    Adaptive Diagonal Loading using Krylov Subspaces for Robust Beamforming

    Andrew C. Singer +3

  20. cs.SD 2026-05-11 reviewed
    Token swaps edit speaker identity in compressed audio

    Exploring Token-Space Manipulation in Latent Audio Tokenizers

    Cem Subakan +3

  21. cs.SD 2026-05-11 reviewed
    Codec keeps emotional cues in compressed speech

    AffectCodec: Emotion-Preserving Neural Speech Codec for Expressive Speech Modeling

    Hongfei Du +5

  22. cs.SD 2026-05-11 reviewed
    Multi-layer attention probes improve bioacoustic encoder evaluation

    Multi-layer attentive probing improves transfer of audio representations for bioacoustics

    Aza Raskin +17

  23. cs.SD 2026-05-11 reviewed
    Transformer predicts codec tokens to synthesize drums from MIDI grids

    Drum Synthesis from Expressive Drum Grids via Neural Audio Codecs

    Dimos Makris +3

  24. cs.SD 2026-05-11 reviewed
    Cold diffusion cleans reverb from drum signals

    A Cold Diffusion Approach for Percussive Dereverberation

    András Barják +2

  25. cs.SD 2026-05-11 reviewed
    Acoustic priors sharpen timbre edits in polyphonic music

    Polyphonia: Zero-Shot Timbre Transfer in Polyphonic Music with Acoustic-Informed Attention Calibration

    Boyu Cao +4

  26. cs.SD 2026-05-11 reviewed
    APEX explains audio classifiers with four prototype views

    APEX: Audio Prototype EXplanations for Classification Tasks

    Kornel Howil +5

  27. eess.AS 2026-05-11 reviewed
    Disentangling power doubles audio generation convergence speed

    PoDAR: Power-Disentangled Audio Representation for Generative Modeling

    Alejandro Luebs +7

  28. cs.LG 2026-05-11 reviewed
    Deep models extract voice biomarkers for depression at 71% sensitivity

    Voice Biomarkers for Depression and Anxiety

    Colin Vaz +2

  29. cs.AI 2026-05-11 reviewed
    Separate reasoning cuts interference in audio-visual LLMs

    Separate First, Fuse Later: Mitigating Cross-Modal Interference in Audio-Visual LLMs Reasoning with Modality-Specific Chain-of-Thought

    Chenrui Cui +8

  30. cs.SD 2026-05-11 reviewed
    System turns Chladni patterns into real-time sound with 99% accuracy

    ChladniSonify: A Visual-Acoustic Mapping Method for Chladni Patterns in New Media Art Creation

    Dong Liu +3

  31. cs.SD 2026-05-10 reviewed
    Joint diffusion remixes timbres across all stems in a mixture at once

    Remix the Timbre: Diffusion-Based Style Transfer Across Polyphonic Stems

    Junchuan Zhao +2

  32. cs.IR 2026-05-09 reviewed
    Reddit music chats become 190k Deezer-grounded dialogues

    Reddit2Deezer: A Scalable Dataset for Real-World Grounded Conversational Music Recommendation

    Haven Kim +1

  33. cs.SD 2026-05-09 reviewed
    Per-gender thresholds cut deepfake detector bias by 54-75%

    Towards Trustworthy Audio Deepfake Detection: A Systematic Framework for Diagnosing and Mitigating Gender Bias

    Aishwarya Fursule +2

  34. cs.SD 2026-05-09 reviewed
    Audio-first search benchmark caps top model at 43%

    Omni-DeepSearch: A Benchmark for Audio-Driven Omni-Modal Deep Search

    Haopeng Jin +18

  35. cs.CV 2026-05-09 reviewed
    Unison aligns motion, speech and sound in video generation

    Unison: Harmonizing Motion, Speech, and Sound for Human-Centric Audio-Video Generation

    Chi Zhang +8

  36. cs.SD 2026-05-08 reviewed
    DP segments audio to reset beamformer covariance on the fly

    Online Segmented Beamforming via Dynamic Programming

    Andrew C. Singer +4

  37. cs.SD 2026-05-08 reviewed
    Unsupervised tokens split bee hives by queen status

    BeeVe: Unsupervised Acoustic State Discovery in Honey Bee Buzzing

    Hamze Hammami +1

  38. cs.SD 2026-05-08 reviewed
    Multi-scale dilated encoder improves closed-set speaker ID

    TARNet: A Temporal-Aware Multi-Scale Architecture for Closed-Set Speaker Identification

    Yassin Terraf +1

  39. eess.AS 2026-05-08 reviewed
    Distance model switches from reverberation to delay with calibration

    Dependence on Early and Late Reverberation of Single-Channel Speaker Distance Estimation

    Archontis Politis +2

  40. cs.SD 2026-05-08 reviewed
    Decomposed stages yield better chord variety and rule compliance

    A Decomposed Retrieval-Edit-Rerank Framework for Chord Generation

    Anqi Huang +3

  41. cs.SD 2026-05-08 reviewed
    Audio-video models fail to keep physics consistent in transitions

    Do Joint Audio-Video Generation Models Understand Physics?

    Chenming Ge +10

  42. cs.CL 2026-05-07 reviewed
    MIST benchmark shows LLMs lag on voice IoT tasks

    MIST: Multimodal Interactive Speech-based Tool-calling Conversational Assistants for Smart Homes

    Alexandros Papangelis +5

  43. cs.SD 2026-05-07 reviewed
    PianoCoRe combines sources into 157k aligned piano performances

    PianoCoRe: Combined and Refined Piano MIDI Dataset

    Ilya Borovik

  44. cs.LG 2026-05-07 reviewed
    Self-alignment cuts audio token count 55% while keeping edit-distance search

    PairAlign: A Framework for Sequence Tokenization via Self-Alignment with Applications to Audio Tokenization

    Adhiraj Banerjee +1

  45. cs.SD 2026-05-07 reviewed
    Quantum spectrogram patches reach 0.87 AUROC in audio deepfake tests

    Quantum Kernels for Audio Deepfake Detection Using Spectrogram Patch Features

    Faisal Quader +4

  46. cs.SD 2026-05-07 reviewed
    Melody and rhythm show no diversity correlation across cultures

    Do Melody and Rhythm Coevolve?

    Harin Lee +5

  47. cs.CL 2026-05-07 reviewed
    Input prosody alignment shrinks speech LLM modality gap

    Minimizing Modality Gap from the Input Side: Your Speech LLM Can Be a Prosody-Aware Text LLM

    Daxin Tan +4

  48. cs.CL 2026-05-07 reviewed
    Prosody embeddings at input cut speech LLM modality gap

    Minimizing Modality Gap from the Input Side: Your Speech LLM Can Be a Prosody-Aware Text LLM

    Daxin Tan +4

  49. cs.SD 2026-05-07 reviewed
    0.4B model clones voices across 30 languages without transcripts

    X-Voice: Enabling Everyone to Speak 30 Languages via Zero-Shot Cross-Lingual Voice Cloning

    Berrak Sisman +12

  50. cs.SD 2026-05-07 reviewed
    0.4B model clones any voice across 30 languages zero-shot

    X-Voice: Enabling Everyone to Speak 30 Languages via Zero-Shot Cross-Lingual Voice Cloning

    Berrak Sisman +12