archive
Every paper Pith has read. Search by title, abstract, or pith.
240 papers in eess.AS
-
SpeakerLLM turns speaker verification into natural-language reasoning
SpeakerLLM: A Speaker-Specialized Audio-LLM for Speaker Understanding and Verification Reasoning
-
Benchmark standardizes early Parkinson's speech detection
A Benchmark for Early-stage Parkinson's Disease Detection from Speech
-
Framework filters FSD50K to single-source audio clips
FSD50K-Solo: Automated Curation of Single-Source Sound Events
-
SMC dataset exposes tempo bias in state-of-the-art beat tracking models
The SMC Blind Spot: A Failure Mode Analysis of State-of-the-Art Beat Tracking
-
STRUM turns raw audio into playable rhythm charts at 0.84 F1 for drums
STRUM: A Spectral Transcription and Rhythm Understanding Model for End-to-End Generation of Playable Rhythm-Game Charts
-
Modern ASR matches humans on enhanced speech but misleads on quality
Too Good to Be True: A Study on Modern Automatic Speech Recognition for the Evaluation of Speech Enhancement
-
FM-Speech outperforms rivals on 14 fine-grained speech dimensions
Towards Fine-Grained Multi-Dimensional Speech Understanding: Data Pipeline, Benchmark, and Model
-
Chunkwise Aligner matches Transducer accuracy at lower cost
Chunkwise Aligners for Streaming Speech Recognition
-
Lanczos Krylov method matches exact EVD for adaptive diagonal loading
Adaptive Diagonal Loading using Krylov Subspaces for Robust Beamforming
-
AVLLMs store audio-visual data in specialized sink tokens
Probing Cross-modal Information Hubs in Audio-Visual LLMs
-
Flow matching reconstructs sound fields from few microphones
SF-Flow: Sound field magnitude estimation via flow matching guided by sparse measurements
-
Acoustic priors sharpen timbre edits in polyphonic music
Polyphonia: Zero-Shot Timbre Transfer in Polyphonic Music with Acoustic-Informed Attention Calibration
-
Direct user routing improves spoken QA but risks incoherent interruptions
How Should LLMs Listen While Speaking? A Study of User-Stream Routing in Full-Duplex Spoken Dialogue
-
Disentangling power doubles audio generation convergence speed
PoDAR: Power-Disentangled Audio Representation for Generative Modeling
-
Late reverberation tail reveals same-room source location from one mic
Single-Microphone Audio Point Source Discriminative Localization From Reverberation Late Tail Estimation
-
Challenge shows audio deepfake detectors still fail after media changes
RADAR Challenge 2026: Robust Audio Deepfake Recognition under Media Transformations
-
Context model outperforms speech evaluators on appropriateness
Evaluating the Expressive Appropriateness of Speech in Rich Contexts
-
Kinetic-optimal scheduling and moment correction lift zero-shot TTS
Kinetic-Optimal Scheduling with Moment Correction for Metric-Induced Discrete Flow Matching in Zero-Shot Text-to-Speech
-
Temperature sampling lifts Chinese dialect ASR accuracy
Dolphin-CN-Dialect: Where Chinese Dialects Matter
-
Distillation cuts hallucinations in LM-based speech enhancement
Reducing Linguistic Hallucination in LM-Based Speech Enhancement via Noise-Invariant Acoustic-Semantic Distillation
-
Keyed rotations watermark speech in codec latent spaces
Latent Secret Spin: Keyed Orthogonal Rotations for Blind Speech Watermarking in Anisotropic Latent Spaces
-
Mapping imagined MEG to listened signals decodes unspoken words
Zero-Shot Imagined Speech Decoding via Imagined-to-Listened MEG Mapping
-
Distance model switches from reverberation to delay with calibration
Dependence on Early and Late Reverberation of Single-Channel Speaker Distance Estimation
-
Rank metric reveals voice anonymisation leaks EER overlooks
Evaluating voice anonymisation using similarity rank disclosure
-
Phase-coded audio watermark verifies at 98% after attacks
Asymmetric Phase Coding Audio Watermarking
-
MIST benchmark shows LLMs lag on voice IoT tasks
MIST: Multimodal Interactive Speech-based Tool-calling Conversational Assistants for Smart Homes
-
Protocol approves audio compression only for worst query families
Task-Aware Answer Preservation under Audio Compression for Large Audio Language Models
-
Neural codec with FFT encoder outperforms tokenizers on sensors
LiVeAction: a Lightweight, Versatile, and Asymmetric Neural Codec Design for Real-time Operation
-
Weight decay induces Villani coercivity in Transformer losses
Weight-Decay Turns Transformer Loss Landscapes Villani: Functional-Analytic Foundations for Optimization and Generalization
-
Compact latent unifies speech understanding and generation
WavCube: Unifying Speech Representation for Understanding and Generation via Semantic-Acoustic Joint Modeling
-
Decomposing interpolants boosts speech enhancement quality
Predictive-Generative Drift Decomposition for Speech Enhancement and Separation
-
NDF+ adds control over diffuse sound in virtual microphone outputs
NDF+: Joint Neural Directional Filtering and Diffuse Sound Extraction
-
Prosody embeddings at input cut speech LLM modality gap
Minimizing Modality Gap from the Input Side: Your Speech LLM Can Be a Prosody-Aware Text LLM
-
0.4B model clones voices across 30 languages without transcripts
X-Voice: Enabling Everyone to Speak 30 Languages via Zero-Shot Cross-Lingual Voice Cloning
-
Learned Riemannian costs raise audio distance correlation with human ratings
Optimal Transport Audio Distance with Learned Riemannian Ground Metrics
-
Neural net creates virtual mics to nearly match full-array performance
Spatial-Magnifier: Spatial upsampling for multichannel speech enhancement
-
Bangla ASR hits 0.2441 WER after Whisper fine-tuning
Bangla-WhisperDiar: Fine-Tuning Whisper and PyAnnote for Bangla Long-Form Speech Recognition and Speaker Diarization
-
Instruction-tuned model matches human audio ratings without retraining
JASTIN: Aligning LLMs for Zero-Shot Audio and Speech Evaluation via Natural Language Instructions
-
Web tool maps ship underwater noise worldwide in near real time
ShipEcho -- An Interactive Tool for Global Mapping of Underwater Radiated Noise from Vessels
-
0.1B omni model reaches 0.09 CER in speech-text consistency
MiniMind-O Technical Report: An Open Small-Scale Speech-Native Omni Model
-
Classical codecs resist noise better than neural ones
Assessing the Impact of Noise and Speech Enhancement on the Intelligibility of Speech Codecs
-
Diffusion model beats top echo canceller with less compute
DiffVQE: Hybrid Diffusion Voice Quality Enhancement Under Acoustic Echo and Noise
-
Entropy minimization decomposes for autoregressive test-time adaptation
Rethinking Entropy Minimization in Test-Time Adaptation for Autoregressive Models
-
Phoneme checks detect emotional deepfakes
Phoneme-Level Deepfake Detection Across Emotional Conditions Using Self-Supervised Embeddings
-
Simple pitch and noise checks catch 85% of bad voice clones
Low-Cost Detection of Degraded Voice Clones via Source-Output Acoustic Consistency
-
Partitioned speech vectors allow searches that ignore speaker or focus on words
Multi-Axis Speech Similarity via Factor-Partitioned Embeddings