archive
Every paper Pith has read. Search by title, abstract, or pith.
240 papers in eess.AS
-
SpeakerLLM turns speaker verification into natural-language reasoning
SpeakerLLM: A Speaker-Specialized Audio-LLM for Speaker Understanding and Verification Reasoning
-
Benchmark standardizes early Parkinson's speech detection
A Benchmark for Early-stage Parkinson's Disease Detection from Speech
-
Framework filters FSD50K to single-source audio clips
FSD50K-Solo: Automated Curation of Single-Source Sound Events
-
SMC dataset exposes tempo bias in state-of-the-art beat tracking models
The SMC Blind Spot: A Failure Mode Analysis of State-of-the-Art Beat Tracking
-
STRUM turns raw audio into playable rhythm charts at 0.84 F1 for drums
STRUM: A Spectral Transcription and Rhythm Understanding Model for End-to-End Generation of Playable Rhythm-Game Charts
-
Modern ASR matches humans on enhanced speech but misleads on quality
Too Good to Be True: A Study on Modern Automatic Speech Recognition for the Evaluation of Speech Enhancement
-
FM-Speech outperforms rivals on 14 fine-grained speech dimensions
Towards Fine-Grained Multi-Dimensional Speech Understanding: Data Pipeline, Benchmark, and Model
-
Chunkwise Aligner matches Transducer accuracy at lower cost
Chunkwise Aligners for Streaming Speech Recognition
-
Lanczos Krylov method matches exact EVD for adaptive diagonal loading
Adaptive Diagonal Loading using Krylov Subspaces for Robust Beamforming
-
AVLLMs store audio-visual data in specialized sink tokens
Probing Cross-modal Information Hubs in Audio-Visual LLMs
-
Flow matching reconstructs sound fields from few microphones
SF-Flow: Sound field magnitude estimation via flow matching guided by sparse measurements
-
Acoustic priors sharpen timbre edits in polyphonic music
Polyphonia: Zero-Shot Timbre Transfer in Polyphonic Music with Acoustic-Informed Attention Calibration
-
Direct user routing improves spoken QA but risks incoherent interruptions
How Should LLMs Listen While Speaking? A Study of User-Stream Routing in Full-Duplex Spoken Dialogue
-
Disentangling power doubles audio generation convergence speed
PoDAR: Power-Disentangled Audio Representation for Generative Modeling
-
Late reverberation tail reveals same-room source location from one mic
Single-Microphone Audio Point Source Discriminative Localization From Reverberation Late Tail Estimation
-
Challenge shows audio deepfake detectors still fail after media changes
RADAR Challenge 2026: Robust Audio Deepfake Recognition under Media Transformations
-
Context model outperforms speech evaluators on appropriateness
Evaluating the Expressive Appropriateness of Speech in Rich Contexts
-
Kinetic-optimal scheduling and moment correction lift zero-shot TTS
Kinetic-Optimal Scheduling with Moment Correction for Metric-Induced Discrete Flow Matching in Zero-Shot Text-to-Speech
-
Temperature sampling lifts Chinese dialect ASR accuracy
Dolphin-CN-Dialect: Where Chinese Dialects Matter
-
Distillation cuts hallucinations in LM-based speech enhancement
Reducing Linguistic Hallucination in LM-Based Speech Enhancement via Noise-Invariant Acoustic-Semantic Distillation
-
Keyed rotations watermark speech in codec latent spaces
Latent Secret Spin: Keyed Orthogonal Rotations for Blind Speech Watermarking in Anisotropic Latent Spaces
-
Mapping imagined MEG to listened signals decodes unspoken words
Zero-Shot Imagined Speech Decoding via Imagined-to-Listened MEG Mapping
-
Distance model switches from reverberation to delay with calibration
Dependence on Early and Late Reverberation of Single-Channel Speaker Distance Estimation
-
Rank metric reveals voice anonymisation leaks EER overlooks
Evaluating voice anonymisation using similarity rank disclosure
-
Phase-coded audio watermark verifies at 98% after attacks
Asymmetric Phase Coding Audio Watermarking
-
MIST benchmark shows LLMs lag on voice IoT tasks
MIST: Multimodal Interactive Speech-based Tool-calling Conversational Assistants for Smart Homes
-
Protocol approves audio compression only for worst query families
Task-Aware Answer Preservation under Audio Compression for Large Audio Language Models
-
Neural codec with FFT encoder outperforms tokenizers on sensors
LiVeAction: a Lightweight, Versatile, and Asymmetric Neural Codec Design for Real-time Operation
-
Weight decay induces Villani coercivity in Transformer losses
Weight-Decay Turns Transformer Loss Landscapes Villani: Functional-Analytic Foundations for Optimization and Generalization
-
Compact latent unifies speech understanding and generation
WavCube: Unifying Speech Representation for Understanding and Generation via Semantic-Acoustic Joint Modeling
-
Decomposing interpolants boosts speech enhancement quality
Predictive-Generative Drift Decomposition for Speech Enhancement and Separation
-
NDF+ adds control over diffuse sound in virtual microphone outputs
NDF+: Joint Neural Directional Filtering and Diffuse Sound Extraction
-
Prosody embeddings at input cut speech LLM modality gap
Minimizing Modality Gap from the Input Side: Your Speech LLM Can Be a Prosody-Aware Text LLM
-
0.4B model clones voices across 30 languages without transcripts
X-Voice: Enabling Everyone to Speak 30 Languages via Zero-Shot Cross-Lingual Voice Cloning
-
Learned Riemannian costs raise audio distance correlation with human ratings
Optimal Transport Audio Distance with Learned Riemannian Ground Metrics
-
Neural net creates virtual mics to nearly match full-array performance
Spatial-Magnifier: Spatial upsampling for multichannel speech enhancement
-
Bangla ASR hits 0.2441 WER after Whisper fine-tuning
Bangla-WhisperDiar: Fine-Tuning Whisper and PyAnnote for Bangla Long-Form Speech Recognition and Speaker Diarization
-
Instruction-tuned model matches human audio ratings without retraining
JASTIN: Aligning LLMs for Zero-Shot Audio and Speech Evaluation via Natural Language Instructions
-
Web tool maps ship underwater noise worldwide in near real time
ShipEcho -- An Interactive Tool for Global Mapping of Underwater Radiated Noise from Vessels
-
0.1B omni model reaches 0.09 CER in speech-text consistency
MiniMind-O Technical Report: An Open Small-Scale Speech-Native Omni Model
-
Classical codecs resist noise better than neural ones
Assessing the Impact of Noise and Speech Enhancement on the Intelligibility of Speech Codecs
-
Diffusion model beats top echo canceller with less compute
DiffVQE: Hybrid Diffusion Voice Quality Enhancement Under Acoustic Echo and Noise
-
Entropy minimization decomposes for autoregressive test-time adaptation
Rethinking Entropy Minimization in Test-Time Adaptation for Autoregressive Models
-
Phoneme checks detect emotional deepfakes
Phoneme-Level Deepfake Detection Across Emotional Conditions Using Self-Supervised Embeddings
-
Simple pitch and noise checks catch 85% of bad voice clones
Low-Cost Detection of Degraded Voice Clones via Source-Output Acoustic Consistency
-
Partitioned speech vectors allow searches that ignore speaker or focus on words
Multi-Axis Speech Similarity via Factor-Partitioned Embeddings