arXiv preprint arXiv:2509.14128

Monica Sekoyan, Nithin Rao Koluguri, Nune Tadevosyan, Piotr Zelasko, Travis Bartley, Nikolay Karpov, Jagadeesh Balam, Boris Ginsburg · 2025 · arXiv 2509.14128

18 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background 1

citation-polarity summary

background 1

representative citing papers

Preference-ASR: A Preference-Aware Test Set for Benchmarking ASR in the Era of Speech LLMs

cs.CL · 2026-06-28 · unverdicted · novelty 7.0

PreferenceASR is a preference-aware ASR test set built from seven corpora that shows model rankings change when user output-style instructions are considered.

Multilingual Word-Level Forced Alignment with Self-Supervised Representations and Learned Dynamic Programming

cs.CL · 2026-06-09 · unverdicted · novelty 7.0

A fused self-supervised encoder and learned DP decoder for word alignment outperforms MFA on English datasets and generalizes to unseen languages.

Cross-Talk Speech Reduction, by Separation, for Separation

eess.AS · 2026-05-19 · unverdicted · novelty 7.0

Proposes cross-talk reduction task with CTRnet and pseudo-label far-field separation (PuLSS) to train on real close-talk/far-field pairs, achieving SOTA ASR on CHiME-6 and outperforming guided source separation.

AppTek Call-Center Dialogues: A Multi-Accent Long-Form Benchmark for English ASR

cs.CL · 2026-04-30 · unverdicted · novelty 7.0

A new multi-accent long-form call-center dialogue dataset for English ASR evaluation shows substantial performance variation across accents and segmentation methods.

Contextual Earnings-22: A Speech Recognition Benchmark with Custom Vocabulary in the Wild

cs.CL · 2026-03-28 · unverdicted · novelty 7.0

Contextual Earnings-22 is a new benchmark dataset showing that scaled keyword prompting and boosting both deliver significantly better accuracy on custom vocabularies than standard academic tests.

Hearing to Translate: The Effectiveness of Speech Modality Integration into LLMs

cs.CL · 2025-12-18 · unverdicted · novelty 7.0

Cascaded systems remain the most reliable for speech translation overall, but recent SpeechLLMs match or outperform them in many conditions while standalone speech models lag.

What Counts as an Error? Dual-Reference Benchmarking for Atypical ASR

cs.CL · 2026-06-30 · unverdicted · novelty 6.0

Dual-reference benchmarking on atypical stuttered speech reveals disparities in ASR model performance and rankings between verbatim and intended transcriptions.

TRADE: Transducer-Augmented Decoder for Speech LLM

cs.CL · 2026-06-07 · unverdicted · novelty 6.0

TRADE augments multimodal Speech LLMs with a transducer branch for streaming ASR, reporting 6.71% WER offline and 8.40% streaming on the Open ASR Leaderboard from one checkpoint.

Audio Interaction Model

cs.SD · 2026-06-03 · unverdicted · novelty 6.0

Audio-Interaction unifies offline and online audio tasks into one streaming model via the SoundFlow framework and a new 2.6M-item streaming corpus, enabling real-time instruction following and proactive responses.

CleanCodec: Efficient and Robust Speech Tokenization via Perceptually Guided Encoding

cs.SD · 2026-06-03 · unverdicted · novelty 6.0

CleanCodec reframes audio tokenization as a selective information bottleneck to encode only perceptually important features at 12.5 tokens per second, outperforming prior codecs in efficiency, speaker similarity, and intelligibility.

ASPIRin: Action Space Projection for Interactivity-Optimized Reinforcement Learning in Full-Duplex Speech Language Models

cs.CL · 2026-04-11 · unverdicted · novelty 6.0

ASPIRin decouples speaking timing from token content via binary action space projection and applies GRPO with rule-based rewards to optimize interactivity in SLMs without semantic collapse or repetition.

NPUsper: Eliminating Redundant Computation for Real-Time Whisper on Mobile NPUs

cs.SD · 2026-07-01 · unverdicted · novelty 5.0

NPUsper reduces per-word latency, TTFT, and power for Whisper on mobile NPUs via online hallucination detection and K-step chunk graphs while preserving accuracy.

GlobeAudio: A Multilingual Multicultural Benchmark for Naturalistic Evaluation of Large Audio-Language Models

cs.CL · 2026-06-06 · unverdicted · novelty 5.0

GlobeAudio is a new multilingual multicultural benchmark for naturalistic evaluation of large audio-language models, showing performance gaps especially for open-source models and low-resource languages.

Symphony for Speech-to-Text: Supporting Real-Time Medical Voice Interfaces

cs.LG · 2026-05-15 · unverdicted · novelty 5.0 · 2 refs

Symphony is a medical-grade speech recognition system that decomposes transcription into specialized components and outperforms existing systems in clinical settings while matching them in general domains.

Assessing the Impact of Noise and Speech Enhancement on the Intelligibility of Speech Codecs

eess.AS · 2026-05-05 · unverdicted · novelty 4.0

Classical codecs prove more robust to noise than neural codecs, speech enhancement significantly helps noise-affected codecs, and listening effort plus ASR-based metrics add useful nuance beyond basic intelligibility scores.

MLLP-VRAIN UPV system for the IWSLT 2026 Simultaneous Speech Translation task

cs.CL · 2026-06-15 · unverdicted · novelty 3.0

A cascaded SimulST system using Parakeet and Qwen 3.5 with adaptive black-box policies and RAG context achieves +5.82 XCOMET-XL improvement on En→De for IWSLT 2026.

BUT System Description for CHiME-9 MCoRec Challenge

eess.AS · 2026-04-30 · unverdicted · novelty 3.0

BUT's CHiME-9 MCoRec system conditions Parakeet-v2 ASR on AV-HuBERT visuals for 33.7% WER and uses Qwen3.5 LLM for hierarchical clustering to reach 0.97 F1, beating the baseline by 16.2% WER and 0.15 F1 on the development set.

A Pocket Offline Model for Simultaneous Speech Translation as CUNI Submission to IWSLT 2026

cs.CL · 2026-06-02 · unverdicted · novelty 2.0

A 1B-parameter multilingual offline model is adapted with AlignAtt policy for simultaneous speech translation and submitted to IWSLT 2026 for three language pairs.

citing papers explorer

Showing 11 of 11 citing papers after filters.

Preference-ASR: A Preference-Aware Test Set for Benchmarking ASR in the Era of Speech LLMs

cs.CL · 2026-06-28 · unverdicted · none · ref 35

PreferenceASR is a preference-aware ASR test set built from seven corpora that shows model rankings change when user output-style instructions are considered.

Multilingual Word-Level Forced Alignment with Self-Supervised Representations and Learned Dynamic Programming

cs.CL · 2026-06-09 · unverdicted · none · ref 27

A fused self-supervised encoder and learned DP decoder for word alignment outperforms MFA on English datasets and generalizes to unseen languages.

AppTek Call-Center Dialogues: A Multi-Accent Long-Form Benchmark for English ASR

cs.CL · 2026-04-30 · unverdicted · none · ref 16

A new multi-accent long-form call-center dialogue dataset for English ASR evaluation shows substantial performance variation across accents and segmentation methods.

Contextual Earnings-22: A Speech Recognition Benchmark with Custom Vocabulary in the Wild

cs.CL · 2026-03-28 · unverdicted · none · ref 30

Contextual Earnings-22 is a new benchmark dataset showing that scaled keyword prompting and boosting both deliver significantly better accuracy on custom vocabularies than standard academic tests.

Hearing to Translate: The Effectiveness of Speech Modality Integration into LLMs

cs.CL · 2025-12-18 · unverdicted · none · ref 87

Cascaded systems remain the most reliable for speech translation overall, but recent SpeechLLMs match or outperform them in many conditions while standalone speech models lag.

What Counts as an Error? Dual-Reference Benchmarking for Atypical ASR

cs.CL · 2026-06-30 · unverdicted · none · ref 25

Dual-reference benchmarking on atypical stuttered speech reveals disparities in ASR model performance and rankings between verbatim and intended transcriptions.

TRADE: Transducer-Augmented Decoder for Speech LLM

cs.CL · 2026-06-07 · unverdicted · none · ref 15

TRADE augments multimodal Speech LLMs with a transducer branch for streaming ASR, reporting 6.71% WER offline and 8.40% streaming on the Open ASR Leaderboard from one checkpoint.

ASPIRin: Action Space Projection for Interactivity-Optimized Reinforcement Learning in Full-Duplex Speech Language Models

cs.CL · 2026-04-11 · unverdicted · none · ref 16

ASPIRin decouples speaking timing from token content via binary action space projection and applies GRPO with rule-based rewards to optimize interactivity in SLMs without semantic collapse or repetition.

GlobeAudio: A Multilingual Multicultural Benchmark for Naturalistic Evaluation of Large Audio-Language Models

cs.CL · 2026-06-06 · unverdicted · none · ref 76

GlobeAudio is a new multilingual multicultural benchmark for naturalistic evaluation of large audio-language models, showing performance gaps especially for open-source models and low-resource languages.

MLLP-VRAIN UPV system for the IWSLT 2026 Simultaneous Speech Translation task

cs.CL · 2026-06-15 · unverdicted · none · ref 60

A cascaded SimulST system using Parakeet and Qwen 3.5 with adaptive black-box policies and RAG context achieves +5.82 XCOMET-XL improvement on En→De for IWSLT 2026.

A Pocket Offline Model for Simultaneous Speech Translation as CUNI Submission to IWSLT 2026

cs.CL · 2026-06-02 · unverdicted · none · ref 12

A 1B-parameter multilingual offline model is adapted with AlignAtt policy for simultaneous speech translation and submitted to IWSLT 2026 for three language pairs.
