hub Tool reference

Sorokin et al.

SpeechT5: Unified-modal encoder-decoder pre-training for spoken language processing · 2019 · arXiv 1912.06670

Tool reference. 80% of classified Pith citations use this work as a method, library, or software dependency, not as a substantive claim.

20 Pith papers citing it

citation-role summary

dataset: 3 · background: 1 · method: 1
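
The 80% figure quoted near the top of this page appears to follow from these role counts. The sketch below reproduces the arithmetic under one assumption that the page does not state: that the "dataset" and "method" roles are the ones grouped as tool usage. The names `role_counts`, `TOOL_ROLES`, and `tool_share` are hypothetical and used only for illustration.

```python
# Role counts copied from the citation-role summary above.
role_counts = {"dataset": 3, "background": 1, "method": 1}

# Assumption (not stated on the page): these roles count as "tool" usage.
TOOL_ROLES = {"dataset", "method"}


def tool_share(counts, tool_roles=TOOL_ROLES):
    """Return the fraction of classified citations whose role is in tool_roles."""
    total = sum(counts.values())
    if total == 0:
        return 0.0  # no classified citations, so no meaningful share
    tool_total = sum(n for role, n in counts.items() if role in tool_roles)
    return tool_total / total


share = tool_share(role_counts)
print(f"{share:.0%} of {sum(role_counts.values())} classified citations")
# prints "80% of 5 classified citations"
```

Under this reading, only 5 of the 20 citing papers have a classified role, so the percentage describes that subset and not the full list.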

citation-polarity summary

use dataset: 3 · background: 1 · use method: 1

representative citing papers

Dial HEALTHDIAL for Advice: A Multilingual and Multi-Parallel Spoken Dialogue Dataset for Knowledge-Grounded Information Seeking

cs.CL · 2026-05-28 · accept · novelty 7.0

HEALTHDIAL is a multilingual multi-parallel spoken dialogue dataset containing 1,500 dialogues per language grounded in WHO content, with recorded speech and speaker metadata across four languages.

Beyond Content: A Comprehensive Speech Toxicity Dataset and Detection Framework Incorporating Paralinguistic Cues

cs.SD · 2026-05-15 · unverdicted · novelty 7.0

ToxiAlert-Bench dataset and dual-head neural network detect toxic speech by distinguishing textual versus paralinguistic sources, reporting 21.1% Macro-F1 and 13% accuracy gains over baselines.

Membership Inference for Contrastive Pre-training Models with Text-only PII Queries

cs.CR · 2026-03-15 · unverdicted · novelty 7.0

UMID infers membership in contrastive pre-training data using only text queries by performing latent inversion and comparing similarity and variability signals to synthetic gibberish references via unsupervised anomaly detection.

High Fidelity Neural Audio Compression

eess.AS · 2022-10-24 · accept · novelty 7.0

EnCodec is an end-to-end trained streaming neural audio codec that uses a single multiscale spectrogram discriminator and a gradient-normalizing loss balancer to achieve higher fidelity than prior methods at the same bitrates for 24 kHz mono and 48 kHz stereo audio.

BlasBench: An Open Benchmark for Irish Speech Recognition

cs.CL · 2026-04-12 · conditional · novelty 6.0

BlasBench supplies an Irish-aware normalizer and scoring harness that enables reproducible ASR comparisons and exposes a 33-43 point generalization gap for fine-tuned models versus 7-10 points for massively multilingual ones.

Two-Dimensional Quantization for Geometry-Aware Audio Coding

cs.SD · 2025-12-01 · unverdicted · novelty 6.0

Q2D2 uses 2D geometric grid projections to quantize feature pairs in neural audio codecs, yielding implicit codebooks that improve efficiency and utilization over RVQ, VQ, and FSQ while maintaining reconstruction quality.

StableToken: A Noise-Robust Semantic Speech Tokenizer for Resilient SpeechLLMs

cs.CL · 2025-09-26 · unverdicted · novelty 6.0

StableToken introduces a multi-branch architecture with bit-wise voting to create noise-robust semantic speech tokens, achieving lower Unit Edit Distance and better SpeechLLM robustness than prior single-path tokenizers.

A Common Pool of Privacy Problems: Legal and Technical Lessons from a Large-Scale Web-Scraped Machine Learning Dataset

cs.CR · 2025-06-20 · unverdicted · novelty 6.0

An empirical audit of one web-scraped ML training dataset reveals persistent PII after sanitization, which the authors combine with legal analysis to highlight privacy risks and advocate redefining 'publicly available' data for AI training.

SwitchCodec: A High-Fidelity Nerual Audio Codec With Sparse Quantization

cs.SD · 2025-05-30 · unverdicted · novelty 6.0

SwitchCodec introduces Residual Experts Vector Quantization and a multi-tiered STFT discriminator to achieve PESQ 2.87 and ViSQOL 4.27 at 2.67 kbps while halving training time via post-training.

CosyVoice 3: Towards In-the-wild Speech Generation via Scaling-up and Post-training

cs.SD · 2025-05-23 · unverdicted · novelty 6.0

CosyVoice 3 achieves better content consistency, speaker similarity, and prosody naturalness in zero-shot multilingual speech synthesis by scaling data to one million hours, model size to 1.5 billion parameters, and introducing a supervised multi-task speech tokenizer plus a differentiable reward model.

Seed-TTS: A Family of High-Quality Versatile Speech Generation Models

eess.AS · 2024-06-04 · unverdicted · novelty 6.0

Seed-TTS models produce speech matching human naturalness and speaker similarity, with added controllability via self-distillation and reinforcement learning.

HARNESS: Lightweight Distilled Arabic Speech Foundation Models

eess.AS · 2026-03-31 · accept · novelty 5.5

HARNESS introduces Arabic-centric speech foundation models that achieve high efficiency and performance through iterative self-distillation and PCA-based signal compression.

Lost in the Tower of Babel: The Adverse Effects of Incidental Multilingualism in LLMs

cs.CL · 2026-05-02 · unverdicted · novelty 5.0

Incidental multilingualism from uneven web training makes LLMs unequal, brittle, and opaque across languages.

Elderly-Contextual Data Augmentation via Speech Synthesis for Elderly ASR

cs.CL · 2026-04-15 · unverdicted · novelty 5.0

Combining LLM-based elderly-contextual paraphrasing with TTS synthesis using elderly speakers reduces word error rates in elderly ASR by up to 58% over standard Whisper baselines on English and Korean datasets.

IQRA 2026: Interspeech Challenge on Automatic Pronunciation Assessment for Modern Standard Arabic (MSA)

cs.SD · 2026-03-31 · unverdicted · novelty 5.0

The IQRA 2026 challenge on Arabic mispronunciation detection reports a 0.28 F1-score gain from new authentic human error data and diverse modeling approaches including self-supervised and audio-language models.

Kimi-Audio Technical Report

eess.AS · 2025-04-25 · unverdicted · novelty 5.0

Kimi-Audio is an open-source audio foundation model that achieves state-of-the-art results on speech recognition, audio understanding, question answering, and conversation after pre-training on more than 13 million hours of speech, sound, and music data.

F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching

eess.AS · 2024-10-09 · unverdicted · novelty 5.0

F5-TTS generates natural speech from text via flow matching on DiT with simple text padding, ConvNeXt refinement, and sway sampling, trained on 100K hours of multilingual data.

In-Sync: Adaptation of Speech Aware Large Language Models for ASR with Word Level Timestamp Predictions

eess.AS · 2026-04-14 · unverdicted · novelty 4.0

Lightweight training strategies allow speech-aware LLMs to output accurate word timestamps alongside ASR transcripts while also improving recognition quality across datasets.

Non-Intrusive Automatic Speech Recognition Refinement: A Survey

eess.AS · 2025-08-10 · accept · novelty 4.0

A survey that classifies non-intrusive ASR refinement methods into five categories, reviews domain adaptation and evaluation datasets, proposes standardized metrics, and identifies future research directions.

Keyword spotting using convolutional neural network for speech recognition in Hindi

cs.SD · 2026-04-26 · unverdicted · novelty 2.0

CNNs using MFCC features achieve 91.79% accuracy for keyword spotting in Hindi speech on a 40,000-sample dataset.

citing papers explorer

Showing 20 of 20 citing papers.

Dial HEALTHDIAL for Advice: A Multilingual and Multi-Parallel Spoken Dialogue Dataset for Knowledge-Grounded Information Seeking cs.CL · 2026-05-28 · accept · none · ref 1
Beyond Content: A Comprehensive Speech Toxicity Dataset and Detection Framework Incorporating Paralinguistic Cues cs.SD · 2026-05-15 · unverdicted · none · ref 24
Membership Inference for Contrastive Pre-training Models with Text-only PII Queries cs.CR · 2026-03-15 · unverdicted · none · ref 51
High Fidelity Neural Audio Compression eess.AS · 2022-10-24 · accept · none · ref 2
BlasBench: An Open Benchmark for Irish Speech Recognition cs.CL · 2026-04-12 · conditional · none · ref 4
Two-Dimensional Quantization for Geometry-Aware Audio Coding cs.SD · 2025-12-01 · unverdicted · none · ref 9
StableToken: A Noise-Robust Semantic Speech Tokenizer for Resilient SpeechLLMs cs.CL · 2025-09-26 · unverdicted · none · ref 4
A Common Pool of Privacy Problems: Legal and Technical Lessons from a Large-Scale Web-Scraped Machine Learning Dataset cs.CR · 2025-06-20 · unverdicted · none · ref 11
SwitchCodec: A High-Fidelity Nerual Audio Codec With Sparse Quantization cs.SD · 2025-05-30 · unverdicted · none · ref 49
CosyVoice 3: Towards In-the-wild Speech Generation via Scaling-up and Post-training cs.SD · 2025-05-23 · unverdicted · none · ref 52
Seed-TTS: A Family of High-Quality Versatile Speech Generation Models eess.AS · 2024-06-04 · unverdicted · none · ref 13
HARNESS: Lightweight Distilled Arabic Speech Foundation Models eess.AS · 2026-03-31 · accept · none · ref 2
Lost in the Tower of Babel: The Adverse Effects of Incidental Multilingualism in LLMs cs.CL · 2026-05-02 · unverdicted · none · ref 115
Elderly-Contextual Data Augmentation via Speech Synthesis for Elderly ASR cs.CL · 2026-04-15 · unverdicted · none · ref 8
IQRA 2026: Interspeech Challenge on Automatic Pronunciation Assessment for Modern Standard Arabic (MSA) cs.SD · 2026-03-31 · unverdicted · none · ref 32
Kimi-Audio Technical Report eess.AS · 2025-04-25 · unverdicted · none · ref 1
F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching eess.AS · 2024-10-09 · unverdicted · none · ref 81
In-Sync: Adaptation of Speech Aware Large Language Models for ASR with Word Level Timestamp Predictions eess.AS · 2026-04-14 · unverdicted · none · ref 31
Non-Intrusive Automatic Speech Recognition Refinement: A Survey eess.AS · 2025-08-10 · accept · none · ref 129
Keyword spotting using convolutional neural network for speech recognition in Hindi cs.SD · 2026-04-26 · unverdicted · none · ref 12
