ReasonAudio benchmark reveals that state-of-the-art text-audio retrieval models struggle with reasoning tasks like negation and duration, and multimodal LLMs lose reasoning ability after contrastive fine-tuning.
author Zhou, A
Mixed citation behavior. Most common role is background (57%).
representative citing papers
ciwGAN and fiwGAN models trained on isolated words spontaneously generate concatenated multi-word outputs and display early compositionality precursors.
Sarashina2.2-TTS achieves SOTA kanji reading accuracy via data scaling and Joyo-kanji-targeted synthesis, introduces the Joyo Kanji Yomi Benchmark and Kana-CER metric, and shows stable cross-lingual performance.
Latent profile analysis of 1,174 Reddit users identifies four self-stigma personas among people who use drugs (PWUD); sequential classifiers reach macro-F1 0.74, but eight clinical experts rate generic LLM empathy higher than persona-conditioned responses, even though the latter drive targeted shifts.
PolySpeech-100 is a new benchmark for native-level speech comprehension across 110 linguistic variants that evaluates 22 models and reports E2E advantages on dialects, robustness gaps on low-resource languages, and degradation from Chain-of-Thought prompting.
Predictive Entropy Maximization performs competitive blind source separation using only local error-driven and Hebbian updates derived from a surrogate entropy objective with spectral error bounds.
KL regularization aligning model predictions with empirical transition patterns improves macro-F1 by 9-42% in next dialogue act prediction on German counselling data and transfers to other datasets.
Introduces the LDD task, ListenForge dataset built from five listening head generation methods, and MANet model that detects listening forgeries via motion inconsistencies guided by audio semantics.
Momentum SGD incurs a provable drift-amplification penalty in nonstationary stochastic optimization that makes it worse than vanilla SGD in drift-dominated regimes, confirmed by finite-time upper bounds and minimax lower bounds under gradient-variation constraints.
CleanCodec reframes audio tokenization as a selective information bottleneck to encode only perceptually important features at 12.5 tokens per second, outperforming prior codecs in efficiency, speaker similarity, and intelligibility.
TB-AVA uses text-mediated gated semantic modulation to enable efficient audio-visual alignment, achieving state-of-the-art results on AVE, AVS, and AVVP benchmarks.
SCALE disentangles emotion and cause representations in conversations and uses optimal transport for many-to-many global alignment, achieving SOTA on ECPEC benchmarks.
PRISM-CTG is the first large-scale foundation model for cardiotocography that uses multi-view self-supervised learning on unlabeled data to learn transferable representations, outperforming baselines on seven downstream tasks with external validation.
STAMP adapter enables general time series foundation models to match specialized EEG foundation models on clinical classification tasks across 8 benchmarks while using few trainable parameters.
Step-Audio 2 integrates a latent audio encoder, reasoning-centric reinforcement learning, and discrete audio token generation into language modeling to deliver state-of-the-art performance on audio understanding and conversational benchmarks.
Qwen-Audio trains a unified model on diverse audio types and tasks with hierarchical tags to enable strong zero-shot performance on audio understanding benchmarks and multi-turn audio chat.
FusionSense uses server-side fusion learning, filter-out-safe labels, and edge compaction to enable runtime-adaptive multimodal sensing that cuts energy up to 33x while preserving task quality on RGB+Depth data.
WorldSpeech supplies 65k hours of multilingual aligned speech data across 76 languages and delivers 63.5% average relative WER reduction after fine-tuning ASR models on 11 typologically diverse languages.
GaborNet replaces sinc functions with Gabor filters in raw-audio neural networks and is tested for audio spoof detection with augmentations in RawNet2 and RawGAT-ST.
R-FLoRA combines Laplacian residual statistics with a frozen vision transformer via gated low-rank adapters, residual fusion, and contrastive alignment to achieve better accuracy and generalization than prior single-image face morphing attack detectors.
Qwen3.5-Omni scales an omnimodal model to hundreds of billions of parameters with 256k context, introduces ARIA for stable speech synthesis, and reports SOTA performance on 215 audio-visual benchmarks while adding multilingual and audio-visual coding capabilities.
Kimi-Audio is an open-source audio foundation model that achieves state-of-the-art results on speech recognition, audio understanding, question answering, and conversation after pre-training on more than 13 million hours of speech, sound, and music data.
The paper introduces Recursive QLSTM via metacore recursion, numerically tests variants on sequence lengths, and offers theoretical arguments for better temporal propagation.
Qwen2-Audio is an open-source audio-language model that outperforms prior systems such as Gemini-1.5-pro on audio-centric instruction-following benchmarks after simplified prompt-based pre-training and expanded data.
citing papers explorer
- Recursive QLSTM with Dynamic Variational Quantum Circuit Adaptation
The paper introduces Recursive QLSTM via metacore recursion, numerically tests variants on sequence lengths, and offers theoretical arguments for better temporal propagation.
- Self-Modulating Quantum Fast-Weight Programmers for Efficient Adaptive Sequential Learning
Self-Modulating QFWP adds adaptive modulation to quantum fast-weight updates and memory to improve stability and performance on sequential learning tasks.