Insights into deep non-linear filters for improved multi-channel speech enhancement
9 Pith papers cite this work.
citing papers explorer
- JUST-DUB-IT: Video Dubbing via Joint Audio-Visual Diffusion
  JUST-DUB-IT adapts a joint audio-visual diffusion model via LoRA to generate high-quality dubbed videos with translated audio and lip-synced facial motion.
- GRM: Utility-Aware Jailbreak Attacks on Audio LLMs via Gradient-Ratio Masking
  GRM ranks Mel bands by attack contribution versus utility sensitivity, perturbs a subset, and learns a universal perturbation to reach an 88.46% average jailbreak success rate with an improved attack-utility trade-off on four audio LLMs.
- EchoAvatar: Real-time Generative Avatar Animation from Audio Streams
  EchoAvatar presents a streaming architecture for low-latency full-body animation from incremental audio, with RL refinement and LLM tool-call control, outperforming real-time baselines.
- MOSS-Audio Technical Report
  MOSS-Audio is an audio-language model using a 12.5 Hz encoder, DeepStack cross-layer injection, time markers, and an event-preserving annotation pipeline for unified audio understanding.
- RADAR Challenge 2026: Robust Audio Deepfake Recognition under Media Transformations
  RADAR Challenge 2026 organizes a multilingual audio deepfake detection benchmark with media transformations, reporting participation from 33 development and 22 evaluation teams and using the EER metric.
- Qwen3-VL-Embedding and Qwen3-VL-Reranker: A Unified Framework for State-of-the-Art Multimodal Retrieval and Ranking
  Qwen3-VL-Embedding-8B achieves state-of-the-art performance with a 77.8 overall score on the MMEB-V2 multimodal embedding benchmark.
- Towards the Anonymization of the Language Modeling
  The authors introduce MLM and CLM specialization methods that avoid memorizing identifiers in sensitive training data while aiming for a privacy-utility tradeoff on medical datasets.
- The Master-Slave Encoder Model for Improving Patent Text Summarization: A New Approach to Combining Specifications and Claims
  MSEA uses a master-slave encoder architecture on patent specifications and claims, enhanced with pointer networks and repetition suppression, to generate better summaries as measured by small ROUGE score gains.
- Spatial Speech Perception Systems: A Survey of Sound Source Localization, Directional Enhancement, and Speech Recognition
  A survey of spatial speech perception systems covering sound source localization, directional enhancement, and automatic speech recognition methods and their integration.