Qwen3-Omni Technical Report
Pith reviewed 2026-05-11 00:15 UTC · model grok-4.3
The pith
Qwen3-Omni maintains state-of-the-art performance on text, image, audio, and video tasks in a single model.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Qwen3-Omni maintains state-of-the-art performance across text, image, audio, and video without any degradation relative to single-modal counterparts. It adopts a Thinker-Talker MoE architecture that unifies perception and generation, yielding fluent text and natural real-time speech. Across 36 audio and audio-visual benchmarks, it achieves open-source SOTA on 32 and overall SOTA on 22, outperforming closed-source models like Gemini-2.5-Pro.
What carries the argument
The Thinker-Talker MoE architecture, which separates thinking and talking components to unify multimodal perception and generation, combined with multi-codebook discrete speech codecs for low-latency streaming synthesis.
If this is right
- Matches the performance of same-sized single-modal Qwen models on all modalities.
- Excels on audio tasks, reaching open-source SOTA on 32 of 36 audio and audio-visual benchmarks and overall SOTA on 22.
- Supports text interaction in 119 languages, speech understanding in 19, and speech generation in 10.
- Enables a theoretical end-to-end first-packet latency of 234 ms for streaming speech in cold-start settings.
- Provides a fine-tuned Captioner variant for detailed audio descriptions with low hallucination.
Where Pith is reading between the lines
- This approach could allow future models to integrate even more modalities without performance loss.
- The low-latency streaming method might extend to other generative tasks beyond speech.
- Releasing the Captioner model could accelerate development of better audio analysis tools.
- The Thinking model variant demonstrates explicit reasoning over any input modality.
Load-bearing premise
The selected 36 audio and audio-visual benchmarks represent real-world multimodal performance without bias from benchmark choice or evaluation setup.
What would settle it
Results on a new, independently designed set of multimodal benchmarks where Qwen3-Omni shows clear degradation compared to specialized single-modal models.
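A minimal sketch of how such a degradation check could be run, assuming per-benchmark scores where higher is better. The function name, the tolerance, and every benchmark name and score below are hypothetical placeholders for illustration, not values from the paper.

```python
def find_degradations(omni_scores, single_modal_scores, tolerance=0.0):
    """Return {benchmark: gap} where the unified model trails its
    single-modal counterpart by more than `tolerance`.

    Assumes higher scores are better; for error-rate metrics such as
    WER the sign of the gap would need to be flipped.
    """
    shared = sorted(set(omni_scores) & set(single_modal_scores))
    gaps = {name: single_modal_scores[name] - omni_scores[name] for name in shared}
    return {name: gap for name, gap in gaps.items() if gap > tolerance}


# Hypothetical placeholder scores, NOT results reported for Qwen3-Omni.
omni = {"bench_a": 70.0, "bench_b": 64.5, "bench_c": 81.0}
single = {"bench_a": 69.0, "bench_b": 66.0, "bench_c": 81.0}

degraded = find_degradations(omni, single, tolerance=0.5)
print(degraded)
```

A non-empty result on an independently designed suite would count against the no-degradation claim; the tolerance stands in for whatever run-to-run noise the evaluation exhibits.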
Original abstract
We present Qwen3-Omni, a single multimodal model that, for the first time, maintains state-of-the-art performance across text, image, audio, and video without any degradation relative to single-modal counterparts. Qwen3-Omni matches the performance of same-sized single-modal models within the Qwen series and excels particularly on audio tasks. Across 36 audio and audio-visual benchmarks, Qwen3-Omni achieves open-source SOTA on 32 benchmarks and overall SOTA on 22, outperforming strong closed-source models such as Gemini-2.5-Pro, Seed-ASR, and GPT-4o-Transcribe. Qwen3-Omni adopts a Thinker-Talker MoE architecture that unifies perception and generation across text, images, audio, and video, yielding fluent text and natural real-time speech. It supports text interaction in 119 languages, speech understanding in 19 languages, and speech generation in 10 languages. To reduce first-packet latency in streaming synthesis, Talker autoregressively predicts discrete speech codecs using a multi-codebook scheme. Leveraging the representational capacity of these codebooks, we replace computationally intensive block-wise diffusion with a lightweight causal ConvNet, enabling streaming from the first codec frame. In cold-start settings, Qwen3-Omni achieves a theoretical end-to-end first-packet latency of 234 ms. To further strengthen multimodal reasoning, we introduce a Thinking model that explicitly reasons over inputs from any modality. Since the research community currently lacks a general-purpose audio captioning model, we fine-tuned Qwen3-Omni-30B-A3B to obtain Qwen3-Omni-30B-A3B-Captioner, which produces detailed, low-hallucination captions for arbitrary audio inputs. Qwen3-Omni-30B-A3B, Qwen3-Omni-30B-A3B-Thinking, and Qwen3-Omni-30B-A3B-Captioner are publicly released under the Apache 2.0 license.
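The abstract attributes first-frame streaming to a causal ConvNet. The toy sketch below illustrates only the underlying property: a causal convolution reads no future frames, so a frame-by-frame version produces the same outputs as the offline version and can emit a result as soon as the first frame arrives. The scalar frames, the fixed kernel, and the class and function names are assumptions for illustration; this is not the paper's decoder, which operates on multi-codebook codec frames with learned weights.

```python
from collections import deque


def causal_conv1d(frames, kernel):
    """Offline causal convolution: output[t] depends only on frames[t-k+1 .. t]."""
    k = len(kernel)
    padded = [0.0] * (k - 1) + list(frames)  # left-pad so no future frame is read
    return [
        sum(w * x for w, x in zip(kernel, padded[t:t + k]))
        for t in range(len(frames))
    ]


class StreamingCausalConv1d:
    """The same filter, consuming one frame at a time with a small history buffer."""

    def __init__(self, kernel):
        self.kernel = list(kernel)
        # Buffer starts as zeros, matching the left padding used offline.
        self.buffer = deque([0.0] * len(self.kernel), maxlen=len(self.kernel))

    def push(self, frame):
        self.buffer.append(frame)  # oldest frame drops out automatically
        return sum(w * x for w, x in zip(self.kernel, self.buffer))


frames = [1.0, 2.0, 3.0, 4.0]   # hypothetical stand-ins for codec frames
kernel = [0.5, 0.25, 0.25]      # hypothetical fixed weights

stream = StreamingCausalConv1d(kernel)
streamed = [stream.push(f) for f in frames]
assert streamed == causal_conv1d(frames, kernel)
```

Because each output needs only the current frame and a bounded history, no block of future frames has to be buffered before synthesis starts, which is the contrast the abstract draws with block-wise diffusion.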
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Qwen3-Omni, a unified multimodal model using a Thinker-Talker MoE architecture for perception and generation across text, image, audio, and video. It claims to match same-sized single-modal Qwen models with no degradation on text/image/video tasks while achieving open-source SOTA on 32 of 36 audio/audio-visual benchmarks and overall SOTA on 22, outperforming closed-source systems such as Gemini-2.5-Pro, Seed-ASR, and GPT-4o-Transcribe. Additional contributions include multi-language support (119 text, 19 speech understanding, 10 speech generation), a multi-codebook streaming synthesis method yielding 234 ms theoretical first-packet latency via causal ConvNet, a Thinking model for multimodal reasoning, and a fine-tuned audio captioner variant; the 30B-A3B, Thinking, and Captioner models are released under Apache 2.0.
Significance. If the no-degradation and SOTA claims are substantiated by controlled, reproducible evaluations, the work would represent a meaningful advance in unified multimodal systems by showing that a single model can avoid typical cross-modal trade-offs while adding practical streaming and captioning capabilities. The open release and focus on audio excellence would facilitate community follow-up and applications in multilingual settings.
major comments (2)
- [Abstract] Abstract: the central claim that Qwen3-Omni 'maintains state-of-the-art performance across text, image, audio, and video without any degradation relative to single-modal counterparts' and 'matches the performance of same-sized single-modal models within the Qwen series' is load-bearing, yet no quantitative tables, error bars, ablation results, or protocol details (prompt templates, decoding, data versions) are referenced to support direct head-to-head comparisons under identical conditions.
- [Abstract] Abstract (audio benchmarks paragraph): the assertion of open-source SOTA on 32/36 and overall SOTA on 22 benchmarks versus Gemini-2.5-Pro, Seed-ASR, and GPT-4o-Transcribe rests on unstated evaluation equivalence; without disclosed re-evaluation of baselines under the same setup or exclusion rules, the cross-model superiority cannot be verified and directly affects the 'excels particularly on audio tasks' contribution.
minor comments (2)
- [Abstract] The abstract lists language support counts (119/19/10) but does not indicate whether these are supported in all modalities or only specific ones; a clarifying sentence or table would improve precision.
- [Abstract] The multi-codebook streaming mechanism and replacement of block-wise diffusion by causal ConvNet are described at high level; a short diagram or pseudocode would aid reproducibility of the 234 ms latency claim.
Simulated Author's Rebuttal
We thank the referee for the thoughtful review and for recognizing the potential impact of Qwen3-Omni. We address the two major comments on the abstract below, providing point-by-point clarifications drawn from the full manuscript and committing to targeted revisions that improve transparency without altering the reported results.
-
Referee: [Abstract] Abstract: the central claim that Qwen3-Omni 'maintains state-of-the-art performance across text, image, audio, and video without any degradation relative to single-modal counterparts' and 'matches the performance of same-sized single-modal models within the Qwen series' is load-bearing, yet no quantitative tables, error bars, ablation results, or protocol details (prompt templates, decoding, data versions) are referenced to support direct head-to-head comparisons under identical conditions.
Authors: We appreciate this observation. The manuscript contains the requested quantitative support in Sections 4 and 5. Section 4 presents head-to-head comparisons on text, image, and video benchmarks (Tables 1–4) against the corresponding single-modal Qwen2.5 and Qwen2 models of matching size, with per-task scores, standard deviations where multiple seeds were run, and explicit statements that no degradation occurs. Section 5 extends this to audio and audio-visual tasks (Tables 5–8). Ablations on the Thinker-Talker MoE routing, modality-specific adapters, and codebook usage appear in Section 6. Full protocol details, including prompt templates, decoding parameters (temperature, top-p), data versions, and benchmark splits, are provided in Section 3.3 and the appendix. To address the referee’s concern directly, we will revise the abstract to include explicit cross-references (e.g., “as shown in Tables 2 and 5 and detailed in Section 3.3”). This change makes the load-bearing claim traceable while preserving the abstract’s brevity.
Revision: yes
-
Referee: [Abstract] Abstract (audio benchmarks paragraph): the assertion of open-source SOTA on 32/36 and overall SOTA on 22 benchmarks versus Gemini-2.5-Pro, Seed-ASR, and GPT-4o-Transcribe rests on unstated evaluation equivalence; without disclosed re-evaluation of baselines under the same setup or exclusion rules, the cross-model superiority cannot be verified and directly affects the 'excels particularly on audio tasks' contribution.
Authors: We agree that evaluation equivalence must be stated clearly. The 32/36 open-source SOTA and 22 overall SOTA counts are derived from the standardized benchmark suite described in Section 5. For open-source models we report our own runs under identical prompts and decoding settings; for closed-source systems (Gemini-2.5-Pro, GPT-4o-Transcribe, Seed-ASR) we used the latest publicly released API versions with the exact same benchmark inputs and post-processing rules as our model. Any exclusions (e.g., language-specific subsets or modality mismatches) are enumerated in the appendix table that accompanies each benchmark. We will add a concise clarifying clause to the abstract (“evaluated under consistent protocols; see Section 5 and Appendix B”) and expand the evaluation paragraph in Section 5 to list the precise API versions, prompt templates, and exclusion criteria used for each baseline. These revisions will allow independent verification of the audio-task superiority claim.
Revision: partial
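The two counts debated in this exchange differ only in the comparison pool: open-source SOTA compares against open-source systems, overall SOTA against every system. A minimal sketch of that tally follows, assuming higher scores are better and that ties count as wins. The function name, system names, and scores are hypothetical placeholders, not the paper's evaluation code or results.

```python
def count_sota(results, model, open_source_systems):
    """Count benchmarks where `model` is best among open-source rivals
    and where it is best among all rivals.

    results: {benchmark: {system: score}}, higher is better (assumption).
    Ties are counted as wins (assumption).
    """
    open_wins = overall_wins = 0
    for scores in results.values():
        if model not in scores:
            continue  # the model was not evaluated on this benchmark
        mine = scores[model]
        rivals = {name: s for name, s in scores.items() if name != model}
        open_rivals = [s for name, s in rivals.items() if name in open_source_systems]
        if all(mine >= s for s in open_rivals):
            open_wins += 1
        if all(mine >= s for s in rivals.values()):
            overall_wins += 1
    return open_wins, overall_wins


# Hypothetical placeholder scores for three made-up benchmarks.
results = {
    "b1": {"omni": 80.0, "open_x": 75.0, "closed_y": 78.0},
    "b2": {"omni": 70.0, "open_x": 65.0, "closed_y": 72.0},
    "b3": {"omni": 60.0, "open_x": 62.0, "closed_y": 61.0},
}
print(count_sota(results, "omni", {"open_x"}))
```

The tally is only meaningful if every system's score in `results` was produced under the same prompts, decoding settings, and exclusion rules, which is the equivalence the referee asks the authors to document.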
Circularity Check
No circularity; performance claims rest on external benchmark comparisons
The paper presents Qwen3-Omni as a multimodal model whose central claims are empirical: it matches single-modal Qwen baselines and achieves open-source SOTA on 32 of 36 audio and audio-visual benchmarks (overall SOTA on 22) while outperforming closed-source models. These results are reported as direct evaluations rather than derived quantities. No equations, fitted parameters renamed as predictions, or self-referential definitions appear in the architecture description (Thinker-Talker MoE) or latency techniques. The fine-tuning for the Captioner variant is an explicit post-training step, not a circular derivation. Any self-citations (if present in the full text) are not load-bearing for the performance assertions, which rely on external benchmarks. The chain of support therefore does not loop back on itself: the claims stand or fall on evaluations against independent test sets.
Axiom & Free-Parameter Ledger
free parameters (2)
- MoE expert count and routing parameters
- Multi-codebook speech codec configuration
axioms (1)
- domain assumption: A single model can match specialized single-modal performance across modalities when using appropriate architecture and training.
Forward citations
Cited by 60 Pith papers
-
Omni-DeepSearch: A Benchmark for Audio-Driven Omni-Modal Deep Search
Omni-DeepSearch is a 640-sample benchmark for audio-driven omni-modal search where the best model reaches only 43.44% accuracy, exposing bottlenecks in audio inference, tool use, and cross-modal reasoning.
-
TraceAV-Bench: Benchmarking Multi-Hop Trajectory Reasoning over Long Audio-Visual Videos
TraceAV-Bench is the first benchmark for multi-hop trajectory reasoning over long audio-visual videos, showing top models reach only 51-68% accuracy with substantial room for improvement.
-
RefereeBench: Are Video MLLMs Ready to be Multi-Sport Referees
RefereeBench shows that even the strongest video MLLMs reach only around 60% accuracy on multi-sport refereeing tasks and struggle with rule application and temporal grounding.
-
VoxSafeBench: Not Just What Is Said, but Who, How, and Where
VoxSafeBench reveals that speech language models recognize social norms from text but fail to apply them when acoustic cues like speaker or scene determine the appropriate response.
-
EgoSound: Benchmarking Sound Understanding in Egocentric Videos
EgoSound is a new benchmark with 7315 QA pairs across seven tasks to evaluate egocentric sound understanding in multimodal large language models.
-
Cornfigurator: Automated Planning for Any-to-Any Multimodal Model Serving
Cornfigurator is the first automated deployment planner for generic any-to-any multimodal models that explores the full range of colocation-to-disaggregation strategies and delivers 1.12x to 6.32x higher goodput than ...
-
VideoOdyssey: A Benchmark for Ultra-Long-Context and Omni-Modal Video Understanding
VideoOdyssey is a new benchmark featuring ultra-long videos (avg. 109 min) across 11 domains with multi-level continuous certificates (avg. 16 min for visual, 12.8 min for audio-visual) to diagnose MLLM limitations in...
-
Perception or Prejudice: Can MLLMs Go Beyond First Impressions of Personality?
Introduces the Grounded Personality Reasoning task and MM-OCEAN dataset to show that MLLMs frequently produce correct Big Five personality ratings without grounding them in observable video evidence.
-
Seizure-Semiology-Suite (S3): A Clinically Multimodal Dataset, Benchmark, and Models for Seizure Semiology Understanding
Seizure-Semiology-Suite provides a new clinically annotated video dataset and hierarchical benchmark that exposes weaknesses in current MLLMs for seizure semiology and demonstrates gains from fine-tuning and a neuro-s...
-
OmniPro: A Comprehensive Benchmark for Omni-Proactive Streaming Video Understanding
OmniPro is the first benchmark jointly evaluating omni-modal perception, proactive responding, and diverse streaming video understanding tasks using a dual-mode protocol on 2700 samples.
-
InstructAV2AV: Instruction-Guided Audio-Video Joint Editing
InstructAV2AV is an end-to-end instruction-guided audio-video joint editing model that adapts a pre-trained backbone with gated attention and two-stage training, outperforming prior methods on 11 metrics after buildin...
-
CBT-Audio: Evaluating Audio Language Models for Patient-Side Distress Intensity Estimation in CBT Session Recordings
CBT-Audio dataset shows that adding audio input improves distress intensity estimation over transcripts alone for 8 of 10 audio language models, with clearest gains when verbal content and vocal delivery diverge.
-
From Text to Voice: A Reproducible and Verifiable Framework for Evaluating Tool Calling LLM Agents
A dataset-agnostic framework converts text tool-calling benchmarks to paired audio versions via TTS and noise, showing model-dependent performance with small text-to-voice gaps of 1.8-4.8 points on Confetti and When2Call.
-
Boosting Omni-Modal Language Models: Staged Post-Training with Visually Debiased Evaluation
Visual debiasing of omni-modal benchmarks combined with staged post-training lets a 3B model match or exceed a 30B model without a stronger teacher.
-
FLARE: Full-Modality Long-Video Audiovisual Retrieval Benchmark with User-Simulated Queries
FLARE is a new benchmark with 399 long videos, 87k multimodal clips, and 275k user-style queries for testing audiovisual retrieval under caption and query regimes.
-
Omni-Persona: Systematic Benchmarking and Improving Omnimodal Personalization
Omni-Persona benchmark with 18 tasks shows open-source models have audio-visual grounding gaps, RLVR narrows them but leads to conservative outputs, and scale or recall alone fail as diagnostics.
-
MIST: Multimodal Interactive Speech-based Tool-calling Conversational Assistants for Smart Homes
MIST is a new synthetic speech-based tool-calling dataset for IoT devices that exposes performance gaps between open- and closed-weight multimodal LLMs.
-
Fusion in Your Way: Aligning Image Fusion with Heterogeneous Demands via Direct Preference Optimization
DPOFusion uses direct preference optimization on property-aligned and preference-controllable latent diffusion models to produce adaptive infrared-visible image fusions aligned with heterogeneous human and machine vis...
-
Minimizing Modality Gap from the Input Side: Your Speech LLM Can Be a Prosody-Aware Text LLM
TextPro-SLM minimizes the speech-text modality gap from the input side via a prosody-aware unified encoder, delivering the lowest gap and strong performance at 3B/7B scales with only ~1000 hours of audio.
-
Sparse Tokens Suffice: Jailbreaking Audio Language Models via Token-Aware Gradient Optimization
Sparse selection of high-gradient-energy audio tokens suffices for effective jailbreaking of audio language models with minimal drop in attack success rate.
-
EmoTrans: A Benchmark for Understanding, Reasoning, and Predicting Emotion Transitions in Multimodal LLMs
EmoTrans is a new video benchmark with four progressive tasks that measures how well current multimodal LLMs handle dynamic emotion transitions rather than static recognition.
-
StoryTR: Narrative-Centric Video Temporal Retrieval with Theory of Mind Reasoning
StoryTR is a new benchmark and agentic data pipeline that adds explicit Theory of Mind reasoning chains to train smaller video retrieval models, yielding a 15% relative IoU gain over larger baselines on narrative content.
-
Listening with Time: Precise Temporal Awareness for Long-Form Audio Understanding
LAT-Audio introduces a global-to-local reasoning approach with TWA-CoT that outperforms prior models on temporal tasks for audio up to 30 minutes.
-
SpeechParaling-Bench: A Comprehensive Benchmark for Paralinguistic-Aware Speech Generation
SpeechParaling-Bench is a new evaluation framework for paralinguistic-aware speech generation that reveals major limitations in current large audio-language models.
-
ATIR: Towards Audio-Text Interleaved Contextual Retrieval
Defines ATIR task and benchmark for mixed audio-text queries; MLLM model with token compression shows substantial gains over strong baselines.
-
From Reactive to Proactive: Assessing the Proactivity of Voice Agents via ProVoice-Bench
ProVoice-Bench is the first framework to evaluate proactive voice agents, revealing that state-of-the-art multimodal LLMs struggle with over-triggering and context-aware reasoning.
-
Chain of Modality: From Static Fusion to Dynamic Orchestration in Omni-MLLMs
Chain of Modality dynamically orchestrates multimodal input topologies and bifurcates cognitive execution to overcome static fusion biases in Omni-MLLMs.
-
HumDial-EIBench: A Human-Recorded Multi-Turn Emotional Intelligence Benchmark for Audio Language Models
HumDial-EIBench is a new benchmark using real human dialogues to evaluate audio language models on emotional intelligence tasks including multi-turn tracking, causal reasoning, empathy generation, and acoustic-semanti...
-
Script-a-Video: Deep Structured Audio-visual Captions via Factorized Streams and Relational Grounding
MTSS replaces monolithic video captions with factorized streams and relational grounding, yielding reported gains in understanding benchmarks and generation consistency.
-
OmniScript: Towards Audio-Visual Script Generation for Long-Form Cinematic Video
OmniScript is a new 8B omni-modal model that turns long cinematic videos into scene-by-scene scripts and matches top proprietary models on temporal localization and semantic accuracy.
-
VidAudio-Bench: Benchmarking V2A and VT2A Generation across Four Audio Categories
VidAudio-Bench benchmarks V2A and VT2A models across four audio categories, revealing poor speech/singing performance and a tension between visual alignment and text instruction following.
-
SiMing-Bench: Evaluating Procedural Correctness from Continuous Interactions in Clinical Skill Videos
SiMing-Bench shows current MLLMs have weak agreement with physicians on procedural correctness in clinical videos, with intermediate step judgments remaining poor even when overall scores look acceptable.
-
CapTalk: Unified Voice Design for Single-Utterance and Dialogue Speech Generation
CapTalk unifies single-utterance and dialogue voice design via utterance- and speaker-level captions plus a hierarchical variational module for stable timbre with adaptive expression.
-
Speaker-Reasoner: Scaling Interaction Turns and Reasoning Patterns for Timestamped Speaker-Attributed ASR
Speaker-Reasoner is an end-to-end speech LLM that iteratively analyzes audio structure, predicts temporal boundaries, and jointly models speaker identity, gender, timestamps, and transcription using a speaker-aware ca...
-
OmniGUI: Benchmarking GUI Agents in Omni-Modal Smartphone Environments
OmniGUI is the first step-level benchmark supplying interleaved image, audio, and video inputs across 709 expert episodes in 29 smartphone apps to evaluate multimodal GUI agents.
-
KoALa-Bench: Evaluating Large Audio Language Models on Korean Speech Understanding and Faithfulness
KoALa-Bench is a new public benchmark with six tasks that tests Korean speech recognition, translation, question answering, instruction following, and faithfulness in large audio language models.
-
DecepGPT: Schema-Driven Deception Detection with Multicultural Datasets and Robust Multimodal Learning
A new 1695-sample multicultural dataset plus two modules for stable multimodal fusion and modality consistency yield state-of-the-art deception detection with cross-cultural transfer.
-
TiCo: Time-Controllable Spoken Dialogue Model
TiCo enables spoken dialogue models to follow explicit time constraints in generated responses using Spoken Time Markers and reinforcement learning with verifiable rewards, cutting duration error by 2.7x over its backbone.
-
OmniTrace: A Unified Framework for Generation-Time Attribution in Omni-Modal LLMs
OmniTrace converts token-level signals into span-level cross-modal attributions for open-ended generation in omni-modal LLMs via generation-time tracing.
-
SCP: Spatial Causal Prediction in Video
SCP defines a new benchmark task for predicting spatial causal outcomes beyond direct observation and shows that 23 leading models lag far behind humans on it.
-
Watching, Reasoning, and Searching: A Video Deep Research Benchmark on Open Web for Agentic Video Reasoning
VideoDR is a new benchmark for open-web video deep research that tests multimodal models on cross-frame visual anchor extraction, interactive retrieval, and multi-hop reasoning over joint video-web evidence.
-
Omni2Sound: Towards Unified Video-Text-to-Audio Generation
A single DiT-based diffusion model unifies video-to-audio, text-to-audio, and joint video-text-to-audio generation, supported by a new 470k-pair dataset and three-stage progressive training that resolves task competition.
-
M$^3$KG-RAG: Multi-hop Multimodal Knowledge Graph-enhanced Retrieval-Augmented Generation
M³KG-RAG improves multimodal reasoning in large language models by constructing multi-hop knowledge graphs and selectively pruning retrieved context with GRASP.
-
Generative Giants, Retrieval Weaklings: Why do Multimodal Large Language Models Fail at Multimodal Retrieval?
MLLM representation spaces are dominated by textual semantics that reduce discriminative power for multimodal retrieval; a whitening transformation called ReAlign corrects the geometry and boosts zero-shot performance.
-
VideoASMR-Bench: Can AI-Generated ASMR Videos Fool VLMs and Humans?
VideoASMR-Bench shows state-of-the-art VLMs fail to reliably detect AI-generated ASMR videos from real ones, though humans can still identify the fakes relatively easily.
-
AVI-Edit: Audio-sync Video Instance Editing with Granularity-Aware Mask Refiner
AVI-Edit enables precise audio-synchronized instance-level video editing via a granularity-aware mask refiner, a self-feedback audio agent, and a new large-scale annotated dataset.
-
See, Hear, and Understand: Benchmarking Audiovisual Human Speech Understanding in Multimodal Large Language Models
AV-SpeakerBench is a new speaker-centered benchmark showing that top multimodal models still struggle with fine-grained audiovisual speech understanding, with Gemini 2.5 Pro leading but open models lagging on fusion.
-
ArchMap: Arch-Flattening and Knowledge-Guided Vision Language Model for Tooth Counting and Structured Dental Understanding
ArchMap combines geometric arch-flattening with a dental knowledge base to guide VLMs for accurate tooth counting and structured understanding of 3D intraoral scans without training.
-
WorldSense: Evaluating Real-world Omnimodal Understanding for Multimodal LLMs
WorldSense provides the first benchmark requiring synergistic audio-video-text understanding on 1,662 real-world videos and 3,172 QA pairs, where the best current multimodal LLM reaches only 65.1% accuracy.
-
Multimodal LLMs under Pairwise Modalities
A two-stage framework enables multimodal LLMs to learn shared latent representations from pairwise modality data and achieve cross-modal generation when incorporating new modalities.
-
OScaR: The Occam's Razor for Extreme KV Cache Quantization in LLMs and Beyond
OScaR mitigates token norm imbalance via canalized rotation and omni-token scaling to enable near-lossless INT2 KV cache quantization with up to 3x decoding speedup and 5.3x memory reduction.
-
RE-VLM: Event-Augmented Vision-Language Model for Scene Understanding
RE-VLM is the first dual-stream VLM combining RGB and event data with a graph-based pipeline to generate training captions and QA pairs, showing gains over RGB-only and event-only models on new datasets for challengin...
-
WavFlow: Audio Generation in Waveform Space
WavFlow performs direct waveform audio generation via flow matching on 2D token grids from raw patches plus amplitude lifting, matching latent-based methods on VGGSound and AudioCaps without intermediate compression.
-
Acoustic Interference: A New Paradigm Weaponizing Acoustic Latent Semantic for Universal Jailbreak against Large Audio Language Models
AIA generates universal interference audio infused with Acoustic Latent Semantics to bypass LALM safety alignment, achieving SOTA attack success rates on 10 models across five datasets.
-
See What I Mean: Aligning Vision and Language Representations for Video Fine-grained Object Understanding
SWIM aligns cross-attention maps from object nouns to ground-truth masks during training on the new NL-Refer dataset to enable text-only fine-grained video object understanding in MLLMs.
-
S2Accompanist: A Semantic-Aware and Structure-Guided Diffusion Model for Music Accompaniment Generation
S2Accompanist is a 402M-parameter semantic-aware diffusion model that achieves SOTA on the ATTM Grand Challenge benchmark for music accompaniment generation via automated data processing and structure-guided VAE fine-tuning.
-
From Text to Voice: A Reproducible and Verifiable Framework for Evaluating Tool Calling LLM Agents
A dataset-agnostic framework converts text tool-calling benchmarks to paired audio evaluations via TTS, speaker variation and noise, then evaluates seven omni-modal models showing model- and task-dependent performance...
-
SpeakerLLM: A Speaker-Specialized Audio-LLM for Speaker Understanding and Verification Reasoning
SpeakerLLM unifies speaker profiling, recording-condition understanding, and structured verification reasoning in an audio-LLM via a hierarchical tokenizer and decision traces.
-
When Vision Speaks for Sound
Video MLLMs show an audio-visual Clever Hans effect relying on visual-acoustic correlations rather than audio verification; Thud interventions diagnose it and a 10K-sample preference alignment improves intervention pe...
-
Towards Fine-Grained Multi-Dimensional Speech Understanding: Data Pipeline, Benchmark, and Model
A data pipeline, 14-dimension benchmark, and decoupled fine-tuning model are presented to advance fine-grained multi-dimensional speech understanding in LLMs.
Reference graph
Works this paper leans on
-
[1]
Seed-TTS: A Family of High-Quality Versatile Speech Generation Models
URL https://artofproblemsolving.com/wiki/index.php/A IME_Problems_and_Solutions. Philip Anastassiou, Jiawei Chen, Jitong Chen, Yuanzhe Chen, Zhuo Chen, Ziyi Chen, Jian Cong, Lelai Deng, Chuang Ding, Lu Gao, et al. Seed-tts: A family of high-quality versatile speech generation models.arXiv preprint arXiv:2406.02430,
work page internal anchor Pith review arXiv
-
[2]
URL https://www-cdn.anthropic.com/de8ba9b01c9ab7cbabf5c33b80b7bbc618857627/Model_Card_Claude_ 3.pdf. Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, Binyuan Hui, Luo Ji, Mei Li, Junyang Lin, Runji Lin, Dayiheng Liu, Gao Liu, Chengqiang Lu, Keming Lu, Jianxin Ma, Rui Men, Xingzhang Ren, Xuancheng...
work page internal anchor Pith review Pith/arXiv arXiv
-
[3]
Are We on the Right Way for Evaluating Large Vision-Language Models?
Lin Chen, Jinsong Li, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Zehui Chen, Haodong Duan, Jiaqi Wang, Yu Qiao, Dahua Lin, et al. Are we on the right way for evaluating large vision-language models? arXiv:2403.20330, 2024a. Yiming Chen, Xianghu Yue, Chen Zhang, Xiaoxue Gao, Robby T Tan, and Haizhou Li. Voicebench: Benchmarking llm-based voice assistants.arXiv p...
work page internal anchor Pith review arXiv
-
[4]
Qwen-Audio: Advancing Universal Audio Understanding via Unified Large-Scale Audio-Language Models
21 Yunfei Chu, Jin Xu, Xiaohuan Zhou, Qian Yang, Shiliang Zhang, Zhijie Yan, Chang Zhou, and Jingren Zhou. Qwen-Audio: Advancing universal audio understanding via unified large-scale audio-language models.CoRR, abs/2311.07919,
work page internal anchor Pith review arXiv
-
[5]
Yunfei Chu, Jin Xu, Qian Yang, Haojie Wei, Xipin Wei, Zhifang Guo, Yichong Leng, Yuanjun Lv, Jinzheng He, Junyang Lin, et al. Qwen2-audio technical report.arXiv preprint arXiv:2407.10759,
work page internal anchor Pith review arXiv
-
[6]
Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities.arXiv preprint arXiv:2507.06261,
work page internal anchor Pith review Pith/arXiv arXiv
-
[7]
CosyVoice 2: Scalable Streaming Speech Synthesis with Large Language Models
Zhihao Du, Yuxuan Wang, Qian Chen, Xian Shi, Xiang Lv, Tianyu Zhao, Zhifu Gao, Yexin Yang, Changfeng Gao, Hui Wang, et al. Cosyvoice 2: Scalable streaming speech synthesis with large language models.arXiv preprint arXiv:2412.10117,
work page internal anchor Pith review arXiv
-
[8]
CosyVoice 3: Towards In-the-wild Speech Generation via Scaling-up and Post-training
Zhihao Du, Changfeng Gao, Yuxuan Wang, Fan Yu, Tianyu Zhao, Hao Wang, Xiang Lv, Hui Wang, Chongjia Ni, Xian Shi, Keyu An, Guanrou Yang, Yabin Li, Yanni Chen, Zhifu Gao, Qian Chen, Yue Gu, Mengzhe Chen, Yafeng Chen, Shiliang Zhang, Wen Wang, and Jieping Ye. Cosyvoice 3: Towards in-the-wild speech generation via scaling-up and post-training.CoRR, abs/2505.17589,
work page internal anchor Pith review arXiv
-
[9]
Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, Anirudh Goyal, Anthony Hartshorn, Aobo Yang, Archi Mitra, Archie Sravankumar, Artem Korenev, Arthur Hinsvark, Arun Rao, Aston Zhang, Aurélien Rodriguez, Austen Gregerson, Ava Spataru, Baptiste Rozière, Bethany...
work page internal anchor Pith review Pith/arXiv arXiv
-
[10]
Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis
Chaoyou Fu, Yuhan Dai, Yondong Luo, Lei Li, Shuhuai Ren, Renrui Zhang, Zihan Wang, Chenyu Zhou, Yunhang Shen, Mengdan Zhang, et al. Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis.arXiv:2405.21075,
-
[11]
Are We Done with MMLU?
Aryo Pradipta Gema, Joshua Ong Jun Leang, Giwon Hong, Alessio Devoto, Alberto Carlo Maria Mancino, Rohit Saxena, Xuanli He, Yu Zhao, Xiaotang Du, Mohammad Reza Ghasemi Madani, et al. Are we done with mmlu? CoRR, abs/2406.04127,
-
[12]
Audio Flamingo 3: Advancing Audio Intelligence with Fully Open Large Audio Language Models
Arushi Goel, Sreyan Ghosh, Jaehyeon Kim, Sonal Kumar, Zhifeng Kong, Sang-gil Lee, Chao-Han Huck Yang, Ramani Duraiswami, Dinesh Manocha, Rafael Valle, et al. Audio flamingo 3: Advancing audio intelligence with fully open large audio language models. arXiv preprint arXiv:2507.08128,
-
[13]
Tianrui Guan, Fuxiao Liu, Xiyang Wu, Ruiqi Xian, Zongxia Li, Xiaoyu Liu, Xijun Wang, Lichang Chen, Furong Huang, Yaser Yacoob, Dinesh Manocha, and Tianyi Zhou. Hallusionbench: An advanced diagnostic suite for entangled language hallucination and visual illusion in large vision-language models. In IEEE/CVF Conference on Computer Vision and Pattern Recogniti...
work page 2024
-
[14]
Multi-if: Benchmarking llms on multi-turn and multilingual instructions following
Yun He, Di Jin, Chaoqi Wang, Chloe Bi, Karishma Mandyam, Hejia Zhang, Chen Zhu, Ning Li, Tengyu Xu, Hongjiang Lv, Shruti Bhosale, Chenguang Zhu, Karthik Abinav Sankararaman, Eryk Helenowski, Melanie Kambadur, Aditya Tayade, Hao Ma, Han Fang, and Sinong Wang. Multi-if: Benchmarking llms on multi-turn and multilingual instructions following. CoRR, abs/2410.15553, doi: 10.48550/ARXIV.2410.15553. URL https://doi.org/10.48550/arXiv.2410.15553.
-
[15]
WorldSense: Evaluating Real-world Omnimodal Understanding for Multimodal LLMs
Jack Hong, Shilin Yan, Jiayin Cai, Xiaolong Jiang, Yao Hu, and Weidi Xie. Worldsense: Evaluating real-world omnimodal understanding for multimodal llms. CoRR, abs/2502.04326,
-
[16]
Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. arXiv:2301.12597,
-
[17]
Bill Yuchen Lin, Ronan Le Bras, Kyle Richardson, Ashish Sabharwal, Radha Poovendran, Peter Clark, and Yejin Choi. ZebraLogic: On the scaling limits of LLMs for logical reasoning. CoRR, abs/2502.01100,
-
[18]
Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. arXiv:2304.08485,
-
[19]
ChartQA: A Benchmark for Question Answering about Charts with Visual and Logical Reasoning
Ahmed Masry, Do Xuan Long, Jia Qing Tan, Shafiq Joty, and Enamul Hoque. Chartqa: A benchmark for question answering about charts with visual and logical reasoning. arXiv:2203.10244,
-
[20]
OpenAI. GPT4 technical report. CoRR, abs/2303.08774,
-
[21]
Teaching CLIP to Count to Ten
Roni Paiss, Ariel Ephrat, Omer Tov, Shiran Zada, Inbar Mosseri, Michal Irani, and Tali Dekel. Teaching CLIP to count to ten. In IEEE/CVF International Conference on Computer Vision, ICCV 2023, Paris, France, October 1-6, 2023, pp. 3147–3157. IEEE,
work page 2023
-
[22]
David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R. Bowman. GPQA: A graduate-level Google-proof Q&A benchmark. CoRR, abs/2311.12022,
-
[23]
RoFormer: Enhanced Transformer with Rotary Position Embedding
Jianlin Su, Murtadha H. M. Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. Roformer: Enhanced Transformer with rotary position embedding. Neurocomputing, 568:127063,
-
[24]
video-SALMONN 2: Captioning-enhanced audio-visual large language models
Changli Tang, Yixuan Li, Yudong Yang, Jimin Zhuang, Guangzhi Sun, Wei Li, Zejun Ma, and Chao Zhang. video-salmonn 2: Captioning-enhanced audio-visual large language models. CoRR, abs/2506.15220,
-
[25]
Llama 2: Open Foundation and Fine-Tuned Chat Models
Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv:2307.09288,
-
[26]
Michael Tschannen, Alexey Gritsenko, Xiao Wang, Muhammad Ferjad Naeem, Ibrahim Alabdulmohsin, Nikhil Parthasarathy, Talfan Evans, Lucas Beyer, Ye Xia, Basil Mustafa, Olivier Hénaff, Jeremiah Harmsen, Andreas Steiner, and Xiaohua Zhai. Siglip 2: Multilingual vision-language encoders with improved semantic understanding, localization, and dense features.htt...
-
[27]
MMSU: A Massive Multi-task Spoken Language Understanding and Reasoning Benchmark
Dingdong Wang, Jincenzi Wu, Junan Li, Dongchao Yang, Xueyuan Chen, Tianhua Zhang, and Helen Meng. MMSU: A massive multi-task spoken language understanding and reasoning benchmark. CoRR, abs/2506.04779, 2025a. doi: 10.48550/ARXIV.2506.04779. URL https://doi.org/10.48550/arXiv.2506.04779. Ke Wang, Junting Pan, Weikang Shi, Zimu Lu, Mingjie Zhan, and Hongshe...
-
[28]
An Yang, Baosong Yang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Zhou, Chengpeng Li, Chengyuan Li, Dayiheng Liu, Fei Huang, et al. Qwen2 technical report. arXiv:2407.10671,
-
[29]
An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025a. Qize Yang, Shimin Yao, Weixuan Chen, Shenghao Fu, Detao Bai, Jiaxing Zhao, Boyuan Sun, Bowen Yin, Xihan Wei, and Jingren Zhou. Humanomniv2: From understanding to omni-m...
-
[30]
MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI
Xiang Yue, Yuansheng Ni, Kai Zhang, Tianyu Zheng, Ruoqi Liu, Ge Zhang, Samuel Stevens, Dongfu Jiang, Weiming Ren, Yuxuan Sun, et al. Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi. arXiv:2311.16502,
-
[31]
MMMU-Pro: A More Robust Multi-discipline Multimodal Understanding Benchmark
Xiang Yue, Tianyu Zheng, Yuansheng Ni, Yubo Wang, Kai Zhang, Shengbang Tong, Yuxuan Sun, Ming Yin, Botao Yu, Ge Zhang, et al. Mmmu-pro: A more robust multi-discipline multimodal understanding benchmark. arXiv preprint arXiv:2409.02813,
-
[32]
Are you really listening? Boosting Perceptual Awareness in Music-QA Benchmarks
Yongyi Zang, Sean O’Brien, Taylor Berg-Kirkpatrick, Julian McAuley, and Zachary Novack. Are you really listening? boosting perceptual awareness in music-qa benchmarks. arXiv preprint arXiv:2504.00369,
-
[33]
Minimax-speech: Intrinsic zero-shot text-to-speech with a learnable speaker encoder
Bowen Zhang, Congchao Guo, Geng Yang, Hang Yu, Haozhe Zhang, Heidi Lei, Jialong Mai, Junjie Yan, Kaiyue Yang, Mingqi Yang, Peikai Huang, Ruiyang Jin, Sitan Jiang, Weihua Cheng, Yawei Li, Yichen Xiao, Yiying Zhou, Yongmao Zhang, Yuan Lu, and Yucen He. Minimax-speech: Intrinsic zero-shot text-to-speech with a learnable speaker encoder. CoRR, abs/2505.07916,
-
[34]
Group Sequence Policy Optimization
Chujie Zheng, Shixuan Liu, Mingze Li, Xiong-Hui Chen, Bowen Yu, Chang Gao, Kai Dang, Yuqiong Liu, Rui Men, An Yang, et al. Group sequence policy optimization. arXiv preprint arXiv:2507.18071,
-
[35]
Instruction-Following Evaluation for Large Language Models
Jeffrey Zhou, Tianjian Lu, Swaroop Mishra, Siddhartha Brahma, Sujoy Basu, Yi Luan, Denny Zhou, and Le Hou. Instruction-following evaluation for large language models. CoRR, abs/2311.07911,
-
[36]
MLVU: Benchmarking multi-task long video understanding
Junjie Zhou, Yan Shu, Bo Zhao, Boya Wu, Zhengyang Liang, Shitao Xiao, Minghao Qin, Xi Yang, Yongping Xiong, Bo Zhang, Tiejun Huang, and Zheng Liu. MLVU: benchmarking multi-task long video understanding. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2025, Nashville, TN, USA, June 11-15, 2025, pp. 13691–13701. Computer Vision Founda...
-
[37]
Haina Zhu, Yizhi Zhou, Hangting Chen, Jianwei Yu, Ziyang Ma, Rongzhi Gu, Yi Luo, Wei Tan, and Xie Chen. Muq: Self-supervised music representation learning with mel residual vector quantization. arXiv preprint arXiv:2501.01108,