pith. sign in

arxiv: 2509.17765 · v1 · submitted 2025-09-22 · 💻 cs.CL · cs.AI· cs.CV· eess.AS

Qwen3-Omni Technical Report

Pith reviewed 2026-05-11 00:15 UTC · model grok-4.3

classification 💻 cs.CL cs.AIcs.CVeess.AS
keywords multimodal AIQwen3-Omniaudio-visual modelspeech synthesismixture of expertsstreaming generationmultilingual support
0
0 comments X

The pith

Qwen3-Omni maintains state-of-the-art performance on text, image, audio, and video tasks in a single model.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Qwen3-Omni as a unified multimodal model that achieves top performance across multiple input types without the usual trade-offs. It uses a Thinker-Talker mixture-of-experts setup to handle both understanding and generation for text, images, audio, and video. On audio tasks it leads on most benchmarks while also supporting real-time speech output with low latency. This suggests it is possible to build one model that does not sacrifice accuracy when adding more capabilities.

Core claim

Qwen3-Omni maintains state-of-the-art performance across text, image, audio, and video without any degradation relative to single-modal counterparts. It adopts a Thinker-Talker MoE architecture that unifies perception and generation, yielding fluent text and natural real-time speech. Across 36 audio and audio-visual benchmarks, it achieves open-source SOTA on 32 and overall SOTA on 22, outperforming closed-source models like Gemini-2.5-Pro.

What carries the argument

The Thinker-Talker MoE architecture, which separates thinking and talking components to unify multimodal perception and generation, combined with multi-codebook discrete speech codecs for low-latency streaming synthesis.

If this is right

  • Matches performance of same-sized single-modal Qwen models on all modalities.
  • Excels on audio tasks, leading 32 out of 36 benchmarks.
  • Supports text in 119 languages, speech understanding in 19, and generation in 10.
  • Enables theoretical end-to-end first-packet latency of 234 ms for streaming speech.
  • Provides a fine-tuned Captioner variant for detailed audio descriptions with low hallucination.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This approach could allow future models to integrate even more modalities without performance loss.
  • The low-latency streaming method might extend to other generative tasks beyond speech.
  • Releasing the Captioner model could accelerate development of better audio analysis tools.
  • The Thinking model variant demonstrates explicit reasoning over any input modality.

Load-bearing premise

The selected 36 audio and audio-visual benchmarks represent real-world multimodal performance without bias from benchmark choice or evaluation setup.

What would settle it

Results on a new, independently designed set of multimodal benchmarks where Qwen3-Omni shows clear degradation compared to specialized single-modal models.

read the original abstract

We present Qwen3-Omni, a single multimodal model that, for the first time, maintains state-of-the-art performance across text, image, audio, and video without any degradation relative to single-modal counterparts. Qwen3-Omni matches the performance of same-sized single-modal models within the Qwen series and excels particularly on audio tasks. Across 36 audio and audio-visual benchmarks, Qwen3-Omni achieves open-source SOTA on 32 benchmarks and overall SOTA on 22, outperforming strong closed-source models such as Gemini-2.5-Pro, Seed-ASR, and GPT-4o-Transcribe. Qwen3-Omni adopts a Thinker-Talker MoE architecture that unifies perception and generation across text, images, audio, and video, yielding fluent text and natural real-time speech. It supports text interaction in 119 languages, speech understanding in 19 languages, and speech generation in 10 languages. To reduce first-packet latency in streaming synthesis, Talker autoregressively predicts discrete speech codecs using a multi-codebook scheme. Leveraging the representational capacity of these codebooks, we replace computationally intensive block-wise diffusion with a lightweight causal ConvNet, enabling streaming from the first codec frame. In cold-start settings, Qwen3-Omni achieves a theoretical end-to-end first-packet latency of 234 ms. To further strengthen multimodal reasoning, we introduce a Thinking model that explicitly reasons over inputs from any modality. Since the research community currently lacks a general-purpose audio captioning model, we fine-tuned Qwen3-Omni-30B-A3B to obtain Qwen3-Omni-30B-A3B-Captioner, which produces detailed, low-hallucination captions for arbitrary audio inputs. Qwen3-Omni-30B-A3B, Qwen3-Omni-30B-A3B-Thinking, and Qwen3-Omni-30B-A3B-Captioner are publicly released under the Apache 2.0 license.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces Qwen3-Omni, a unified multimodal model using a Thinker-Talker MoE architecture for perception and generation across text, image, audio, and video. It claims to match same-sized single-modal Qwen models with no degradation on text/image/video tasks while achieving open-source SOTA on 32 of 36 audio/audio-visual benchmarks and overall SOTA on 22, outperforming closed-source systems such as Gemini-2.5-Pro, Seed-ASR, and GPT-4o-Transcribe. Additional contributions include multi-language support (119 text, 19 speech understanding, 10 speech generation), a multi-codebook streaming synthesis method yielding 234 ms theoretical first-packet latency via causal ConvNet, a Thinking model for multimodal reasoning, and a fine-tuned audio captioner variant; the 30B-A3B, Thinking, and Captioner models are released under Apache 2.0.

Significance. If the no-degradation and SOTA claims are substantiated by controlled, reproducible evaluations, the work would represent a meaningful advance in unified multimodal systems by showing that a single model can avoid typical cross-modal trade-offs while adding practical streaming and captioning capabilities. The open release and focus on audio excellence would facilitate community follow-up and applications in multilingual settings.

major comments (2)
  1. [Abstract] Abstract: the central claim that Qwen3-Omni 'maintains state-of-the-art performance across text, image, audio, and video without any degradation relative to single-modal counterparts' and 'matches the performance of same-sized single-modal models within the Qwen series' is load-bearing, yet no quantitative tables, error bars, ablation results, or protocol details (prompt templates, decoding, data versions) are referenced to support direct head-to-head comparisons under identical conditions.
  2. [Abstract] Abstract (audio benchmarks paragraph): the assertion of open-source SOTA on 32/36 and overall SOTA on 22 benchmarks versus Gemini-2.5-Pro, Seed-ASR, and GPT-4o-Transcribe rests on unstated evaluation equivalence; without disclosed re-evaluation of baselines under the same setup or exclusion rules, the cross-model superiority cannot be verified and directly affects the 'excels particularly on audio tasks' contribution.
minor comments (2)
  1. [Abstract] The abstract lists language support counts (119/19/10) but does not indicate whether these are supported in all modalities or only specific ones; a clarifying sentence or table would improve precision.
  2. [Abstract] The multi-codebook streaming mechanism and replacement of block-wise diffusion by causal ConvNet are described at high level; a short diagram or pseudocode would aid reproducibility of the 234 ms latency claim.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful review and for recognizing the potential impact of Qwen3-Omni. We address the two major comments on the abstract below, providing point-by-point clarifications drawn from the full manuscript and committing to targeted revisions that improve transparency without altering the reported results.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the central claim that Qwen3-Omni 'maintains state-of-the-art performance across text, image, audio, and video without any degradation relative to single-modal counterparts' and 'matches the performance of same-sized single-modal models within the Qwen series' is load-bearing, yet no quantitative tables, error bars, ablation results, or protocol details (prompt templates, decoding, data versions) are referenced to support direct head-to-head comparisons under identical conditions.

    Authors: We appreciate this observation. The manuscript contains the requested quantitative support in Sections 4 and 5. Section 4 presents head-to-head comparisons on text, image, and video benchmarks (Tables 1–4) against the corresponding single-modal Qwen2.5 and Qwen2 models of matching size, with per-task scores, standard deviations where multiple seeds were run, and explicit statements that no degradation occurs. Section 5 extends this to audio and audio-visual tasks (Tables 5–8). Ablations on the Thinker-Talker MoE routing, modality-specific adapters, and codebook usage appear in Section 6. Full protocol details—including prompt templates, decoding parameters (temperature, top-p), data versions, and benchmark splits—are provided in Section 3.3 and the appendix. To address the referee’s concern directly, we will revise the abstract to include explicit cross-references (e.g., “as shown in Tables 2 and 5 and detailed in Section 3.3”). This change makes the load-bearing claim traceable while preserving the abstract’s brevity. revision: yes

  2. Referee: [Abstract] Abstract (audio benchmarks paragraph): the assertion of open-source SOTA on 32/36 and overall SOTA on 22 benchmarks versus Gemini-2.5-Pro, Seed-ASR, and GPT-4o-Transcribe rests on unstated evaluation equivalence; without disclosed re-evaluation of baselines under the same setup or exclusion rules, the cross-model superiority cannot be verified and directly affects the 'excels particularly on audio tasks' contribution.

    Authors: We agree that evaluation equivalence must be stated clearly. The 32/36 open-source SOTA and 22 overall SOTA counts are derived from the standardized benchmark suite described in Section 5. For open-source models we report our own runs under identical prompts and decoding settings; for closed-source systems (Gemini-2.5-Pro, GPT-4o-Transcribe, Seed-ASR) we used the latest publicly released API versions with the exact same benchmark inputs and post-processing rules as our model. Any exclusions (e.g., language-specific subsets or modality mismatches) are enumerated in the appendix table that accompanies each benchmark. We will add a concise clarifying clause to the abstract (“evaluated under consistent protocols; see Section 5 and Appendix B”) and expand the evaluation paragraph in Section 5 to list the precise API versions, prompt templates, and exclusion criteria used for each baseline. These revisions will allow independent verification of the audio-task superiority claim. revision: partial

Circularity Check

0 steps flagged

No circularity; performance claims rest on external benchmark comparisons

full rationale

The paper presents Qwen3-Omni as a multimodal model whose central claims are empirical: it matches single-modal Qwen baselines and achieves SOTA on 32 of 36 audio benchmarks while outperforming closed-source models. These results are reported as direct evaluations rather than derived quantities. No equations, fitted parameters renamed as predictions, or self-referential definitions appear in the architecture description (Thinker-Talker MoE) or latency techniques. The fine-tuning for the Captioner variant is an explicit post-training step, not a circular derivation. Any self-citations (if present in the full text) are not load-bearing for the performance assertions, which rely on external benchmarks. The derivation chain is therefore self-contained against independent test sets.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axioms · 0 invented entities

Abstract provides no explicit free parameters, axioms, or invented entities beyond standard large-model training assumptions; the design implicitly assumes that multimodal data can be unified via MoE without trade-offs and that discrete speech codecs plus causal ConvNet suffice for low-latency generation.

free parameters (2)
  • MoE expert count and routing parameters
    Chosen to allocate capacity across text, image, audio, and video modalities during training.
  • Multi-codebook speech codec configuration
    Selected to enable autoregressive prediction and low first-packet latency.
axioms (1)
  • domain assumption A single model can match specialized single-modal performance across modalities when using appropriate architecture and training.
    Central premise underlying the no-degradation claim.

pith-pipeline@v0.9.0 · 5825 in / 1466 out tokens · 79190 ms · 2026-05-11T00:15:39.055048+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Omni-DeepSearch: A Benchmark for Audio-Driven Omni-Modal Deep Search

    cs.SD 2026-05 unverdicted novelty 8.0

    Omni-DeepSearch is a 640-sample benchmark for audio-driven omni-modal search where the best model reaches only 43.44% accuracy, exposing bottlenecks in audio inference, tool use, and cross-modal reasoning.

  2. TraceAV-Bench: Benchmarking Multi-Hop Trajectory Reasoning over Long Audio-Visual Videos

    cs.CV 2026-05 unverdicted novelty 8.0

    TraceAV-Bench is the first benchmark for multi-hop trajectory reasoning over long audio-visual videos, showing top models reach only 51-68% accuracy with substantial room for improvement.

  3. RefereeBench: Are Video MLLMs Ready to be Multi-Sport Referees

    cs.CV 2026-04 unverdicted novelty 8.0

    RefereeBench shows that even the strongest video MLLMs reach only around 60% accuracy on multi-sport refereeing tasks and struggle with rule application and temporal grounding.

  4. VoxSafeBench: Not Just What Is Said, but Who, How, and Where

    cs.SD 2026-04 unverdicted novelty 8.0

    VoxSafeBench reveals that speech language models recognize social norms from text but fail to apply them when acoustic cues like speaker or scene determine the appropriate response.

  5. EgoSound: Benchmarking Sound Understanding in Egocentric Videos

    cs.CV 2026-02 unverdicted novelty 8.0

    EgoSound is a new benchmark with 7315 QA pairs across seven tasks to evaluate egocentric sound understanding in multimodal large language models.

  6. Cornfigurator: Automated Planning for Any-to-Any Multimodal Model Serving

    cs.LG 2025-12 conditional novelty 8.0

    Cornfigurator is the first automated deployment planner for generic any-to-any multimodal models that explores the full range of colocation-to-disaggregation strategies and delivers 1.12x to 6.32x higher goodput than ...

  7. VideoOdyssey: A Benchmark for Ultra-Long-Context and Omni-Modal Video Understanding

    cs.CV 2026-05 unverdicted novelty 7.0

    VideoOdyssey is a new benchmark featuring ultra-long videos (avg. 109 min) across 11 domains with multi-level continuous certificates (avg. 16 min for visual, 12.8 min for audio-visual) to diagnose MLLM limitations in...

  8. Perception or Prejudice: Can MLLMs Go Beyond First Impressions of Personality?

    cs.AI 2026-05 unverdicted novelty 7.0

    Introduces the Grounded Personality Reasoning task and MM-OCEAN dataset to show that MLLMs frequently produce correct Big Five personality ratings without grounding them in observable video evidence.

  9. Seizure-Semiology-Suite (S3): A Clinically Multimodal Dataset, Benchmark, and Models for Seizure Semiology Understanding

    cs.CV 2026-05 unverdicted novelty 7.0

    Seizure-Semiology-Suite provides a new clinically annotated video dataset and hierarchical benchmark that exposes weaknesses in current MLLMs for seizure semiology and demonstrates gains from fine-tuning and a neuro-s...

  10. OmniPro: A Comprehensive Benchmark for Omni-Proactive Streaming Video Understanding

    cs.CV 2026-05 unverdicted novelty 7.0

    OmniPro is the first benchmark jointly evaluating omni-modal perception, proactive responding, and diverse streaming video understanding tasks using a dual-mode protocol on 2700 samples.

  11. InstructAV2AV: Instruction-Guided Audio-Video Joint Editing

    cs.CV 2026-05 unverdicted novelty 7.0

    InstructAV2AV is an end-to-end instruction-guided audio-video joint editing model that adapts a pre-trained backbone with gated attention and two-stage training, outperforming prior methods on 11 metrics after buildin...

  12. CBT-Audio: Evaluating Audio Language Models for Patient-Side Distress Intensity Estimation in CBT Session Recordings

    cs.AI 2026-05 unverdicted novelty 7.0

    CBT-Audio dataset shows that adding audio input improves distress intensity estimation over transcripts alone for 8 of 10 audio language models, with clearest gains when verbal content and vocal delivery diverge.

  13. From Text to Voice: A Reproducible and Verifiable Framework for Evaluating Tool Calling LLM Agents

    cs.CL 2026-05 unverdicted novelty 7.0

    A dataset-agnostic framework converts text tool-calling benchmarks to paired audio versions via TTS and noise, showing model-dependent performance with small text-to-voice gaps of 1.8-4.8 points on Confetti and When2Call.

  14. Boosting Omni-Modal Language Models: Staged Post-Training with Visually Debiased Evaluation

    cs.MM 2026-05 unverdicted novelty 7.0

    Visual debiasing of omni-modal benchmarks combined with staged post-training lets a 3B model match or exceed a 30B model without a stronger teacher.

  15. FLARE: Full-Modality Long-Video Audiovisual Retrieval Benchmark with User-Simulated Queries

    cs.MM 2026-05 unverdicted novelty 7.0

    FLARE is a new benchmark with 399 long videos, 87k multimodal clips, and 275k user-style queries for testing audiovisual retrieval under caption and query regimes.

  16. Omni-Persona: Systematic Benchmarking and Improving Omnimodal Personalization

    cs.CV 2026-05 unverdicted novelty 7.0

    Omni-Persona benchmark with 18 tasks shows open-source models have audio-visual grounding gaps, RLVR narrows them but leads to conservative outputs, and scale or recall alone fail as diagnostics.

  17. MIST: Multimodal Interactive Speech-based Tool-calling Conversational Assistants for Smart Homes

    cs.CL 2026-05 unverdicted novelty 7.0

    MIST is a new synthetic speech-based tool-calling dataset for IoT devices that exposes performance gaps between open- and closed-weight multimodal LLMs.

  18. Fusion in Your Way: Aligning Image Fusion with Heterogeneous Demands via Direct Preference Optimization

    cs.CV 2026-05 unverdicted novelty 7.0

    DPOFusion uses direct preference optimization on property-aligned and preference-controllable latent diffusion models to produce adaptive infrared-visible image fusions aligned with heterogeneous human and machine vis...

  19. Minimizing Modality Gap from the Input Side: Your Speech LLM Can Be a Prosody-Aware Text LLM

    cs.CL 2026-05 unverdicted novelty 7.0

    TextPro-SLM minimizes the speech-text modality gap from the input side via a prosody-aware unified encoder, delivering the lowest gap and strong performance at 3B/7B scales with only ~1000 hours of audio.

  20. Sparse Tokens Suffice: Jailbreaking Audio Language Models via Token-Aware Gradient Optimization

    cs.CR 2026-05 conditional novelty 7.0

    Sparse selection of high-gradient-energy audio tokens suffices for effective jailbreaking of audio language models with minimal drop in attack success rate.

  21. EmoTrans: A Benchmark for Understanding, Reasoning, and Predicting Emotion Transitions in Multimodal LLMs

    cs.CV 2026-04 unverdicted novelty 7.0

    EmoTrans is a new video benchmark with four progressive tasks that measures how well current multimodal LLMs handle dynamic emotion transitions rather than static recognition.

  22. StoryTR: Narrative-Centric Video Temporal Retrieval with Theory of Mind Reasoning

    cs.AI 2026-04 unverdicted novelty 7.0

    StoryTR is a new benchmark and agentic data pipeline that adds explicit Theory of Mind reasoning chains to train smaller video retrieval models, yielding a 15% relative IoU gain over larger baselines on narrative content.

  23. Listening with Time: Precise Temporal Awareness for Long-Form Audio Understanding

    eess.AS 2026-04 unverdicted novelty 7.0

    LAT-Audio introduces a global-to-local reasoning approach with TWA-CoT that outperforms prior models on temporal tasks for audio up to 30 minutes.

  24. SpeechParaling-Bench: A Comprehensive Benchmark for Paralinguistic-Aware Speech Generation

    cs.CL 2026-04 unverdicted novelty 7.0

    SpeechParaling-Bench is a new evaluation framework for paralinguistic-aware speech generation that reveals major limitations in current large audio-language models.

  25. ATIR: Towards Audio-Text Interleaved Contextual Retrieval

    cs.SD 2026-04 unverdicted novelty 7.0

    Defines ATIR task and benchmark for mixed audio-text queries; MLLM model with token compression shows substantial gains over strong baselines.

  26. From Reactive to Proactive: Assessing the Proactivity of Voice Agents via ProVoice-Bench

    cs.AI 2026-04 unverdicted novelty 7.0

    ProVoice-Bench is the first framework to evaluate proactive voice agents, revealing that state-of-the-art multimodal LLMs struggle with over-triggering and context-aware reasoning.

  27. Chain of Modality: From Static Fusion to Dynamic Orchestration in Omni-MLLMs

    cs.CV 2026-04 unverdicted novelty 7.0

    Chain of Modality dynamically orchestrates multimodal input topologies and bifurcates cognitive execution to overcome static fusion biases in Omni-MLLMs.

  28. HumDial-EIBench: A Human-Recorded Multi-Turn Emotional Intelligence Benchmark for Audio Language Models

    eess.AS 2026-04 unverdicted novelty 7.0

    HumDial-EIBench is a new benchmark using real human dialogues to evaluate audio language models on emotional intelligence tasks including multi-turn tracking, causal reasoning, empathy generation, and acoustic-semanti...

  29. Script-a-Video: Deep Structured Audio-visual Captions via Factorized Streams and Relational Grounding

    cs.CV 2026-04 unverdicted novelty 7.0

    MTSS replaces monolithic video captions with factorized streams and relational grounding, yielding reported gains in understanding benchmarks and generation consistency.

  30. OmniScript: Towards Audio-Visual Script Generation for Long-Form Cinematic Video

    cs.CV 2026-04 unverdicted novelty 7.0

    OmniScript is a new 8B omni-modal model that turns long cinematic videos into scene-by-scene scripts and matches top proprietary models on temporal localization and semantic accuracy.

  31. VidAudio-Bench: Benchmarking V2A and VT2A Generation across Four Audio Categories

    cs.SD 2026-04 unverdicted novelty 7.0

    VidAudio-Bench benchmarks V2A and VT2A models across four audio categories, revealing poor speech/singing performance and a tension between visual alignment and text instruction following.

  32. SiMing-Bench: Evaluating Procedural Correctness from Continuous Interactions in Clinical Skill Videos

    cs.CV 2026-04 unverdicted novelty 7.0

    SiMing-Bench shows current MLLMs have weak agreement with physicians on procedural correctness in clinical videos, with intermediate step judgments remaining poor even when overall scores look acceptable.

  33. CapTalk: Unified Voice Design for Single-Utterance and Dialogue Speech Generation

    cs.SD 2026-04 unverdicted novelty 7.0

    CapTalk unifies single-utterance and dialogue voice design via utterance- and speaker-level captions plus a hierarchical variational module for stable timbre with adaptive expression.

  34. Speaker-Reasoner: Scaling Interaction Turns and Reasoning Patterns for Timestamped Speaker-Attributed ASR

    eess.AS 2026-04 unverdicted novelty 7.0

    Speaker-Reasoner is an end-to-end speech LLM that iteratively analyzes audio structure, predicts temporal boundaries, and jointly models speaker identity, gender, timestamps, and transcription using a speaker-aware ca...

  35. OmniGUI: Benchmarking GUI Agents in Omni-Modal Smartphone Environments

    cs.HC 2026-04 unverdicted novelty 7.0

    OmniGUI is the first step-level benchmark supplying interleaved image, audio, and video inputs across 709 expert episodes in 29 smartphone apps to evaluate multimodal GUI agents.

  36. KoALa-Bench: Evaluating Large Audio Language Models on Korean Speech Understanding and Faithfulness

    cs.CL 2026-03 unverdicted novelty 7.0

    KoALa-Bench is a new public benchmark with six tasks that tests Korean speech recognition, translation, question answering, instruction following, and faithfulness in large audio language models.

  37. DecepGPT: Schema-Driven Deception Detection with Multicultural Datasets and Robust Multimodal Learning

    cs.CV 2026-03 unverdicted novelty 7.0

    A new 1695-sample multicultural dataset plus two modules for stable multimodal fusion and modality consistency yield state-of-the-art deception detection with cross-cultural transfer.

  38. TiCo: Time-Controllable Spoken Dialogue Model

    cs.CL 2026-03 unverdicted novelty 7.0

    TiCo enables spoken dialogue models to follow explicit time constraints in generated responses using Spoken Time Markers and reinforcement learning with verifiable rewards, cutting duration error by 2.7x over its backbone.

  39. OmniTrace: A Unified Framework for Generation-Time Attribution in Omni-Modal LLMs

    cs.CL 2026-03 unverdicted novelty 7.0

    OmniTrace converts token-level signals into span-level cross-modal attributions for open-ended generation in omni-modal LLMs via generation-time tracing.

  40. SCP: Spatial Causal Prediction in Video

    cs.CV 2026-03 unverdicted novelty 7.0

    SCP defines a new benchmark task for predicting spatial causal outcomes beyond direct observation and shows that 23 leading models lag far behind humans on it.

  41. Watching, Reasoning, and Searching: A Video Deep Research Benchmark on Open Web for Agentic Video Reasoning

    cs.CV 2026-01 unverdicted novelty 7.0

    VideoDR is a new benchmark for open-web video deep research that tests multimodal models on cross-frame visual anchor extraction, interactive retrieval, and multi-hop reasoning over joint video-web evidence.

  42. Omni2Sound: Towards Unified Video-Text-to-Audio Generation

    cs.SD 2026-01 unverdicted novelty 7.0

    A single DiT-based diffusion model unifies video-to-audio, text-to-audio, and joint video-text-to-audio generation, supported by a new 470k-pair dataset and three-stage progressive training that resolves task competition.

  43. M$^3$KG-RAG: Multi-hop Multimodal Knowledge Graph-enhanced Retrieval-Augmented Generation

    cs.CL 2025-12 unverdicted novelty 7.0

    M³KG-RAG improves multimodal reasoning in large language models by constructing multi-hop knowledge graphs and selectively pruning retrieved context with GRASP.

  44. Generative Giants, Retrieval Weaklings: Why do Multimodal Large Language Models Fail at Multimodal Retrieval?

    cs.CV 2025-12 conditional novelty 7.0

    MLLM representation spaces are dominated by textual semantics that reduce discriminative power for multimodal retrieval; a whitening transformation called ReAlign corrects the geometry and boosts zero-shot performance.

  45. VideoASMR-Bench: Can AI-Generated ASMR Videos Fool VLMs and Humans?

    cs.CV 2025-12 unverdicted novelty 7.0

    VideoASMR-Bench shows state-of-the-art VLMs fail to reliably detect AI-generated ASMR videos from real ones, though humans can still identify the fakes relatively easily.

  46. AVI-Edit: Audio-sync Video Instance Editing with Granularity-Aware Mask Refiner

    cs.CV 2025-12 unverdicted novelty 7.0

    AVI-Edit enables precise audio-synchronized instance-level video editing via a granularity-aware mask refiner, a self-feedback audio agent, and a new large-scale annotated dataset.

  47. See, Hear, and Understand: Benchmarking Audiovisual Human Speech Understanding in Multimodal Large Language Models

    cs.CV 2025-12 unverdicted novelty 7.0

    AV-SpeakerBench is a new speaker-centered benchmark showing that top multimodal models still struggle with fine-grained audiovisual speech understanding, with Gemini 2.5 Pro leading but open models lagging on fusion.

  48. ArchMap: Arch-Flattening and Knowledge-Guided Vision Language Model for Tooth Counting and Structured Dental Understanding

    cs.CV 2025-11 unverdicted novelty 7.0

    ArchMap combines geometric arch-flattening with a dental knowledge base to guide VLMs for accurate tooth counting and structured understanding of 3D intraoral scans without training.

  49. WorldSense: Evaluating Real-world Omnimodal Understanding for Multimodal LLMs

    cs.CV 2025-02 unverdicted novelty 7.0

    WorldSense provides the first benchmark requiring synergistic audio-video-text understanding on 1,662 real-world videos and 3,172 QA pairs, where the best current multimodal LLM reaches only 65.1% accuracy.

  50. Multimodal LLMs under Pairwise Modalities

    cs.CV 2026-05 unverdicted novelty 6.0

    A two-stage framework enables multimodal LLMs to learn shared latent representations from pairwise modality data and achieve cross-modal generation when incorporating new modalities.

  51. OScaR: The Occam's Razor for Extreme KV Cache Quantization in LLMs and Beyond

    cs.LG 2026-05 unverdicted novelty 6.0

    OScaR mitigates token norm imbalance via canalized rotation and omni-token scaling to enable near-lossless INT2 KV cache quantization with up to 3x decoding speedup and 5.3x memory reduction.

  52. RE-VLM: Event-Augmented Vision-Language Model for Scene Understanding

    cs.CV 2026-05 unverdicted novelty 6.0

    RE-VLM is the first dual-stream VLM combining RGB and event data with a graph-based pipeline to generate training captions and QA pairs, showing gains over RGB-only and event-only models on new datasets for challengin...

  53. WavFlow: Audio Generation in Waveform Space

    cs.SD 2026-05 conditional novelty 6.0

    WavFlow performs direct waveform audio generation via flow matching on 2D token grids from raw patches plus amplitude lifting, matching latent-based methods on VGGSound and AudioCaps without intermediate compression.

  54. Acoustic Interference: A New Paradigm Weaponizing Acoustic Latent Semantic for Universal Jailbreak against Large Audio Language Models

    cs.CR 2026-05 unverdicted novelty 6.0

    AIA generates universal interference audio infused with Acoustic Latent Semantics to bypass LALM safety alignment, achieving SOTA attack success rates on 10 models across five datasets.

  55. See What I Mean: Aligning Vision and Language Representations for Video Fine-grained Object Understanding

    cs.CV 2026-05 unverdicted novelty 6.0

    SWIM aligns cross-attention maps from object nouns to ground-truth masks during training on the new NL-Refer dataset to enable text-only fine-grained video object understanding in MLLMs.

  56. S2Accompanist: A Semantic-Aware and Structure-Guided Diffusion Model for Music Accompaniment Generation

    eess.AS 2026-05 unverdicted novelty 6.0

    S2Accompanist is a 402M-parameter semantic-aware diffusion model that achieves SOTA on the ATTM Grand Challenge benchmark for music accompaniment generation via automated data processing and structure-guided VAE fine-tuning.

  57. From Text to Voice: A Reproducible and Verifiable Framework for Evaluating Tool Calling LLM Agents

    cs.CL 2026-05 unverdicted novelty 6.0

    A dataset-agnostic framework converts text tool-calling benchmarks to paired audio evaluations via TTS, speaker variation and noise, then evaluates seven omni-modal models showing model- and task-dependent performance...

  58. SpeakerLLM: A Speaker-Specialized Audio-LLM for Speaker Understanding and Verification Reasoning

    cs.SD 2026-05 unverdicted novelty 6.0

    SpeakerLLM unifies speaker profiling, recording-condition understanding, and structured verification reasoning in an audio-LLM via a hierarchical tokenizer and decision traces.

  59. When Vision Speaks for Sound

    cs.CV 2026-05 unverdicted novelty 6.0

    Video MLLMs show an audio-visual Clever Hans effect relying on visual-acoustic correlations rather than audio verification; Thud interventions diagnose it and a 10K-sample preference alignment improves intervention pe...

  60. Towards Fine-Grained Multi-Dimensional Speech Understanding: Data Pipeline, Benchmark, and Model

    eess.AS 2026-05 unverdicted novelty 6.0

    A data pipeline, 14-dimension benchmark, and decoupled fine-tuning model are presented to advance fine-grained multi-dimensional speech understanding in LLMs.

Reference graph

Works this paper leans on

37 extracted references · 37 canonical work pages · cited by 119 Pith papers · 26 internal anchors

  1. [1]

    Seed-TTS: A Family of High-Quality Versatile Speech Generation Models

    URL https://artofproblemsolving.com/wiki/index.php/A IME_Problems_and_Solutions. Philip Anastassiou, Jiawei Chen, Jitong Chen, Yuanzhe Chen, Zhuo Chen, Ziyi Chen, Jian Cong, Lelai Deng, Chuang Ding, Lu Gao, et al. Seed-tts: A family of high-quality versatile speech generation models.arXiv preprint arXiv:2406.02430,

  2. [2]

    URL https://www-cdn.anthropic.com/de8ba9b01c9ab7cbabf5c33b80b7bbc618857627/Model_Card_Claude_ 3.pdf. Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, Binyuan Hui, Luo Ji, Mei Li, Junyang Lin, Runji Lin, Dayiheng Liu, Gao Liu, Chengqiang Lu, Keming Lu, Jianxin Ma, Rui Men, Xingzhang Ren, Xuancheng...

  3. [3]

    Are We on the Right Way for Evaluating Large Vision-Language Models?

    Lin Chen, Jinsong Li, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Zehui Chen, Haodong Duan, Jiaqi Wang, Yu Qiao, Dahua Lin, et al. Are we on the right way for evaluating large vision-language models? arXiv:2403.20330, 2024a. Yiming Chen, Xianghu Yue, Chen Zhang, Xiaoxue Gao, Robby T Tan, and Haizhou Li. Voicebench: Benchmarking llm-based voice assistants.arXiv p...

  4. [4]

    Qwen-Audio: Advancing Universal Audio Understanding via Unified Large-Scale Audio-Language Models

    21 Yunfei Chu, Jin Xu, Xiaohuan Zhou, Qian Yang, Shiliang Zhang, Zhijie Yan, Chang Zhou, and Jingren Zhou. Qwen-Audio: Advancing universal audio understanding via unified large-scale audio-language models.CoRR, abs/2311.07919,

  5. [5]

    Qwen2-Audio Technical Report

    Yunfei Chu, Jin Xu, Qian Yang, Haojie Wei, Xipin Wei, Zhifang Guo, Yichong Leng, Yuanjun Lv, Jinzheng He, Junyang Lin, et al. Qwen2-audio technical report.arXiv preprint arXiv:2407.10759,

  6. [6]

    Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

    Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities.arXiv preprint arXiv:2507.06261,

  7. [7]

    CosyVoice 2: Scalable Streaming Speech Synthesis with Large Language Models

    Zhihao Du, Yuxuan Wang, Qian Chen, Xian Shi, Xiang Lv, Tianyu Zhao, Zhifu Gao, Yexin Yang, Changfeng Gao, Hui Wang, et al. Cosyvoice 2: Scalable streaming speech synthesis with large language models.arXiv preprint arXiv:2412.10117,

  8. [8]

    CosyVoice 3: Towards In-the-wild Speech Generation via Scaling-up and Post-training

    Zhihao Du, Changfeng Gao, Yuxuan Wang, Fan Yu, Tianyu Zhao, Hao Wang, Xiang Lv, Hui Wang, Chongjia Ni, Xian Shi, Keyu An, Guanrou Yang, Yabin Li, Yanni Chen, Zhifu Gao, Qian Chen, Yue Gu, Mengzhe Chen, Yafeng Chen, Shiliang Zhang, Wen Wang, and Jieping Ye. Cosyvoice 3: Towards in-the-wild speech generation via scaling-up and post-training.CoRR, abs/2505.17589,

  9. [9]

    Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, Anirudh Goyal, Anthony Hartshorn, Aobo Yang, Archi Mitra, Archie Sravankumar, Artem Korenev, Arthur Hinsvark, Arun Rao, Aston Zhang, Aurélien Rodriguez, Austen Gregerson, Ava Spataru, Baptiste Rozière, Bethany...

  10. [10]

    Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis

    Chaoyou Fu, Yuhan Dai, Yondong Luo, Lei Li, Shuhuai Ren, Renrui Zhang, Zihan Wang, Chenyu Zhou, Yunhang Shen, Mengdan Zhang, et al. Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis.arXiv:2405.21075,

  11. [11]

    Are we done with mmlu? CoRR, abs/2406.04127,

    Aryo Pradipta Gema, Joshua Ong Jun Leang, Giwon Hong, Alessio Devoto, Alberto Carlo Maria Mancino, Rohit Saxena, Xuanli He, Yu Zhao, Xiaotang Du, Mohammad Reza Ghasemi Madani, et al. Are we done with mmlu?CoRR, abs/2406.04127,

  12. [12]

    Audio Flamingo 3: Advancing Audio Intelligence with Fully Open Large Audio Language Models

    URL https://storage.googleapis.com/deepmind-media/gemini/gemi ni_v1_5_report.pdf. Arushi Goel, Sreyan Ghosh, Jaehyeon Kim, Sonal Kumar, Zhifeng Kong, Sang-gil Lee, Chao-Han Huck Yang, Ramani Duraiswami, Dinesh Manocha, Rafael Valle, et al. Audio flamingo 3: Advancing audio intelligence with fully open large audio language models.arXiv preprint arXiv:2507.08128,

  13. [13]

    Hallusionbench: An advanced diagnostic suite for entangled language hallucination and visual illusion in large vision-language models

    Tianrui Guan, Fuxiao Liu, Xiyang Wu, Ruiqi Xian, Zongxia Li, Xiaoyu Liu, Xijun Wang, Lichang Chen, Furong Huang, Yaser Yacoob, Dinesh Manocha, and Tianyi Zhou. Hallusionbench: An advanced diagnostic suite for entangled language hallucination and visual illusion in large vision-language models. InIEEE/CVF Conference on Computer Vision and Pattern Recogniti...

  14. [14]

    Multi-if: Benchmarking llms on multi-turn and multilingual instructions following

    22 Yun He, Di Jin, Chaoqi Wang, Chloe Bi, Karishma Mandyam, Hejia Zhang, Chen Zhu, Ning Li, Tengyu Xu, Hongjiang Lv, Shruti Bhosale, Chenguang Zhu, Karthik Abinav Sankararaman, Eryk Helenowski, Melanie Kambadur, Aditya Tayade, Hao Ma, Han Fang, and Sinong Wang. Multi-if: Benchmarking llms on multi-turn and multilingual instructions following.CoRR, abs/2410.15553,

  15. [15]

    Multi-if: Benchmarking llms on multi-turn and multilingual instructions following

    doi: 10.48550 /ARXIV.2410.15553. URLhttps://doi.org/10.48550/arXiv.2410.15553. Jack Hong, Shilin Yan, Jiayin Cai, Xiaolong Jiang, Yao Hu, and Weidi Xie. Worldsense: Evaluating real-world omnimodal understanding for multimodal llms.CoRR, abs/2502.04326,

  16. [16]

    BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models

    Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models.arXiv:2301.12597,

  17. [17]

    Zebralogic: On the scaling limits of llms for logical reasoning.arXiv preprint arXiv:2502.01100, 2025

    Bill Yuchen Lin, Ronan Le Bras, Kyle Richardson, Ashish Sabharwal, Radha Poovendran, Peter Clark, and Yejin Choi. ZebraLogic: On the scaling limits of LLMs for logical reasoning.CoRR, abs/2502.01100,

  18. [18]

    Visual Instruction Tuning

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning.arXiv:2304.08485,

  19. [19]

    ChartQA: A Benchmark for Question Answering about Charts with Visual and Logical Reasoning

    Ahmed Masry, Do Xuan Long, Jia Qing Tan, Shafiq Joty, and Enamul Hoque. Chartqa: A benchmark for question answering about charts with visual and logical reasoning.arXiv:2203.10244,

  20. [20]

    URL https://github.com/openai/openai-python/blob/e389823ba013a24b4c3 2ce38fa0bd87e6bccae94/chatml.md. OpenAI. GPT4 technical report.CoRR, abs/2303.08774,

  21. [21]

    Roni Paiss, Ariel Ephrat, Omer Tov, Shiran Zada, Inbar Mosseri, Michal Irani, and Tali Dekel

    URLhttps://eqbench.com/creative_writing.html. Roni Paiss, Ariel Ephrat, Omer Tov, Shiran Zada, Inbar Mosseri, Michal Irani, and Tali Dekel. Teaching CLIP to count to ten. InIEEE/CVF International Conference on Computer Vision, ICCV 2023, Paris, France, October 1-6, 2023, pp. 3147–3157. IEEE,

  22. [22]

    David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R. Bowman. GPQA: A graduate-level Google-proof Q&A benchmark. CoRR, abs/2311.12022,

  23. [23]

    MMAU: A Massive Multi-Task Audio Understanding and Reasoning Benchmark

    URLhttps://arxiv.org/abs/2410.19168. Jianlin Su, Murtadha H. M. Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. Roformer: Enhanced Transformer with rotary position embedding.Neurocomputing, 568:127063,

  24. [24]

    video-SALMONN 2: Caption-enhanced audio-visual large language models

    Changli Tang, Yixuan Li, Yudong Yang, Jimin Zhuang, Guangzhi Sun, Wei Li, Zejun Ma, and Chao Zhang. video-salmonn 2: Captioning-enhanced audio-visual large language models.CoRR, abs/2506.15220,

  25. [25]

    Llama 2: Open Foundation and Fine-Tuned Chat Models

    23 Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models.arXiv:2307.09288,

  26. [26]

    SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features

    Michael Tschannen, Alexey Gritsenko, Xiao Wang, Muhammad Ferjad Naeem, Ibrahim Alabdulmohsin, Nikhil Parthasarathy, Talfan Evans, Lucas Beyer, Ye Xia, Basil Mustafa, Olivier Hénaff, Jeremiah Harmsen, Andreas Steiner, and Xiaohua Zhai. Siglip 2: Multilingual vision-language encoders with improved semantic understanding, localization, and dense features.htt...

  27. [27]

    MMSU: A Massive Multi-task Spoken Language Understanding and Reasoning Benchmark

    Dingdong Wang, Jincenzi Wu, Junan Li, Dongchao Yang, Xueyuan Chen, Tianhua Zhang, and Helen Meng. MMSU: A massive multi-task spoken language understanding and reasoning benchmark.CoRR, abs/2506.04779, 2025a. doi: 10.48550/ARXIV.2506.04779. URL https://doi.org/10.48550/arXiv.250 6.04779. Ke Wang, Junting Pan, Weikang Shi, Zimu Lu, Mingjie Zhan, and Hongshe...

  28. [28]

    Qwen2 Technical Report

    An Yang, Baosong Yang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Zhou, Chengpeng Li, Chengyuan Li, Dayiheng Liu, Fei Huang, et al. Qwen2 technical report.arXiv:2407.10671,

  29. [29]

    Qwen3 Technical Report

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report.arXiv preprint arXiv:2505.09388, 2025a. Qize Yang, Shimin Yao, Weixuan chen, Shenghao Fu, Detao Bai, Jiaxing Zhao, Boyuan Sun, Bowen Yin, Xihan Wei, and Jingren Zhou. Humanomniv2: From understanding to omni-m...

  30. [30]

    MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI

    24 Xiang Yue, Yuansheng Ni, Kai Zhang, Tianyu Zheng, Ruoqi Liu, Ge Zhang, Samuel Stevens, Dongfu Jiang, Weiming Ren, Yuxuan Sun, et al. Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi.arXiv:2311.16502,

  31. [31]

    MMMU-Pro: A More Robust Multi-discipline Multimodal Understanding Benchmark

    Xiang Yue, Tianyu Zheng, Yuansheng Ni, Yubo Wang, Kai Zhang, Shengbang Tong, Yuxuan Sun, Ming Yin, Botao Yu, Ge Zhang, et al. Mmmu-pro: A more robust multi-discipline multimodal understanding benchmark.arXiv preprint arXiv:2409.02813,

  32. [32]

    Are you really listening? Boosting Perceptual Awareness in Music-QA Benchmarks,

    Yongyi Zang, Sean O’Brien, Taylor Berg-Kirkpatrick, Julian McAuley, and Zachary Novack. Are you really listening? boosting perceptual awareness in music-qa benchmarks.arXiv preprint arXiv:2504.00369,

  33. [33]

    Minimax- speech: Intrinsic zero-shot text-to-speech with a learnable speaker encoder,

    Bowen Zhang, Congchao Guo, Geng Yang, Hang Yu, Haozhe Zhang, Heidi Lei, Jialong Mai, Junjie Yan, Kaiyue Yang, Mingqi Yang, Peikai Huang, Ruiyang Jin, Sitan Jiang, Weihua Cheng, Yawei Li, Yichen Xiao, Yiying Zhou, Yongmao Zhang, Yuan Lu, and Yucen He. Minimax-speech: Intrinsic zero-shot text-to-speech with a learnable speaker encoder.CoRR, abs/2505.07916,

  34. [34]

    Group Sequence Policy Optimization

    Chujie Zheng, Shixuan Liu, Mingze Li, Xiong-Hui Chen, Bowen Yu, Chang Gao, Kai Dang, Yuqiong Liu, Rui Men, An Yang, et al. Group sequence policy optimization.arXiv preprint arXiv:2507.18071,

  35. [35]

    Instruction-Following Evaluation for Large Language Models

    Jeffrey Zhou, Tianjian Lu, Swaroop Mishra, Siddhartha Brahma, Sujoy Basu, Yi Luan, Denny Zhou, and Le Hou. Instruction-following evaluation for large language models.CoRR, abs/2311.07911,

  36. [36]

    Mlvu: Benchmarking multi-task long video understanding

    Junjie Zhou, Yan Shu, Bo Zhao, Boya Wu, Zhengyang Liang, Shitao Xiao, Minghao Qin, Xi Yang, Yongping Xiong, Bo Zhang, Tiejun Huang, and Zheng Liu. MLVU: benchmarking multi-task long video understanding. InIEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2025, Nashville, TN, USA, June 11-15, 2025, pp. 13691–13701. Computer Vision Founda...

  37. [37]

    Haina Zhu, Yizhi Zhou, Hangting Chen, Jianwei Yu, Ziyang Ma, Rongzhi Gu, Yi Luo, Wei Tan, and Xie Chen

    Haina Zhu, Yizhi Zhou, Hangting Chen, Jianwei Yu, Ziyang Ma, Rongzhi Gu, Yi Luo, Wei Tan, and Xie Chen. Muq: Self-supervised music representation learning with mel residual vector quantization. arXiv preprint arXiv:2501.01108,