Qwen3-Omni Technical Report

Jin Xu, Zhifang Guo, Hangrui Hu, Yunfei Chu, Xiong Wang, Jinzheng He, Yuxuan Wang, Xian Shi, Ting He, Xinfa Zhu, Yuanjun Lv, Yongqi Wang, Dake Guo, He Wang, Linhan Ma, Pei Zhang, Xinyu Zhang, Hongkun Hao, Zishan Guo, Baosong Yang, Bin Zhang · 2025

4 Pith papers cite this work. Polarity classification is still indexing.

representative citing papers

OmniEncoder: See, Hear, and Feel Continuous Motion Like Humans With One Encoder

cs.CV · 2026-05-02 · unverdicted · novelty 6.0

Omni-Encoder encodes video and audio at a matched rate of 25 fps with one Transformer that adds three new components, improving results on fine-grained motion tasks while matching baselines on audio-visual benchmarks.

UAF: A Unified Audio Front-end LLM for Full-Duplex Speech Interaction

cs.AI · 2026-04-21 · unverdicted · novelty 6.0

UAF is presented as the first unified audio front-end LLM. It casts multiple front-end tasks as a single sequence-prediction problem, processing streaming audio chunks and reference prompts to output semantic and control tokens for full-duplex interaction.

HumanOmni-Speaker: Identifying Who said What and When

cs.CV · 2026-03-23 · unverdicted · novelty 6.0

HumanOmni-Speaker introduces a Visual Delta Encoder and VR-SDR benchmark that enable end-to-end speaker diarization and recognition by sampling video at 25 fps and compressing inter-frame motion residuals into 6 tokens per frame.

OmniSelect: Dynamic Modality-Aware Token Compression for Efficient Omni-modal Large Language Models

cs.CV · 2026-05-18 · unverdicted · novelty 5.0

OmniSelect is a training-free, modality-adaptive token pruning framework that dynamically selects Audio-Centric, Video-Centric, or Uniform compression regimes using AudioCLIP cross-modal relevance scores and then applies adaptive fine-grained pruning within temporal groups.

citing papers explorer

Showing 4 of 4 citing papers.

OmniEncoder: See, Hear, and Feel Continuous Motion Like Humans With One Encoder
cs.CV · 2026-05-02 · unverdicted · none · ref 10

UAF: A Unified Audio Front-end LLM for Full-Duplex Speech Interaction
cs.AI · 2026-04-21 · unverdicted · none · ref 35

HumanOmni-Speaker: Identifying Who said What and When
cs.CV · 2026-03-23 · unverdicted · none · ref 2

OmniSelect: Dynamic Modality-Aware Token Compression for Efficient Omni-modal Large Language Models
cs.CV · 2026-05-18 · unverdicted · none · ref 27
