pith. machine review for the scientific record. sign in

arxiv: 2605.12500 · v1 · submitted 2026-05-12 · 💻 cs.CV

Recognition: 1 theorem link

· Lean Theorem

SenseNova-U1: Unifying Multimodal Understanding and Generation with NEO-unify Architecture

Bo Liu, Chengguang Lv, Dahua Lin, Haiwen Diao, HanMing Deng, Haojia Yu, Haozhe Xie, Hongli Wang, Jiahao Wang, Jianan Fan, Jiaqi Li, Jiefan Lu, Jingcheng Ni, Junxiang Xu, Kaihuan Liang, Lei Yang, Lewei Lu, Lianqiang Shi, Linjun Dai, Linyan Wang, Oscar Qian, Pengfei Liu, Peng Gao, Penghao Wu, Qingping Sun, Quan Wang, Ruihao Gong, Rui Shen, Ruisi Wang, Shengnan Ma, Shihao Bai, Shuang Yang, Silei Wu, Siying Li, Siyi Xie, Tianbo Zhong, Weichen Fan, Wenjie Ye, Wenwen Tong, Wenxiu Sun, Xiangli Kong, Xiangyu Fan, Xuanke Shi, Yang Gao, Yan Li, Yongqiang Yao, Yubo Wang, Yue Zhu, Yuwei Niu, Yves Wang, Zhengqi Bai, Zhengyu Lin, Zhijie Cao, Zhiqian Lin, Zhitao Yang, Zhongang Cai, Ziwei Liu, Zixin Yin

Pith reviewed 2026-05-13 05:06 UTC · model grok-4.3

classification 💻 cs.CV
keywords multimodal unificationvision-language modelsunderstanding and generationNEO-unify architectureSenseNova-U1any-to-image synthesisinterleaved generationnative multimodal intelligence
0
0 comments X

The pith

SenseNova-U1 treats multimodal understanding and generation as synergistic views of one underlying process using the NEO-unify architecture.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper contends that separating understanding from generation in vision-language models creates a built-in barrier to integrated intelligence rather than a fixable engineering issue. By designing from first principles around a single shared process, the SenseNova-U1 models allow both capabilities to develop together instead of in separate pipelines or misaligned spaces. The resulting 8B and 30B-scale variants match specialized understanding systems across perception, reasoning, and decision tasks while also producing coherent images, text-rich graphics, and interleaved outputs. Early tests further indicate the same models handle action and world-modeling scenarios without added components. This setup replaces translation between modalities with native cross-modal thinking.

Core claim

We introduce SenseNova-U1, a native unified multimodal paradigm built upon NEO-unify, in which understanding and generation evolve as synergistic views of a single underlying process. Two variants, SenseNova-U1-8B-MoT and SenseNova-U1-A3B-MoT, are built on dense and mixture-of-experts baselines. They match top understanding-only VLMs on text, vision-language perception, knowledge reasoning, agentic decisions, and spatial tasks. At the same time they produce strong semantic consistency and visual fidelity in any-to-image synthesis, complex infographic generation, and interleaved vision-language outputs, with or without explicit think patterns. Preliminary results also show capability in VLA (

What carries the argument

The NEO-unify architecture, which designs understanding and generation to develop as synergistic views of a single underlying process rather than distinct modules or cascaded stages.

If this is right

  • The models match top-tier understanding-only VLMs on text understanding, vision-language perception, knowledge reasoning, agentic decision-making, and spatial intelligence.
  • They achieve strong semantic consistency and visual fidelity in any-to-image synthesis, text-rich infographic generation, and interleaved vision-language generation.
  • Preliminary results extend the same models to vision-language-action and world-model scenarios without separate modules.
  • Multimodal systems can move from translating between modalities to thinking and acting across them natively.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • A working unified process would eliminate the need for separate representation alignment steps between understanding and generation pipelines.
  • If capabilities emerge internally, future scaling efforts could focus on one shared training regime instead of maintaining dual specialized systems.
  • The approach suggests testing whether the same architecture supports additional modalities or longer-horizon planning without adding explicit bridges.
  • Direct measurement of training stability and representation overlap during joint optimization would reveal whether the claimed absence of trade-offs holds in practice.

Load-bearing premise

The split between understanding and generation arises from a structural limitation in current designs rather than an unavoidable engineering necessity, and the NEO-unify architecture can integrate them without hidden costs to alignment or stability.

What would settle it

If the unified SenseNova-U1 models show clear performance drops relative to separate top understanding VLMs on standard benchmarks or to specialized generators on image fidelity and consistency metrics, the claim that unification removes structural barriers would not hold.

read the original abstract

Recent large vision-language models (VLMs) remain fundamentally constrained by a persistent dichotomy: understanding and generation are treated as distinct problems, leading to fragmented architectures, cascaded pipelines, and misaligned representation spaces. We argue that this divide is not merely an engineering artifact, but a structural limitation that hinders the emergence of native multimodal intelligence. Hence, we introduce SenseNova-U1, a native unified multimodal paradigm built upon NEO-unify, in which understanding and generation evolve as synergistic views of a single underlying process. We launch two native unified variants, SenseNova-U1-8B-MoT and SenseNova-U1-A3B-MoT, built on dense (8B) and mixture-of-experts (30B-A3B) understanding baselines, respectively. Designed from first principles, they rival top-tier understanding-only VLMs across text understanding, vision-language perception, knowledge reasoning, agentic decision-making, and spatial intelligence. Meanwhile, they deliver strong semantic consistency and visual fidelity, excelling in conventional or knowledge-intensive any-to-image (X2I) synthesis, complex text-rich infographic generation, and interleaved vision-language generation, with or without think patterns. Beyond performance, we show detailed model design, data preprocessing, pre-/post-training, and inference strategies to support community research. Last but not least, preliminary evidence demonstrates that our models extend beyond perception and generation, performing strongly in vision-language-action (VLA) and world model (WM) scenarios. This points toward a broader roadmap where models do not translate between modalities, but think and act across them in a native manner. Multimodal AI is no longer about connecting separate systems, but about building a unified one and trusting the necessary capabilities to emerge from within.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 3 minor

Summary. The paper introduces SenseNova-U1, a native unified multimodal paradigm built on the NEO-unify architecture, in which understanding and generation are treated as synergistic views of a single underlying process rather than separate problems. It presents two variants (SenseNova-U1-8B-MoT on a dense 8B baseline and SenseNova-U1-A3B-MoT on a 30B-A3B MoE baseline) that claim to match or rival top-tier understanding-only VLMs on text understanding, vision-language perception, knowledge reasoning, agentic decision-making, and spatial intelligence tasks, while also delivering strong performance on any-to-image synthesis, text-rich infographic generation, and interleaved vision-language generation. The manuscript further describes model design, data preprocessing, pre-/post-training, and inference strategies, and provides preliminary evidence of extension to vision-language-action (VLA) and world model (WM) scenarios.

Significance. If the empirical claims and architectural unification hold, the work would be significant for demonstrating that a single native architecture can eliminate the typical fragmentation between understanding and generation pipelines without incurring hidden trade-offs in representation alignment or task performance. The explicit release of design details, data strategies, and preliminary VLA/WM results would support reproducibility and further research toward models that think and act natively across modalities rather than translating between them.

major comments (2)
  1. [Abstract and §3 (Architecture)] The central unification claim—that the understanding-generation dichotomy is structural rather than an engineering choice and that NEO-unify enables synergistic emergence without trade-offs—requires explicit supporting evidence. The abstract asserts rivaling performance on understanding benchmarks while adding generation capabilities, but no quantitative comparisons, ablation tables, or alignment metrics are referenced to demonstrate that the unified model avoids the expected degradation in either task family.
  2. [§4 (Model Design) and §5 (Training)] The variants are described as built on 'dense (8B) and mixture-of-experts (30B-A3B) understanding baselines,' yet no details are given on how the NEO-unify modifications (e.g., shared representation spaces or joint training objectives) are implemented or evaluated for stability. Without these, it is impossible to assess whether the reported performance stems from the unification or from scaling the underlying baselines.
minor comments (3)
  1. [Abstract] The abstract uses the term 'MoT' without expansion; clarify whether this denotes Mixture-of-Transformers, Mixture-of-Experts with specific routing, or another mechanism.
  2. [§6 (Experiments)] Claims of 'strong semantic consistency and visual fidelity' in generation tasks would benefit from reference to specific metrics (e.g., FID, CLIPScore, or human preference scores) and comparison baselines in the results section.
  3. [§7 (Extensions)] The preliminary VLA and WM results are described only at high level; adding even summary tables or qualitative examples would strengthen the broader roadmap argument.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive review and for recognizing the potential significance of demonstrating a native unified multimodal architecture. We address the major comments point by point below, providing clarifications from the manuscript and committing to targeted revisions that strengthen the evidence for the unification claims without misrepresenting our results.

read point-by-point responses
  1. Referee: [Abstract and §3 (Architecture)] The central unification claim—that the understanding-generation dichotomy is structural rather than an engineering choice and that NEO-unify enables synergistic emergence without trade-offs—requires explicit supporting evidence. The abstract asserts rivaling performance on understanding benchmarks while adding generation capabilities, but no quantitative comparisons, ablation tables, or alignment metrics are referenced to demonstrate that the unified model avoids the expected degradation in either task family.

    Authors: We acknowledge that while the manuscript reports competitive results on understanding benchmarks (text, perception, reasoning, agentic, and spatial tasks) alongside strong generation performance (X2I, infographics, interleaved), the initial version did not include dedicated ablations or alignment metrics to isolate the absence of trade-offs. In the revised manuscript we will add a new subsection under §3 that presents: (i) side-by-side benchmark tables comparing SenseNova-U1 variants to their unmodified understanding baselines and to separately trained generation models, (ii) an ablation on the joint training objective versus sequential training, and (iii) quantitative representation alignment metrics (e.g., cross-modal cosine similarity and mutual information scores) before and after unification. These additions will directly support the synergistic-emergence claim. revision: yes

  2. Referee: [§4 (Model Design) and §5 (Training)] The variants are described as built on 'dense (8B) and mixture-of-experts (30B-A3B) understanding baselines,' yet no details are given on how the NEO-unify modifications (e.g., shared representation spaces or joint training objectives) are implemented or evaluated for stability. Without these, it is impossible to assess whether the reported performance stems from the unification or from scaling the underlying baselines.

    Authors: We agree that greater implementation transparency is required. Although §4 and §5 outline the overall design and training pipeline, they lack the granular specifications needed for full reproducibility. In the revision we will expand both sections with: (1) pseudocode and diagrams detailing the construction of the shared representation space via the MoT layers, (2) the exact formulation of the joint loss combining understanding and generation objectives, and (3) stability measures (gradient norms, learning-rate schedules, and curriculum strategies) together with ablation results that quantify the incremental contribution of each NEO-unify modification over the unmodified baselines. This will allow readers to distinguish unification effects from baseline scaling. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper introduces SenseNova-U1 as a new unified architecture (NEO-unify) and supports its claims through empirical performance on understanding and generation tasks. No equations, fitted parameters, or derivations are presented in the abstract that reduce to inputs by construction. The central argument—that the understanding-generation divide is structural and that unification enables synergistic emergence—is framed as a design premise validated by results, not a self-referential tautology or self-citation chain. Full text verification would be needed for any hidden steps, but none are detectable from the provided material.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

Only the abstract is available, so the ledger is limited to the high-level premise stated in the text.

axioms (1)
  • domain assumption The divide between understanding and generation is a structural limitation rather than an engineering artifact.
    Stated directly in the abstract as the starting argument for introducing the unified paradigm.
invented entities (1)
  • NEO-unify architecture no independent evidence
    purpose: Single underlying process that makes understanding and generation synergistic views.
    Introduced as the core technical contribution enabling native unification.

pith-pipeline@v0.9.0 · 5842 in / 1397 out tokens · 60627 ms · 2026-05-13T05:06:23.230885+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

173 extracted references · 173 canonical work pages · 40 internal anchors

  1. [1]

    Flamingo: a visual language model for few-shot learning

    Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning. Advances in Neural Information Processing Systems, 35:23716–23736, 2022

  2. [2]

    Qwen3-VL Technical Report

    Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. Qwen3-vl technical report. arXiv preprint arXiv:2511.21631, 2025

  3. [3]

    Imagen 3

    Jason Baldridge, Jakob Bauer, Mukul Bhutani, Nicole Brichtova, Andrew Bunner, Lluis Castrejon, Kelvin Chan, Yichang Chen, Sander Dieleman, Yuqing Du, et al. Imagen 3. arXiv preprint arXiv:2408.07009, 2024

  4. [4]

    Introducing our multimodal models, 2023

    Rohan Bavishi, Erich Elsen, Curtis Hawthorne, Maxwell Nye, Augustus Odena, Arushi Somani, and Sa ˘gnak Ta¸ sırlar. Introducing our multimodal models, 2023. URLhttps://www.adept.ai/blog/fuyu-8b

  5. [5]

    Improving image generation with better captions, 2023

    James Betker, Gabriel Goh, Li Jing, Tim Brooks, Jianfeng Wang, Linjie Li, Long Ouyang, Juntang Zhuang, Joyce Lee, Yufei Guo, Wesam Manassra, Prafulla Dhariwal, Casey Chu, Yunxin Jiao, and Aditya Ramesh. Improving image generation with better captions, 2023. URLhttps://cdn.openai.com/papers/dall-e-3.pdf

  6. [6]

    Seedream 4.5, 2025

    ByteDance. Seedream 4.5, 2025. URLhttps://seed.bytedance.com/en/seedream4_5

  7. [7]

    Z-Image: An Efficient Image Generation Foundation Model with Single-Stream Diffusion Transformer

    Huanqia Cai, Sihan Cao, Ruoyi Du, Peng Gao, Steven Hoi, Shijie Huang, Zhaohui Hou, Dengyang Jiang, Xin Jin, Liangchen Li, et al. Z-image: An efficient image generation foundation model with single-stream diffusion transformer. arXiv preprint arXiv:2511.22699, 2025

  8. [8]

    arXiv preprint arXiv:2505.22705 (2025)

    Qi Cai, Jingwen Chen, Yang Chen, Yehao Li, Fuchen Long, Yingwei Pan, Zhaofan Qiu, Yiheng Zhang, Fengbin Gao, Peihan Xu, et al. Hidream-i1: An open-source high-efficient image generative foundation model. arXiv preprint arXiv:2505.22705, 2025

  9. [9]

    Holistic evaluation of multimodal llms on spatial intelligence

    Zhongang Cai, Yubo Wang, Qingping Sun, Ruisi Wang, Chenyang Gu, Wanqi Yin, Zhiqian Lin, Zhitao Yang, Chen Wei, Xuanke Shi, Kewang Deng, Xiaoyang Han, Zukai Chen, Jiaqi Li, Xiangyu Fan, Hanming Deng, Lewei Lu, Bo Li, Ziwei Liu, Quan Wang, Dahua Lin, and Lei Yang. Holistic evaluation of multimodal llms on spatial intelligence. arXiv preprint arXiv:2508.13142, 2025

  10. [10]

    Scaling spatial intelligence with multimodal foundation models

    Zhongang Cai, Ruisi Wang, Chenyang Gu, Fanyi Pu, Junxiang Xu, Yubo Wang, Wanqi Yin, Zhitao Yang, Chen Wei, Qingping Sun, Tongxi Zhou, Jiaqi Li, Hui En Pang, Oscar Qian, Yukun Wei, Zhiqian Lin, Xuanke Shi, Kewang Deng, Xiaoyang Han, Zukai Chen, Xiangyu Fan, Hanming Deng, Lewei Lu, Liang Pan, Bo Li, Ziwei Liu, Quan Wang, Dahua Lin, and Lei Yang. Scaling spa...

  11. [11]

    Hunyuanimage 3.0 technical report.arXiv preprint arXiv:2509.23951, 2025

    Siyu Cao, Hangting Chen, Peng Chen, Yiji Cheng, Yutao Cui, Xinchi Deng, Ying Dong, Kipper Gong, Tianpeng Gu, Xiusen Gu, et al. Hunyuanimage 3.0 technical report. arXiv preprint arXiv:2509.23951, 2025

  12. [12]

    Oneig-bench: Omni-dimensional nuanced evaluation for image generation

    Jingjing Chang, Yixiao Fang, Peng Xing, Shuhan Wu, Wei Cheng, Rui Wang, Xianfang Zeng, Gang Yu, and Hai-Bao Chen. Oneig-bench: Omni-dimensional nuanced evaluation for image generation. arXiv preprint arXiv:2506.07977, 2025

  13. [13]

    Aligning visual foundation encoders to tokenizers for diffusion models

    Bowei Chen, Sai Bi, Hao Tan, He Zhang, Tianyuan Zhang, Zhengqi Li, Yuanjun Xiong, Jianming Zhang, and Kai Zhang. Aligning visual foundation encoders to tokenizers for diffusion models. arXiv preprint arXiv:2509.25162, 2025

  14. [14]

    BLIP3-o: A Family of Fully Open Unified Multimodal Models-Architecture, Training and Dataset

    Jiuhai Chen, Zhiyang Xu, Xichen Pan, Yushi Hu, Can Qin, Tom Goldstein, Lifu Huang, Tianyi Zhou, Saining Xie, Silvio Savarese, Le Xue, Caiming Xiong, and Ran Xu. Blip3-o: A family of fully open unified multimodal models-architecture, training and dataset. arXiv preprint arXiv:2505.09568, 2025

  15. [15]

    Francois Chollet, Mike Knoop, Gregory Kamradt, Bryan Landers, and Henry Pinkard

    Liang Chen, Weichu Xie, Yiyan Liang, Hongfeng He, Hans Zhao, Zhibo Yang, Zhiqi Huang, Haoning Wu, Haoyu Lu, Yiping Bao, et al. Babyvision: Visual reasoning beyond language. arXiv preprint arXiv:2601.06521, 2026

  16. [16]

    Are we on the right way for evaluating large vision-language models? Advances in Neural Information Processing Systems, 37:27056–27087, 2024

    Lin Chen, Jinsong Li, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Zehui Chen, Haodong Duan, Jiaqi Wang, Yu Qiao, Dahua Lin, et al. Are we on the right way for evaluating large vision-language models? Advances in Neural Information Processing Systems, 37:27056–27087, 2024

  17. [17]

    Pixelflow: Pixel-space generative models with flow

    Shoufa Chen, Chongjian Ge, Shilong Zhang, Peize Sun, and Ping Luo. Pixelflow: Pixel-space generative models with flow. arXiv preprint arXiv:2504.07963, 2025

  18. [18]

    Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling

    Xiaokang Chen, Zhiyu Wu, Xingchao Liu, Zizheng Pan, Wen Liu, Zhenda Xie, Xingkai Yu, and Chong Ruan. Janus-pro: Unified multimodal understanding and generation with data and model scaling. arXiv preprint arXiv:2501.17811, 2025. 35

  19. [19]

    A single transformer for scalable vision-language modeling.Transactions on Machine Learning Research, 2024

    Yangyi Chen, Xingyao Wang, Hao Peng, and Heng Ji. A single transformer for scalable vision-language modeling.Transactions on Machine Learning Research, 2024

  20. [20]

    Dip: Taming diffusion models in pixel space

    Zhennan Chen, Junwei Zhu, Xu Chen, Jiangning Zhang, Xiaobin Hu, Hanzhen Zhao, Chengjie Wang, Jian Yang, and Ying Tai. Dip: Taming diffusion models in pixel space. arXiv preprint arXiv:2511.18822, 2025

  21. [21]

    Anole: An open, autoregressive, native large multimodal models for interleaved image-text generation

    Ethan Chern, Jiadi Su, Yan Ma, and Pengfei Liu. Anole: An open, autoregressive, native large multimodal models for interleaved image-text generation. arXiv preprint arXiv:2407.06135, 2024

  22. [22]

    Paddleocr 3.0 technical report

    Cheng Cui, Ting Sun, Manhui Lin, Tingquan Gao, Yubo Zhang, Jiaxuan Liu, Xueqing Wang, Zelun Zhang, Changda Zhou, Hongen Liu, et al. Paddleocr 3.0 technical report. arXiv preprint arXiv:2507.05595, 2025

  23. [23]

    Yufeng Cui, Honghao Chen, Haoge Deng, Xu Huang, Xinghang Li, Jirong Liu, Yang Liu, Zhuoyan Luo, Jinsheng Wang, Wenxuan Wang, et al. Emu3. 5: Native multimodal models are world learners. arXiv preprint arXiv:2510.26583, 2025

  24. [24]

    Gemini 3 pro image model card, Nov 2025

    Google DeepMind. Gemini 3 pro image model card, Nov 2025. URL https://storage.googleapis.com/ deepmind-media/Model-Cards/Gemini-3-Pro-Image-Model-Card.pdf

  25. [25]

    Gemini 2.0 flash, 2025

    Google DeepMind. Gemini 2.0 flash, 2025. URL https://developers.googleblog.com/en/ experiment-with-gemini-20-flash-native-image-generation

  26. [26]

    Gemini 2.5 flash & gemini 2.5 flash image model card, 2025

    Google DeepMind. Gemini 2.5 flash & gemini 2.5 flash image model card, 2025. URL https://storage.googleapis. com/deepmind-media/Model-Cards/Gemini-2-5-Flash-Model-Card.pdf

  27. [27]

    Gemma 4: Byte for byte, the most capable open models, 2026

    Google DeepMind. Gemma 4: Byte for byte, the most capable open models, 2026. URL https://blog.google/ innovation-and-ai/technology/developers-tools/gemma-4/

  28. [28]

    Emerging Properties in Unified Multimodal Pretraining

    Chaorui Deng, Deyao Zhu, Kunchang Li, Chenhui Gou, Feng Li, Zeyu Wang, Shu Zhong, Weihao Yu, Xiaonan Nie, Ziang Song, Guang Shi, and Haoqi Fan. Emerging properties in unified multimodal pretraining. arXiv preprint arXiv:2505.14683, 2025

  29. [29]

    Unveiling encoder-free vision-language models

    Haiwen Diao, Yufeng Cui, Xiaotong Li, Yueze Wang, Huchuan Lu, and Xinlong Wang. Unveiling encoder-free vision-language models. Advances in Neural Information Processing Systems, 37:52545–52567, 2024

  30. [30]

    From pixels to words–towards native vision-language primitives at scale

    Haiwen Diao, Mingxuan Li, Silei Wu, Linjun Dai, Xiaohua Wang, Hanming Deng, Lewei Lu, Dahua Lin, and Ziwei Liu. From pixels to words–towards native vision-language primitives at scale. arXiv preprint arXiv:2510.14979, 2025

  31. [31]

    Evev2: Improved baselines for encoder-free vision-language models

    Haiwen Diao, Xiaotong Li, Yufeng Cui, Yueze Wang, Haoge Deng, Ting Pan, Wenxuan Wang, Huchuan Lu, and Xinlong Wang. Evev2: Improved baselines for encoder-free vision-language models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 21014–21025, 2025

  32. [32]

    arXiv preprint arXiv:2503.23461 (2025)

    Nikai Du, Zhennan Chen, Shan Gao, Zhizhou Chen, Xi Chen, Zhengkai Jiang, Jian Yang, and Ying Tai. Textcrafter: Accurately rendering multiple texts in complex visual scenes. arXiv preprint arXiv:2503.23461, 2025

  33. [33]

    Supergpqa: Scaling llm evaluation across 285 graduate disciplines, 2025

    Xinrun Du, Yifan Yao, Kaijing Ma, Bingli Wang, Tianyu Zheng, King Zhu, Minghao Liu, Yiming Liang, Xiaolong Jin, Zhenlin Wei, et al. Supergpqa: Scaling llm evaluation across 285 graduate disciplines. arXiv preprint arXiv:2502.14739, 2025

  34. [34]

    Scaling rectified flow transformers for high-resolution image synthesis

    Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow transformers for high-resolution image synthesis. In International Conference on Machine Learning, 2024

  35. [35]

    The prism hypothesis: Harmonizing semantic and pixel representations via unified autoencoding

    Weichen Fan, Haiwen Diao, Quan Wang, Dahua Lin, and Ziwei Liu. The prism hypothesis: Harmonizing semantic and pixel representations via unified autoencoding. arXiv preprint arXiv:2512.19693, 2025

  36. [36]

    Phased dmd: Few-step distribution matching distillation via score matching within subintervals

    Xiangyu Fan, Zesong Qiu, Zhuguanyu Wu, Fanzhou Wang, Zhiqian Lin, Tianxiang Ren, Dahua Lin, Ruihao Gong, and Lei Yang. Phased dmd: Few-step distribution matching distillation via score matching within subintervals. arXiv preprint arXiv:2510.27684, 2025

  37. [37]

    OCRBench v2: An improved benchmark for evaluating large multimodal models on visual text localization and reasoning.arXiv preprint arXiv:2501.00321, 2024

    Ling Fu, Zhebin Kuang, Jiajun Song, Mingxin Huang, Biao Yang, Yuzhe Li, Linghao Zhu, Qidi Luo, Xinyu Wang, Hao Lu, et al. Ocrbench v2: An improved benchmark for evaluating large multimodal models on visual text localization and reasoning. arXiv preprint arXiv:2501.00321, 2024

  38. [38]

    Seedream 3.0 Technical Report

    Peiyuan Gao, Le Zhuo, Ziyi Lin, Chris Liu, Junsong Chen, Ruoyi Du, Enze Luo, Long Zhang, Guo Chen, Shengyu Zhang, et al. Seedream 3.0 technical report. arXiv preprint arXiv:2504.11346, 2025

  39. [39]

    Making llama see and draw with seed tokenizer

    Yuying Ge, Sijie Zhao, Ziyun Zeng, Yixiao Ge, Chen Li, Xintao Wang, and Ying Shan. Making llama see and draw with seed tokenizer. arXiv preprint arXiv:2310.01218, 2023. 36

  40. [40]

    Seed-x: Multimodal models with unified multi-granularity comprehension and generation

    Yuying Ge, Sijie Zhao, Jinguo Zhu, Yixiao Ge, Kun Yi, Lin Song, Chen Li, Xiaohan Ding, and Ying Shan. Seed-x: Multimodal models with unified multi-granularity comprehension and generation. arXiv preprint arXiv:2404.14396, 2024

  41. [41]

    Aryo Pradipta Gema, Joshua Ong Jun Leang, Giwon Hong, Alessio Devoto, Alberto Carlo Maria Mancino, Rohit Saxena, Xuanli He, Yu Zhao, Xiaotang Du, Mohammad Reza Ghasemi Madani, et al. Are we done with mmlu? In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technolog...

  42. [42]

    arXiv preprint arXiv:2507.22058 (2025)

    Zigang Geng, Yibing Wang, Yeyao Ma, Chen Li, Yongming Rao, Shuyang Gu, Zhao Zhong, Qinglin Lu, Han Hu, Xiaosong Zhang, et al. X-omni: Reinforcement learning makes discrete autoregressive image generative models great again. arXiv preprint arXiv:2507.22058, 2025

  43. [43]

    Geneval: An object-focused framework for evaluating text-to-image alignment

    Dhruba Ghosh, Hannaneh Hajishirzi, and Ludwig Schmidt. Geneval: An object-focused framework for evaluating text-to-image alignment. Advances in Neural Information Processing Systems, 36:52132–52152, 2023

  44. [44]

    Past-future scheduler for llm serving under sla guarantees

    Ruihao Gong, Shihao Bai, Siyu Wu, Yunqian Fan, Zaijun Wang, Xiuhong Li, Hailong Yang, and Xianglong Liu. Past-future scheduler for llm serving under sla guarantees. In Proceedings of the 30th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, V olume2, pages 798–813, 2025. ISBN 9798400710797

  45. [45]

    Imagen 4 model card, 2025

    Google. Imagen 4 model card, 2025. URL https://storage.googleapis.com/deepmind-media/Model-Cards/ Imagen-4-Model-Card.pdf

  46. [46]

    Nano banana 2: Combining pro capabilities with lightning-fast speed, 2026

    Google. Nano banana 2: Combining pro capabilities with lightning-fast speed, 2026. URL https://blog.google/ innovation-and-ai/technology/ai/nano-banana-2/

  47. [47]

    Gemini 3 Pro Model Card, November 2025

    Google DeepMind. Gemini 3 Pro Model Card, November 2025. URL https://storage.googleapis.com/ deepmind-media/Model-Cards/Gemini-3-Pro-Model-Card.pdf

  48. [48]

    Thinkmorph: Emergent properties in multimodal interleaved chain-of-thought reasoning

    Jiawei Gu, Yunzhuo Hao, Huichen Will Wang, Linjie Li, Michael Qizhe Shieh, Yejin Choi, Ranjay Krishna, and Yu Cheng. Thinkmorph: Emergent properties in multimodal interleaved chain-of-thought reasoning. arXiv preprint arXiv:2510.27492, 2025

  49. [49]

    Hallusionbench: an advanced diagnostic suite for entangled language hallucination and visual illusion in large vision-language models

    Tianrui Guan, Fuxiao Liu, Xiyang Wu, Ruiqi Xian, Zongxia Li, Xiaoyu Liu, Xijun Wang, Lichang Chen, Furong Huang, Yaser Yacoob, et al. Hallusionbench: an advanced diagnostic suite for entangled language hallucination and visual illusion in large vision-language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pa...

  50. [50]

    Infinity: Scaling bitwise autoregressive modeling for high-resolution image synthesis

    Jian Han, Jinlai Liu, Yi Jiang, Bin Yan, Yuqi Zhang, Zehuan Yuan, Bingyue Peng, and Xiaobing Liu. Infinity: Scaling bitwise autoregressive modeling for high-resolution image synthesis. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 15733–15744, 2025

  51. [51]

    Ai2d-rst

    Tuomo Hiippala, Malihe Alikhani, Jonas Haverinen, Timo Kalliokoski, Evanfiya Logacheva, Serafina Orekhova, Aino Tuomainen, Matthew Stone, and John A Bateman. Ai2d-rst. Language Resources and Evaluation, 55(3):661–688, 2021

  52. [52]

    GLM-4.5V and GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning

    Wenyi Hong, Wenmeng Yu, Xiaotao Gu, Guo Wang, Guobing Gan, Haomiao Tang, Jiale Cheng, Ji Qi, Junhui Ji, Lihang Pan, et al. Glm-4.5 v and glm-4.1 v-thinking: Towards versatile multimodal reasoning with scalable reinforcement learning. arXiv preprint arXiv:2507.01006, 2025

  53. [53]

    ELLA: Equip Diffusion Models with LLM for Enhanced Semantic Alignment

    Xiwei Hu, Rui Wang, Yixiao Fang, Bin Fu, Pei Cheng, and Gang Yu. Ella: Equip diffusion models with llm for enhanced semantic alignment. arXiv preprint arXiv:2403.05135, 2024

  54. [54]

    Illume+: Illuminating unified mllm with dual visual tokenization and diffusion refinement

    Runhui Huang, Chunwei Wang, Junwei Yang, Guansong Lu, Yunlong Yuan, Jianhua Han, Lu Hou, Wei Zhang, Lanqing Hong, Hengshuang Zhao, et al. Illume+: Illuminating unified mllm with dual visual tokenization and diffusion refinement. arXiv preprint arXiv:2504.01934, 2025

  55. [55]

    C-eval: A multi-level multi-discipline chinese evaluation suite for foundation models

    Yuzhen Huang, Yuzhuo Bai, Zhihao Zhu, Junlei Zhang, Jinghan Zhang, Tangjun Su, Junteng Liu, Chuancheng Lv, Yikai Zhang, Yao Fu, et al. C-eval: A multi-level multi-discipline chinese evaluation suite for foundation models. Advances in Neural Information Processing Systems, 36:62991–63010, 2023

  56. [56]

    GPT-4o System Card

    Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card. arXiv preprint arXiv:2410.21276, 2024

  57. [57]

    Kimi K2: Open Agentic Intelligence

    Kimi Team. Kimi K2: Open agentic intelligence. arXiv preprint arXiv:2507.20534, 2025

  58. [58]

    Auto-Encoding Variational Bayes

    Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013

  59. [59]

    Kolors 2.0, 2025

    Kuaishou Kolors Team. Kolors 2.0, 2025. URLhttps://kolors.kuaishou.com/. 37

  60. [60]

    Flux, 2024

    Black Forest Labs. Flux, 2024. URLhttps://github.com/black-forest-labs/flux

  61. [61]

    FLUX.2: Frontier Visual Intelligence, 2025

    Black Forest Labs. FLUX.2: Frontier Visual Intelligence, 2025. URLhttps://bfl.ai/blog/flux-2

  62. [62]

    Black Forest Labs, Stephen Batifol, Andreas Blattmann, Frederic Boesel, Saksham Consul, Cyril Diagne, Tim Dockhorn, Jack English, Zion English, Patrick Esser, et al. Flux. 1 kontext: Flow matching for in-context image generation and editing in latent space. arXiv preprint arXiv:2506.15742, 2025

  63. [63]

    The scalability of simplicity: Empirical analysis of vision-language learning with a single transformer

    Weixian Lei, Jiacong Wang, Haochen Wang, Xiangtai Li, Jun Hao Liew, Jiashi Feng, and Zilong Huang. The scalability of simplicity: Empirical analysis of vision-language learning with a single transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 20758–20769, 2025

  64. [64]

    Repa-e: Unlocking vae for end-to-end tuning of latent diffusion transformers

    Xingjian Leng, Jaskirat Singh, Yunzhong Hou, Zhenchang Xing, Saining Xie, and Liang Zheng. Repa-e: Unlocking vae for end-to-end tuning of latent diffusion transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 18262–18272, 2025

  65. [65]

    Viewspatial-bench: Evaluating multi-perspective spatial localization in vision- language models.arXiv preprint arXiv:2505.21500, 2025

    Dingming Li, Hongxing Li, Zixuan Wang, Yuchen Yan, Hang Zhang, Siqi Chen, Guiyang Hou, Shengpei Jiang, Wenqi Zhang, Yongliang Shen, et al. Viewspatial-bench: Evaluating multi-perspective spatial localization in vision-language models. arXiv preprint arXiv:2505.21500, 2025

  66. [66]

    Onecat: Decoder-only auto-regressive model for unified understanding and generation

    Han Li, Xinyu Peng, Yaoming Wang, Zelin Peng, Xin Chen, Rongxiang Weng, Jingang Wang, Xunliang Cai, Wenrui Dai, and Hongkai Xiong. Onecat: Decoder-only auto-regressive model for unified understanding and generation. arXiv preprint arXiv:2509.03498, 2025

  67. [67]

    Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models

    Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In International conference on machine learning, pages 19730–19742. PMLR, 2023

  68. [68]

    arXiv preprint arXiv:2511.01833 (2025)

    Ming Li, Jike Zhong, Shitian Zhao, Haoquan Zhang, Shaoheng Lin, Yuxiang Lai, Chen Wei, Konstantinos Psounis, and Kaipeng Zhang. Tir-bench: A comprehensive benchmark for agentic thinking-with-images reasoning. arXiv preprint arXiv:2511.01833, 2025

  69. [69]

    Back to Basics: Let Denoising Generative Models Denoise

    Tianhong Li and Kaiming He. Back to basics: Let denoising generative models denoise. arXiv preprint arXiv:2511.13720, 2025

  70. [70]

    Breen: bridge data-efficient encoder-free multimodal learning with learnable queries

    Tianle Li, Yongming Rao, Winston Hu, and Yu Cheng. Breen: bridge data-efficient encoder-free multimodal learning with learnable queries. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 5384–5395, 2026

  71. [71]

    Bizgeneval: A systematic benchmark for commercial visual content generation

    Yan Li, Zezi Zeng, Ziwei Zhou, Xin Gao, Muzhao Tian, Yifan Yang, Mingxi Cheng, Qi Dai, Yuqing Yang, Lili Qiu, et al. Bizgeneval: A systematic benchmark for commercial visual content generation. arXiv preprint arXiv:2603.25732, 2026

  72. [72]

    Uniworld-v2: Reinforce image editing with diffusion negative-aware finetuning and mllm implicit feedback

    Zongjian Li, Zheyuan Liu, Qihui Zhang, Bin Lin, Feize Wu, Shenghai Yuan, Zhiyuan Yan, Yang Ye, Wangbo Yu, Yuwei Niu, et al. Uniworld-v2: Reinforce image editing with diffusion negative-aware finetuning and mllm implicit feedback. arXiv preprint arXiv:2510.16888, 2025

  73. [73]

    Mixture-of-transformers: A sparse and scalable architecture for multi-modal foundation models

    Weixin Liang, Lili Yu, Liang Luo, Srinivasan Iyer, Ning Dong, Chunting Zhou, Gargi Ghosh, Mike Lewis, Wen-tau Yih, Luke Zettlemoyer, et al. Mixture-of-transformers: A sparse and scalable architecture for multi-modal foundation models. arXiv preprint arXiv:2411.04996, 2024

  74. [74]

    Mogao: An omni foundation model for interleaved multi-modal generation.arXiv preprint arXiv:2505.05472, 2025

    Chao Liao, Liyang Liu, Xun Wang, Zhengxiong Luo, Xinyu Zhang, Wenliang Zhao, Jie Wu, Liang Li, Zhi Tian, and Weilin Huang. Mogao: An omni foundation model for interleaved multi-modal generation. arXiv preprint arXiv:2505.05472, 2025

  75. [75]

    UniWorld-V1: High-Resolution Semantic Encoders for Unified Visual Understanding and Generation

    Bin Lin, Zongjian Li, Xinhua Cheng, Yuwei Niu, Yang Ye, Xianyi He, Shenghai Yuan, Wangbo Yu, Shaodong Wang, Yunyang Ge, et al. Uniworld: High-resolution semantic encoders for unified visual understanding and generation. arXiv preprint arXiv:2506.03147, 2025

  76. [76]

    Toklip: Marry visual tokens to clip for multimodal comprehension and generation

    Haokun Lin, Teng Wang, Yixiao Ge, Yuying Ge, Zhichao Lu, Ying Wei, Qingfu Zhang, Zhenan Sun, and Ying Shan. Toklip: Marry visual tokens to clip for multimodal comprehension and generation. arXiv preprint arXiv:2505.05422, 2025

  77. [77]

    Microsoft coco: Common objects in context

    Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In European conference on computer vision, pages 740–755, 2014

  78. [78]

    Moma: Efficient early-fusion pre-training with mixture of modality-aware experts

    Xi Victoria Lin, Akshat Shrivastava, Liang Luo, Srinivasan Iyer, Mike Lewis, Gargi Ghosh, Luke Zettlemoyer, and Armen Agha- janyan. Moma: Efficient early-fusion pre-training with mixture of modality-aware experts. arXiv preprint arXiv:2407.21770, 2024. 38

  79. [79]

    Visual instruction tuning

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. Advances in Neural Information Processing Systems, 36:34892–34916, 2023

  80. [80]

    Flow-GRPO: Training Flow Matching Models via Online RL

    Jie Liu, Gongye Liu, Jiajun Liang, Yangguang Li, Jiaheng Liu, Xintao Wang, Pengfei Wan, Di Zhang, and Wanli Ouyang. Flow-grpo: Training flow matching models via online rl. arXiv preprint arXiv:2505.05470, 2025

Showing first 80 references.