arxiv: 2605.12500 · v1 · submitted 2026-05-12 · 💻 cs.CV

Recognition: 1 theorem link

· Lean Theorem

SenseNova-U1: Unifying Multimodal Understanding and Generation with NEO-unify Architecture

Bo Liu, Chengguang Lv, Dahua Lin, Haiwen Diao, HanMing Deng, Haojia Yu, Haozhe Xie, Hongli Wang, Jiahao Wang, Jianan Fan, Jiaqi Li, Jiefan Lu, Jingcheng Ni, Junxiang Xu, Kaihuan Liang, Lei Yang, Lewei Lu, Lianqiang Shi, Linjun Dai, Linyan Wang, Oscar Qian, Pengfei Liu, Peng Gao, Penghao Wu, Qingping Sun, Quan Wang, Ruihao Gong, Rui Shen, Ruisi Wang, Shengnan Ma, Shihao Bai, Shuang Yang, Silei Wu, Siying Li, Siyi Xie, Tianbo Zhong, Weichen Fan, Wenjie Ye, Wenwen Tong, Wenxiu Sun, Xiangli Kong, Xiangyu Fan, Xuanke Shi, Yang Gao, Yan Li, Yongqiang Yao, Yubo Wang, Yue Zhu, Yuwei Niu, Yves Wang, Zhengqi Bai, Zhengyu Lin, Zhijie Cao, Zhiqian Lin, Zhitao Yang, Zhongang Cai, Ziwei Liu, Zixin Yin

Pith reviewed 2026-05-13 05:06 UTC · model grok-4.3

classification 💻 cs.CV

keywords multimodal unificationvision-language modelsunderstanding and generationNEO-unify architectureSenseNova-U1any-to-image synthesisinterleaved generationnative multimodal intelligence

0 comments

The pith

SenseNova-U1 treats multimodal understanding and generation as synergistic views of one underlying process using the NEO-unify architecture.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper contends that separating understanding from generation in vision-language models creates a built-in barrier to integrated intelligence rather than a fixable engineering issue. By designing from first principles around a single shared process, the SenseNova-U1 models allow both capabilities to develop together instead of in separate pipelines or misaligned spaces. The resulting 8B and 30B-scale variants match specialized understanding systems across perception, reasoning, and decision tasks while also producing coherent images, text-rich graphics, and interleaved outputs. Early tests further indicate the same models handle action and world-modeling scenarios without added components. This setup replaces translation between modalities with native cross-modal thinking.

Core claim

We introduce SenseNova-U1, a native unified multimodal paradigm built upon NEO-unify, in which understanding and generation evolve as synergistic views of a single underlying process. Two variants, SenseNova-U1-8B-MoT and SenseNova-U1-A3B-MoT, are built on dense and mixture-of-experts baselines. They match top understanding-only VLMs on text, vision-language perception, knowledge reasoning, agentic decisions, and spatial tasks. At the same time they produce strong semantic consistency and visual fidelity in any-to-image synthesis, complex infographic generation, and interleaved vision-language outputs, with or without explicit think patterns. Preliminary results also show capability in VLA (

What carries the argument

The NEO-unify architecture, which designs understanding and generation to develop as synergistic views of a single underlying process rather than distinct modules or cascaded stages.

If this is right

The models match top-tier understanding-only VLMs on text understanding, vision-language perception, knowledge reasoning, agentic decision-making, and spatial intelligence.
They achieve strong semantic consistency and visual fidelity in any-to-image synthesis, text-rich infographic generation, and interleaved vision-language generation.
Preliminary results extend the same models to vision-language-action and world-model scenarios without separate modules.
Multimodal systems can move from translating between modalities to thinking and acting across them natively.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

A working unified process would eliminate the need for separate representation alignment steps between understanding and generation pipelines.
If capabilities emerge internally, future scaling efforts could focus on one shared training regime instead of maintaining dual specialized systems.
The approach suggests testing whether the same architecture supports additional modalities or longer-horizon planning without adding explicit bridges.
Direct measurement of training stability and representation overlap during joint optimization would reveal whether the claimed absence of trade-offs holds in practice.

Load-bearing premise

The split between understanding and generation arises from a structural limitation in current designs rather than an unavoidable engineering necessity, and the NEO-unify architecture can integrate them without hidden costs to alignment or stability.

What would settle it

If the unified SenseNova-U1 models show clear performance drops relative to separate top understanding VLMs on standard benchmarks or to specialized generators on image fidelity and consistency metrics, the claim that unification removes structural barriers would not hold.

read the original abstract

Recent large vision-language models (VLMs) remain fundamentally constrained by a persistent dichotomy: understanding and generation are treated as distinct problems, leading to fragmented architectures, cascaded pipelines, and misaligned representation spaces. We argue that this divide is not merely an engineering artifact, but a structural limitation that hinders the emergence of native multimodal intelligence. Hence, we introduce SenseNova-U1, a native unified multimodal paradigm built upon NEO-unify, in which understanding and generation evolve as synergistic views of a single underlying process. We launch two native unified variants, SenseNova-U1-8B-MoT and SenseNova-U1-A3B-MoT, built on dense (8B) and mixture-of-experts (30B-A3B) understanding baselines, respectively. Designed from first principles, they rival top-tier understanding-only VLMs across text understanding, vision-language perception, knowledge reasoning, agentic decision-making, and spatial intelligence. Meanwhile, they deliver strong semantic consistency and visual fidelity, excelling in conventional or knowledge-intensive any-to-image (X2I) synthesis, complex text-rich infographic generation, and interleaved vision-language generation, with or without think patterns. Beyond performance, we show detailed model design, data preprocessing, pre-/post-training, and inference strategies to support community research. Last but not least, preliminary evidence demonstrates that our models extend beyond perception and generation, performing strongly in vision-language-action (VLA) and world model (WM) scenarios. This points toward a broader roadmap where models do not translate between modalities, but think and act across them in a native manner. Multimodal AI is no longer about connecting separate systems, but about building a unified one and trusting the necessary capabilities to emerge from within.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

SenseNova-U1 releases two concrete models under a new NEO-unify architecture that aims to treat understanding and generation as one process, but the supporting results stay high-level.

read the letter

The main thing here is that the authors built and open two models, an 8B dense version and a 30B-A3B MoE one, on top of their NEO-unify design. They argue the usual split between understanding and generation is structural, not just practical, and claim the models match strong understanding-only VLMs while also handling semantic image synthesis, text-rich infographics, and interleaved outputs. Early signs in vision-language-action and world-model tasks are mentioned as well. Releasing the models plus the design, data, training, and inference details is the clearest positive step; that kind of transparency gives others something to test or extend directly. The framing that capabilities can emerge synergistically from a single process is stated plainly and kept consistent with the unification goal. The softer spots are the missing specifics. Performance claims rest on assertions rather than visible numbers, ablations, or error bars in the summary, so it is still unclear how much unification actually costs in alignment or stability. The architecture is new in its named form and scaling choices, yet the high-level goal of unified VLMs has been explored before, so the advance is more in implementation than in first-principles novelty. The assumption that the dichotomy is structural will need tighter comparative evidence to stand. This work is for researchers and labs already building or evaluating next-generation VLMs who want a practical unification example with released checkpoints. It deserves a serious referee because the topic is central and the models provide something concrete to review, even if the empirical sections will need expansion and more rigorous validation.

Referee Report

2 major / 3 minor

Summary. The paper introduces SenseNova-U1, a native unified multimodal paradigm built on the NEO-unify architecture, in which understanding and generation are treated as synergistic views of a single underlying process rather than separate problems. It presents two variants (SenseNova-U1-8B-MoT on a dense 8B baseline and SenseNova-U1-A3B-MoT on a 30B-A3B MoE baseline) that claim to match or rival top-tier understanding-only VLMs on text understanding, vision-language perception, knowledge reasoning, agentic decision-making, and spatial intelligence tasks, while also delivering strong performance on any-to-image synthesis, text-rich infographic generation, and interleaved vision-language generation. The manuscript further describes model design, data preprocessing, pre-/post-training, and inference strategies, and provides preliminary evidence of extension to vision-language-action (VLA) and world model (WM) scenarios.

Significance. If the empirical claims and architectural unification hold, the work would be significant for demonstrating that a single native architecture can eliminate the typical fragmentation between understanding and generation pipelines without incurring hidden trade-offs in representation alignment or task performance. The explicit release of design details, data strategies, and preliminary VLA/WM results would support reproducibility and further research toward models that think and act natively across modalities rather than translating between them.

major comments (2)

[Abstract and §3 (Architecture)] The central unification claim—that the understanding-generation dichotomy is structural rather than an engineering choice and that NEO-unify enables synergistic emergence without trade-offs—requires explicit supporting evidence. The abstract asserts rivaling performance on understanding benchmarks while adding generation capabilities, but no quantitative comparisons, ablation tables, or alignment metrics are referenced to demonstrate that the unified model avoids the expected degradation in either task family.
[§4 (Model Design) and §5 (Training)] The variants are described as built on 'dense (8B) and mixture-of-experts (30B-A3B) understanding baselines,' yet no details are given on how the NEO-unify modifications (e.g., shared representation spaces or joint training objectives) are implemented or evaluated for stability. Without these, it is impossible to assess whether the reported performance stems from the unification or from scaling the underlying baselines.

minor comments (3)

[Abstract] The abstract uses the term 'MoT' without expansion; clarify whether this denotes Mixture-of-Transformers, Mixture-of-Experts with specific routing, or another mechanism.
[§6 (Experiments)] Claims of 'strong semantic consistency and visual fidelity' in generation tasks would benefit from reference to specific metrics (e.g., FID, CLIPScore, or human preference scores) and comparison baselines in the results section.
[§7 (Extensions)] The preliminary VLA and WM results are described only at high level; adding even summary tables or qualitative examples would strengthen the broader roadmap argument.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive review and for recognizing the potential significance of demonstrating a native unified multimodal architecture. We address the major comments point by point below, providing clarifications from the manuscript and committing to targeted revisions that strengthen the evidence for the unification claims without misrepresenting our results.

read point-by-point responses

Referee: [Abstract and §3 (Architecture)] The central unification claim—that the understanding-generation dichotomy is structural rather than an engineering choice and that NEO-unify enables synergistic emergence without trade-offs—requires explicit supporting evidence. The abstract asserts rivaling performance on understanding benchmarks while adding generation capabilities, but no quantitative comparisons, ablation tables, or alignment metrics are referenced to demonstrate that the unified model avoids the expected degradation in either task family.

Authors: We acknowledge that while the manuscript reports competitive results on understanding benchmarks (text, perception, reasoning, agentic, and spatial tasks) alongside strong generation performance (X2I, infographics, interleaved), the initial version did not include dedicated ablations or alignment metrics to isolate the absence of trade-offs. In the revised manuscript we will add a new subsection under §3 that presents: (i) side-by-side benchmark tables comparing SenseNova-U1 variants to their unmodified understanding baselines and to separately trained generation models, (ii) an ablation on the joint training objective versus sequential training, and (iii) quantitative representation alignment metrics (e.g., cross-modal cosine similarity and mutual information scores) before and after unification. These additions will directly support the synergistic-emergence claim. revision: yes
Referee: [§4 (Model Design) and §5 (Training)] The variants are described as built on 'dense (8B) and mixture-of-experts (30B-A3B) understanding baselines,' yet no details are given on how the NEO-unify modifications (e.g., shared representation spaces or joint training objectives) are implemented or evaluated for stability. Without these, it is impossible to assess whether the reported performance stems from the unification or from scaling the underlying baselines.

Authors: We agree that greater implementation transparency is required. Although §4 and §5 outline the overall design and training pipeline, they lack the granular specifications needed for full reproducibility. In the revision we will expand both sections with: (1) pseudocode and diagrams detailing the construction of the shared representation space via the MoT layers, (2) the exact formulation of the joint loss combining understanding and generation objectives, and (3) stability measures (gradient norms, learning-rate schedules, and curriculum strategies) together with ablation results that quantify the incremental contribution of each NEO-unify modification over the unmodified baselines. This will allow readers to distinguish unification effects from baseline scaling. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper introduces SenseNova-U1 as a new unified architecture (NEO-unify) and supports its claims through empirical performance on understanding and generation tasks. No equations, fitted parameters, or derivations are presented in the abstract that reduce to inputs by construction. The central argument—that the understanding-generation divide is structural and that unification enables synergistic emergence—is framed as a design premise validated by results, not a self-referential tautology or self-citation chain. Full text verification would be needed for any hidden steps, but none are detectable from the provided material.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

Only the abstract is available, so the ledger is limited to the high-level premise stated in the text.

axioms (1)

domain assumption The divide between understanding and generation is a structural limitation rather than an engineering artifact.
Stated directly in the abstract as the starting argument for introducing the unified paradigm.

invented entities (1)

NEO-unify architecture no independent evidence
purpose: Single underlying process that makes understanding and generation synergistic views.
Introduced as the core technical contribution enabling native unification.

pith-pipeline@v0.9.0 · 5842 in / 1397 out tokens · 60627 ms · 2026-05-13T05:06:23.230885+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

173 extracted references · 173 canonical work pages · 40 internal anchors

[1]

Flamingo: a visual language model for few-shot learning

Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning. Advances in Neural Information Processing Systems, 35:23716–23736, 2022

work page 2022
[2]

Qwen3-VL Technical Report

Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. Qwen3-vl technical report. arXiv preprint arXiv:2511.21631, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[3]

Imagen 3

Jason Baldridge, Jakob Bauer, Mukul Bhutani, Nicole Brichtova, Andrew Bunner, Lluis Castrejon, Kelvin Chan, Yichang Chen, Sander Dieleman, Yuqing Du, et al. Imagen 3. arXiv preprint arXiv:2408.07009, 2024

work page arXiv 2024
[4]

Introducing our multimodal models, 2023

Rohan Bavishi, Erich Elsen, Curtis Hawthorne, Maxwell Nye, Augustus Odena, Arushi Somani, and Sa ˘gnak Ta¸ sırlar. Introducing our multimodal models, 2023. URLhttps://www.adept.ai/blog/fuyu-8b

work page 2023
[5]

Improving image generation with better captions, 2023

James Betker, Gabriel Goh, Li Jing, Tim Brooks, Jianfeng Wang, Linjie Li, Long Ouyang, Juntang Zhuang, Joyce Lee, Yufei Guo, Wesam Manassra, Prafulla Dhariwal, Casey Chu, Yunxin Jiao, and Aditya Ramesh. Improving image generation with better captions, 2023. URLhttps://cdn.openai.com/papers/dall-e-3.pdf

work page 2023
[6]

Seedream 4.5, 2025

ByteDance. Seedream 4.5, 2025. URLhttps://seed.bytedance.com/en/seedream4_5

work page 2025
[7]

Z-Image: An Efficient Image Generation Foundation Model with Single-Stream Diffusion Transformer

Huanqia Cai, Sihan Cao, Ruoyi Du, Peng Gao, Steven Hoi, Shijie Huang, Zhaohui Hou, Dengyang Jiang, Xin Jin, Liangchen Li, et al. Z-image: An efficient image generation foundation model with single-stream diffusion transformer. arXiv preprint arXiv:2511.22699, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[8]

arXiv preprint arXiv:2505.22705 (2025)

Qi Cai, Jingwen Chen, Yang Chen, Yehao Li, Fuchen Long, Yingwei Pan, Zhaofan Qiu, Yiheng Zhang, Fengbin Gao, Peihan Xu, et al. Hidream-i1: An open-source high-efficient image generative foundation model. arXiv preprint arXiv:2505.22705, 2025

work page arXiv 2025
[9]

Holistic evaluation of multimodal llms on spatial intelligence

Zhongang Cai, Yubo Wang, Qingping Sun, Ruisi Wang, Chenyang Gu, Wanqi Yin, Zhiqian Lin, Zhitao Yang, Chen Wei, Xuanke Shi, Kewang Deng, Xiaoyang Han, Zukai Chen, Jiaqi Li, Xiangyu Fan, Hanming Deng, Lewei Lu, Bo Li, Ziwei Liu, Quan Wang, Dahua Lin, and Lei Yang. Holistic evaluation of multimodal llms on spatial intelligence. arXiv preprint arXiv:2508.13142, 2025

work page arXiv 2025
[10]

Scaling spatial intelligence with multimodal foundation models

Zhongang Cai, Ruisi Wang, Chenyang Gu, Fanyi Pu, Junxiang Xu, Yubo Wang, Wanqi Yin, Zhitao Yang, Chen Wei, Qingping Sun, Tongxi Zhou, Jiaqi Li, Hui En Pang, Oscar Qian, Yukun Wei, Zhiqian Lin, Xuanke Shi, Kewang Deng, Xiaoyang Han, Zukai Chen, Xiangyu Fan, Hanming Deng, Lewei Lu, Liang Pan, Bo Li, Ziwei Liu, Quan Wang, Dahua Lin, and Lei Yang. Scaling spa...

work page 2026
[11]

Hunyuanimage 3.0 technical report.arXiv preprint arXiv:2509.23951, 2025

Siyu Cao, Hangting Chen, Peng Chen, Yiji Cheng, Yutao Cui, Xinchi Deng, Ying Dong, Kipper Gong, Tianpeng Gu, Xiusen Gu, et al. Hunyuanimage 3.0 technical report. arXiv preprint arXiv:2509.23951, 2025

work page arXiv 2025
[12]

Oneig-bench: Omni-dimensional nuanced evaluation for image generation

Jingjing Chang, Yixiao Fang, Peng Xing, Shuhan Wu, Wei Cheng, Rui Wang, Xianfang Zeng, Gang Yu, and Hai-Bao Chen. Oneig-bench: Omni-dimensional nuanced evaluation for image generation. arXiv preprint arXiv:2506.07977, 2025

work page arXiv 2025
[13]

Aligning visual foundation encoders to tokenizers for diffusion models

Bowei Chen, Sai Bi, Hao Tan, He Zhang, Tianyuan Zhang, Zhengqi Li, Yuanjun Xiong, Jianming Zhang, and Kai Zhang. Aligning visual foundation encoders to tokenizers for diffusion models. arXiv preprint arXiv:2509.25162, 2025

work page arXiv 2025
[14]

BLIP3-o: A Family of Fully Open Unified Multimodal Models-Architecture, Training and Dataset

Jiuhai Chen, Zhiyang Xu, Xichen Pan, Yushi Hu, Can Qin, Tom Goldstein, Lifu Huang, Tianyi Zhou, Saining Xie, Silvio Savarese, Le Xue, Caiming Xiong, and Ran Xu. Blip3-o: A family of fully open unified multimodal models-architecture, training and dataset. arXiv preprint arXiv:2505.09568, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[15]

Francois Chollet, Mike Knoop, Gregory Kamradt, Bryan Landers, and Henry Pinkard

Liang Chen, Weichu Xie, Yiyan Liang, Hongfeng He, Hans Zhao, Zhibo Yang, Zhiqi Huang, Haoning Wu, Haoyu Lu, Yiping Bao, et al. Babyvision: Visual reasoning beyond language. arXiv preprint arXiv:2601.06521, 2026

work page arXiv 2026
[16]

Are we on the right way for evaluating large vision-language models? Advances in Neural Information Processing Systems, 37:27056–27087, 2024

Lin Chen, Jinsong Li, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Zehui Chen, Haodong Duan, Jiaqi Wang, Yu Qiao, Dahua Lin, et al. Are we on the right way for evaluating large vision-language models? Advances in Neural Information Processing Systems, 37:27056–27087, 2024

work page 2024
[17]

Pixelflow: Pixel-space generative models with flow

Shoufa Chen, Chongjian Ge, Shilong Zhang, Peize Sun, and Ping Luo. Pixelflow: Pixel-space generative models with flow. arXiv preprint arXiv:2504.07963, 2025

work page arXiv 2025
[18]

Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling

Xiaokang Chen, Zhiyu Wu, Xingchao Liu, Zizheng Pan, Wen Liu, Zhenda Xie, Xingkai Yu, and Chong Ruan. Janus-pro: Unified multimodal understanding and generation with data and model scaling. arXiv preprint arXiv:2501.17811, 2025. 35

work page internal anchor Pith review Pith/arXiv arXiv 2025
[19]

A single transformer for scalable vision-language modeling.Transactions on Machine Learning Research, 2024

Yangyi Chen, Xingyao Wang, Hao Peng, and Heng Ji. A single transformer for scalable vision-language modeling.Transactions on Machine Learning Research, 2024

work page 2024
[20]

Dip: Taming diffusion models in pixel space

Zhennan Chen, Junwei Zhu, Xu Chen, Jiangning Zhang, Xiaobin Hu, Hanzhen Zhao, Chengjie Wang, Jian Yang, and Ying Tai. Dip: Taming diffusion models in pixel space. arXiv preprint arXiv:2511.18822, 2025

work page arXiv 2025
[21]

Anole: An open, autoregressive, native large multimodal models for interleaved image-text generation

Ethan Chern, Jiadi Su, Yan Ma, and Pengfei Liu. Anole: An open, autoregressive, native large multimodal models for interleaved image-text generation. arXiv preprint arXiv:2407.06135, 2024

work page arXiv 2024
[22]

Paddleocr 3.0 technical report

Cheng Cui, Ting Sun, Manhui Lin, Tingquan Gao, Yubo Zhang, Jiaxuan Liu, Xueqing Wang, Zelun Zhang, Changda Zhou, Hongen Liu, et al. Paddleocr 3.0 technical report. arXiv preprint arXiv:2507.05595, 2025

work page arXiv 2025
[23]

Yufeng Cui, Honghao Chen, Haoge Deng, Xu Huang, Xinghang Li, Jirong Liu, Yang Liu, Zhuoyan Luo, Jinsheng Wang, Wenxuan Wang, et al. Emu3. 5: Native multimodal models are world learners. arXiv preprint arXiv:2510.26583, 2025

work page arXiv 2025
[24]

Gemini 3 pro image model card, Nov 2025

Google DeepMind. Gemini 3 pro image model card, Nov 2025. URL https://storage.googleapis.com/ deepmind-media/Model-Cards/Gemini-3-Pro-Image-Model-Card.pdf

work page 2025
[25]

Gemini 2.0 flash, 2025

Google DeepMind. Gemini 2.0 flash, 2025. URL https://developers.googleblog.com/en/ experiment-with-gemini-20-flash-native-image-generation

work page 2025
[26]

Gemini 2.5 flash & gemini 2.5 flash image model card, 2025

Google DeepMind. Gemini 2.5 flash & gemini 2.5 flash image model card, 2025. URL https://storage.googleapis. com/deepmind-media/Model-Cards/Gemini-2-5-Flash-Model-Card.pdf

work page 2025
[27]

Gemma 4: Byte for byte, the most capable open models, 2026

Google DeepMind. Gemma 4: Byte for byte, the most capable open models, 2026. URL https://blog.google/ innovation-and-ai/technology/developers-tools/gemma-4/

work page 2026
[28]

Emerging Properties in Unified Multimodal Pretraining

Chaorui Deng, Deyao Zhu, Kunchang Li, Chenhui Gou, Feng Li, Zeyu Wang, Shu Zhong, Weihao Yu, Xiaonan Nie, Ziang Song, Guang Shi, and Haoqi Fan. Emerging properties in unified multimodal pretraining. arXiv preprint arXiv:2505.14683, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[29]

Unveiling encoder-free vision-language models

Haiwen Diao, Yufeng Cui, Xiaotong Li, Yueze Wang, Huchuan Lu, and Xinlong Wang. Unveiling encoder-free vision-language models. Advances in Neural Information Processing Systems, 37:52545–52567, 2024

work page 2024
[30]

From pixels to words–towards native vision-language primitives at scale

Haiwen Diao, Mingxuan Li, Silei Wu, Linjun Dai, Xiaohua Wang, Hanming Deng, Lewei Lu, Dahua Lin, and Ziwei Liu. From pixels to words–towards native vision-language primitives at scale. arXiv preprint arXiv:2510.14979, 2025

work page arXiv 2025
[31]

Evev2: Improved baselines for encoder-free vision-language models

Haiwen Diao, Xiaotong Li, Yufeng Cui, Yueze Wang, Haoge Deng, Ting Pan, Wenxuan Wang, Huchuan Lu, and Xinlong Wang. Evev2: Improved baselines for encoder-free vision-language models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 21014–21025, 2025

work page 2025
[32]

arXiv preprint arXiv:2503.23461 (2025)

Nikai Du, Zhennan Chen, Shan Gao, Zhizhou Chen, Xi Chen, Zhengkai Jiang, Jian Yang, and Ying Tai. Textcrafter: Accurately rendering multiple texts in complex visual scenes. arXiv preprint arXiv:2503.23461, 2025

work page arXiv 2025
[33]

Supergpqa: Scaling llm evaluation across 285 graduate disciplines, 2025

Xinrun Du, Yifan Yao, Kaijing Ma, Bingli Wang, Tianyu Zheng, King Zhu, Minghao Liu, Yiming Liang, Xiaolong Jin, Zhenlin Wei, et al. Supergpqa: Scaling llm evaluation across 285 graduate disciplines. arXiv preprint arXiv:2502.14739, 2025

work page arXiv 2025
[34]

Scaling rectified flow transformers for high-resolution image synthesis

Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow transformers for high-resolution image synthesis. In International Conference on Machine Learning, 2024

work page 2024
[35]

The prism hypothesis: Harmonizing semantic and pixel representations via unified autoencoding

Weichen Fan, Haiwen Diao, Quan Wang, Dahua Lin, and Ziwei Liu. The prism hypothesis: Harmonizing semantic and pixel representations via unified autoencoding. arXiv preprint arXiv:2512.19693, 2025

work page arXiv 2025
[36]

Phased dmd: Few-step distribution matching distillation via score matching within subintervals

Xiangyu Fan, Zesong Qiu, Zhuguanyu Wu, Fanzhou Wang, Zhiqian Lin, Tianxiang Ren, Dahua Lin, Ruihao Gong, and Lei Yang. Phased dmd: Few-step distribution matching distillation via score matching within subintervals. arXiv preprint arXiv:2510.27684, 2025

work page arXiv 2025
[37]

OCRBench v2: An improved benchmark for evaluating large multimodal models on visual text localization and reasoning.arXiv preprint arXiv:2501.00321, 2024

Ling Fu, Zhebin Kuang, Jiajun Song, Mingxin Huang, Biao Yang, Yuzhe Li, Linghao Zhu, Qidi Luo, Xinyu Wang, Hao Lu, et al. Ocrbench v2: An improved benchmark for evaluating large multimodal models on visual text localization and reasoning. arXiv preprint arXiv:2501.00321, 2024

work page arXiv 2024
[38]

Seedream 3.0 Technical Report

Peiyuan Gao, Le Zhuo, Ziyi Lin, Chris Liu, Junsong Chen, Ruoyi Du, Enze Luo, Long Zhang, Guo Chen, Shengyu Zhang, et al. Seedream 3.0 technical report. arXiv preprint arXiv:2504.11346, 2025

work page internal anchor Pith review arXiv 2025
[39]

Making llama see and draw with seed tokenizer

Yuying Ge, Sijie Zhao, Ziyun Zeng, Yixiao Ge, Chen Li, Xintao Wang, and Ying Shan. Making llama see and draw with seed tokenizer. arXiv preprint arXiv:2310.01218, 2023. 36

work page arXiv 2023
[40]

Seed-x: Multimodal models with unified multi-granularity comprehension and generation

Yuying Ge, Sijie Zhao, Jinguo Zhu, Yixiao Ge, Kun Yi, Lin Song, Chen Li, Xiaohan Ding, and Ying Shan. Seed-x: Multimodal models with unified multi-granularity comprehension and generation. arXiv preprint arXiv:2404.14396, 2024

work page arXiv 2024
[41]

Aryo Pradipta Gema, Joshua Ong Jun Leang, Giwon Hong, Alessio Devoto, Alberto Carlo Maria Mancino, Rohit Saxena, Xuanli He, Yu Zhao, Xiaotang Du, Mohammad Reza Ghasemi Madani, et al. Are we done with mmlu? In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technolog...

work page 2025
[42]

arXiv preprint arXiv:2507.22058 (2025)

Zigang Geng, Yibing Wang, Yeyao Ma, Chen Li, Yongming Rao, Shuyang Gu, Zhao Zhong, Qinglin Lu, Han Hu, Xiaosong Zhang, et al. X-omni: Reinforcement learning makes discrete autoregressive image generative models great again. arXiv preprint arXiv:2507.22058, 2025

work page arXiv 2025
[43]

Geneval: An object-focused framework for evaluating text-to-image alignment

Dhruba Ghosh, Hannaneh Hajishirzi, and Ludwig Schmidt. Geneval: An object-focused framework for evaluating text-to-image alignment. Advances in Neural Information Processing Systems, 36:52132–52152, 2023

work page 2023
[44]

Past-future scheduler for llm serving under sla guarantees

Ruihao Gong, Shihao Bai, Siyu Wu, Yunqian Fan, Zaijun Wang, Xiuhong Li, Hailong Yang, and Xianglong Liu. Past-future scheduler for llm serving under sla guarantees. In Proceedings of the 30th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, V olume2, pages 798–813, 2025. ISBN 9798400710797

work page 2025
[45]

Imagen 4 model card, 2025

Google. Imagen 4 model card, 2025. URL https://storage.googleapis.com/deepmind-media/Model-Cards/ Imagen-4-Model-Card.pdf

work page 2025
[46]

Nano banana 2: Combining pro capabilities with lightning-fast speed, 2026

Google. Nano banana 2: Combining pro capabilities with lightning-fast speed, 2026. URL https://blog.google/ innovation-and-ai/technology/ai/nano-banana-2/

work page 2026
[47]

Gemini 3 Pro Model Card, November 2025

Google DeepMind. Gemini 3 Pro Model Card, November 2025. URL https://storage.googleapis.com/ deepmind-media/Model-Cards/Gemini-3-Pro-Model-Card.pdf

work page 2025
[48]

Thinkmorph: Emergent properties in multimodal interleaved chain-of-thought reasoning

Jiawei Gu, Yunzhuo Hao, Huichen Will Wang, Linjie Li, Michael Qizhe Shieh, Yejin Choi, Ranjay Krishna, and Yu Cheng. Thinkmorph: Emergent properties in multimodal interleaved chain-of-thought reasoning. arXiv preprint arXiv:2510.27492, 2025

work page arXiv 2025
[49]

Hallusionbench: an advanced diagnostic suite for entangled language hallucination and visual illusion in large vision-language models

Tianrui Guan, Fuxiao Liu, Xiyang Wu, Ruiqi Xian, Zongxia Li, Xiaoyu Liu, Xijun Wang, Lichang Chen, Furong Huang, Yaser Yacoob, et al. Hallusionbench: an advanced diagnostic suite for entangled language hallucination and visual illusion in large vision-language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pa...

work page 2024
[50]

Infinity: Scaling bitwise autoregressive modeling for high-resolution image synthesis

Jian Han, Jinlai Liu, Yi Jiang, Bin Yan, Yuqi Zhang, Zehuan Yuan, Bingyue Peng, and Xiaobing Liu. Infinity: Scaling bitwise autoregressive modeling for high-resolution image synthesis. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 15733–15744, 2025

work page 2025
[51]

Ai2d-rst

Tuomo Hiippala, Malihe Alikhani, Jonas Haverinen, Timo Kalliokoski, Evanfiya Logacheva, Serafina Orekhova, Aino Tuomainen, Matthew Stone, and John A Bateman. Ai2d-rst. Language Resources and Evaluation, 55(3):661–688, 2021

work page 2021
[52]

GLM-4.5V and GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning

Wenyi Hong, Wenmeng Yu, Xiaotao Gu, Guo Wang, Guobing Gan, Haomiao Tang, Jiale Cheng, Ji Qi, Junhui Ji, Lihang Pan, et al. Glm-4.5 v and glm-4.1 v-thinking: Towards versatile multimodal reasoning with scalable reinforcement learning. arXiv preprint arXiv:2507.01006, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[53]

ELLA: Equip Diffusion Models with LLM for Enhanced Semantic Alignment

Xiwei Hu, Rui Wang, Yixiao Fang, Bin Fu, Pei Cheng, and Gang Yu. Ella: Equip diffusion models with llm for enhanced semantic alignment. arXiv preprint arXiv:2403.05135, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[54]

Illume+: Illuminating unified mllm with dual visual tokenization and diffusion refinement

Runhui Huang, Chunwei Wang, Junwei Yang, Guansong Lu, Yunlong Yuan, Jianhua Han, Lu Hou, Wei Zhang, Lanqing Hong, Hengshuang Zhao, et al. Illume+: Illuminating unified mllm with dual visual tokenization and diffusion refinement. arXiv preprint arXiv:2504.01934, 2025

work page arXiv 2025
[55]

C-eval: A multi-level multi-discipline chinese evaluation suite for foundation models

Yuzhen Huang, Yuzhuo Bai, Zhihao Zhu, Junlei Zhang, Jinghan Zhang, Tangjun Su, Junteng Liu, Chuancheng Lv, Yikai Zhang, Yao Fu, et al. C-eval: A multi-level multi-discipline chinese evaluation suite for foundation models. Advances in Neural Information Processing Systems, 36:62991–63010, 2023

work page 2023
[56]

GPT-4o System Card

Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card. arXiv preprint arXiv:2410.21276, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[57]

Kimi K2: Open Agentic Intelligence

Kimi Team. Kimi K2: Open agentic intelligence. arXiv preprint arXiv:2507.20534, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[58]

Auto-Encoding Variational Bayes

Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013

work page internal anchor Pith review Pith/arXiv arXiv 2013
[59]

Kolors 2.0, 2025

Kuaishou Kolors Team. Kolors 2.0, 2025. URLhttps://kolors.kuaishou.com/. 37

work page 2025
[60]

Flux, 2024

Black Forest Labs. Flux, 2024. URLhttps://github.com/black-forest-labs/flux

work page 2024
[61]

FLUX.2: Frontier Visual Intelligence, 2025

Black Forest Labs. FLUX.2: Frontier Visual Intelligence, 2025. URLhttps://bfl.ai/blog/flux-2

work page 2025
[62]

Black Forest Labs, Stephen Batifol, Andreas Blattmann, Frederic Boesel, Saksham Consul, Cyril Diagne, Tim Dockhorn, Jack English, Zion English, Patrick Esser, et al. Flux. 1 kontext: Flow matching for in-context image generation and editing in latent space. arXiv preprint arXiv:2506.15742, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[63]

The scalability of simplicity: Empirical analysis of vision-language learning with a single transformer

Weixian Lei, Jiacong Wang, Haochen Wang, Xiangtai Li, Jun Hao Liew, Jiashi Feng, and Zilong Huang. The scalability of simplicity: Empirical analysis of vision-language learning with a single transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 20758–20769, 2025

work page 2025
[64]

Repa-e: Unlocking vae for end-to-end tuning of latent diffusion transformers

Xingjian Leng, Jaskirat Singh, Yunzhong Hou, Zhenchang Xing, Saining Xie, and Liang Zheng. Repa-e: Unlocking vae for end-to-end tuning of latent diffusion transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 18262–18272, 2025

work page 2025
[65]

Viewspatial-bench: Evaluating multi-perspective spatial localization in vision- language models.arXiv preprint arXiv:2505.21500, 2025

Dingming Li, Hongxing Li, Zixuan Wang, Yuchen Yan, Hang Zhang, Siqi Chen, Guiyang Hou, Shengpei Jiang, Wenqi Zhang, Yongliang Shen, et al. Viewspatial-bench: Evaluating multi-perspective spatial localization in vision-language models. arXiv preprint arXiv:2505.21500, 2025

work page arXiv 2025
[66]

Onecat: Decoder-only auto-regressive model for unified understanding and generation

Han Li, Xinyu Peng, Yaoming Wang, Zelin Peng, Xin Chen, Rongxiang Weng, Jingang Wang, Xunliang Cai, Wenrui Dai, and Hongkai Xiong. Onecat: Decoder-only auto-regressive model for unified understanding and generation. arXiv preprint arXiv:2509.03498, 2025

work page arXiv 2025
[67]

Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models

Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In International conference on machine learning, pages 19730–19742. PMLR, 2023

work page 2023
[68]

arXiv preprint arXiv:2511.01833 (2025)

Ming Li, Jike Zhong, Shitian Zhao, Haoquan Zhang, Shaoheng Lin, Yuxiang Lai, Chen Wei, Konstantinos Psounis, and Kaipeng Zhang. Tir-bench: A comprehensive benchmark for agentic thinking-with-images reasoning. arXiv preprint arXiv:2511.01833, 2025

work page arXiv 2025
[69]

Back to Basics: Let Denoising Generative Models Denoise

Tianhong Li and Kaiming He. Back to basics: Let denoising generative models denoise. arXiv preprint arXiv:2511.13720, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[70]

Breen: bridge data-efficient encoder-free multimodal learning with learnable queries

Tianle Li, Yongming Rao, Winston Hu, and Yu Cheng. Breen: bridge data-efficient encoder-free multimodal learning with learnable queries. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 5384–5395, 2026

work page 2026
[71]

Bizgeneval: A systematic benchmark for commercial visual content generation

Yan Li, Zezi Zeng, Ziwei Zhou, Xin Gao, Muzhao Tian, Yifan Yang, Mingxi Cheng, Qi Dai, Yuqing Yang, Lili Qiu, et al. Bizgeneval: A systematic benchmark for commercial visual content generation. arXiv preprint arXiv:2603.25732, 2026

work page arXiv 2026
[72]

Uniworld-v2: Reinforce image editing with diffusion negative-aware finetuning and mllm implicit feedback

Zongjian Li, Zheyuan Liu, Qihui Zhang, Bin Lin, Feize Wu, Shenghai Yuan, Zhiyuan Yan, Yang Ye, Wangbo Yu, Yuwei Niu, et al. Uniworld-v2: Reinforce image editing with diffusion negative-aware finetuning and mllm implicit feedback. arXiv preprint arXiv:2510.16888, 2025

work page arXiv 2025
[73]

Mixture-of-transformers: A sparse and scalable architecture for multi-modal foundation models

Weixin Liang, Lili Yu, Liang Luo, Srinivasan Iyer, Ning Dong, Chunting Zhou, Gargi Ghosh, Mike Lewis, Wen-tau Yih, Luke Zettlemoyer, et al. Mixture-of-transformers: A sparse and scalable architecture for multi-modal foundation models. arXiv preprint arXiv:2411.04996, 2024

work page arXiv 2024
[74]

Mogao: An omni foundation model for interleaved multi-modal generation.arXiv preprint arXiv:2505.05472, 2025

Chao Liao, Liyang Liu, Xun Wang, Zhengxiong Luo, Xinyu Zhang, Wenliang Zhao, Jie Wu, Liang Li, Zhi Tian, and Weilin Huang. Mogao: An omni foundation model for interleaved multi-modal generation. arXiv preprint arXiv:2505.05472, 2025

work page arXiv 2025
[75]

UniWorld-V1: High-Resolution Semantic Encoders for Unified Visual Understanding and Generation

Bin Lin, Zongjian Li, Xinhua Cheng, Yuwei Niu, Yang Ye, Xianyi He, Shenghai Yuan, Wangbo Yu, Shaodong Wang, Yunyang Ge, et al. Uniworld: High-resolution semantic encoders for unified visual understanding and generation. arXiv preprint arXiv:2506.03147, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[76]

Toklip: Marry visual tokens to clip for multimodal comprehension and generation

Haokun Lin, Teng Wang, Yixiao Ge, Yuying Ge, Zhichao Lu, Ying Wei, Qingfu Zhang, Zhenan Sun, and Ying Shan. Toklip: Marry visual tokens to clip for multimodal comprehension and generation. arXiv preprint arXiv:2505.05422, 2025

work page arXiv 2025
[77]

Microsoft coco: Common objects in context

Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In European conference on computer vision, pages 740–755, 2014

work page 2014
[78]

Moma: Efficient early-fusion pre-training with mixture of modality-aware experts

Xi Victoria Lin, Akshat Shrivastava, Liang Luo, Srinivasan Iyer, Mike Lewis, Gargi Ghosh, Luke Zettlemoyer, and Armen Agha- janyan. Moma: Efficient early-fusion pre-training with mixture of modality-aware experts. arXiv preprint arXiv:2407.21770, 2024. 38

work page arXiv 2024
[79]

Visual instruction tuning

Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. Advances in Neural Information Processing Systems, 36:34892–34916, 2023

work page 2023
[80]

Flow-GRPO: Training Flow Matching Models via Online RL

Jie Liu, Gongye Liu, Jiajun Liang, Yangguang Li, Jiaheng Liu, Xintao Wang, Pengfei Wan, Di Zhang, and Wanli Ouyang. Flow-grpo: Training flow matching models via online rl. arXiv preprint arXiv:2505.05470, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

Showing first 80 references.