Recognition: 1 theorem link
· Lean TheoremSenseNova-U1: Unifying Multimodal Understanding and Generation with NEO-unify Architecture
Pith reviewed 2026-05-13 05:06 UTC · model grok-4.3
The pith
SenseNova-U1 treats multimodal understanding and generation as synergistic views of one underlying process using the NEO-unify architecture.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We introduce SenseNova-U1, a native unified multimodal paradigm built upon NEO-unify, in which understanding and generation evolve as synergistic views of a single underlying process. Two variants, SenseNova-U1-8B-MoT and SenseNova-U1-A3B-MoT, are built on dense and mixture-of-experts baselines. They match top understanding-only VLMs on text, vision-language perception, knowledge reasoning, agentic decisions, and spatial tasks. At the same time they produce strong semantic consistency and visual fidelity in any-to-image synthesis, complex infographic generation, and interleaved vision-language outputs, with or without explicit think patterns. Preliminary results also show capability in VLA (
What carries the argument
The NEO-unify architecture, which designs understanding and generation to develop as synergistic views of a single underlying process rather than distinct modules or cascaded stages.
If this is right
- The models match top-tier understanding-only VLMs on text understanding, vision-language perception, knowledge reasoning, agentic decision-making, and spatial intelligence.
- They achieve strong semantic consistency and visual fidelity in any-to-image synthesis, text-rich infographic generation, and interleaved vision-language generation.
- Preliminary results extend the same models to vision-language-action and world-model scenarios without separate modules.
- Multimodal systems can move from translating between modalities to thinking and acting across them natively.
Where Pith is reading between the lines
- A working unified process would eliminate the need for separate representation alignment steps between understanding and generation pipelines.
- If capabilities emerge internally, future scaling efforts could focus on one shared training regime instead of maintaining dual specialized systems.
- The approach suggests testing whether the same architecture supports additional modalities or longer-horizon planning without adding explicit bridges.
- Direct measurement of training stability and representation overlap during joint optimization would reveal whether the claimed absence of trade-offs holds in practice.
Load-bearing premise
The split between understanding and generation arises from a structural limitation in current designs rather than an unavoidable engineering necessity, and the NEO-unify architecture can integrate them without hidden costs to alignment or stability.
What would settle it
If the unified SenseNova-U1 models show clear performance drops relative to separate top understanding VLMs on standard benchmarks or to specialized generators on image fidelity and consistency metrics, the claim that unification removes structural barriers would not hold.
read the original abstract
Recent large vision-language models (VLMs) remain fundamentally constrained by a persistent dichotomy: understanding and generation are treated as distinct problems, leading to fragmented architectures, cascaded pipelines, and misaligned representation spaces. We argue that this divide is not merely an engineering artifact, but a structural limitation that hinders the emergence of native multimodal intelligence. Hence, we introduce SenseNova-U1, a native unified multimodal paradigm built upon NEO-unify, in which understanding and generation evolve as synergistic views of a single underlying process. We launch two native unified variants, SenseNova-U1-8B-MoT and SenseNova-U1-A3B-MoT, built on dense (8B) and mixture-of-experts (30B-A3B) understanding baselines, respectively. Designed from first principles, they rival top-tier understanding-only VLMs across text understanding, vision-language perception, knowledge reasoning, agentic decision-making, and spatial intelligence. Meanwhile, they deliver strong semantic consistency and visual fidelity, excelling in conventional or knowledge-intensive any-to-image (X2I) synthesis, complex text-rich infographic generation, and interleaved vision-language generation, with or without think patterns. Beyond performance, we show detailed model design, data preprocessing, pre-/post-training, and inference strategies to support community research. Last but not least, preliminary evidence demonstrates that our models extend beyond perception and generation, performing strongly in vision-language-action (VLA) and world model (WM) scenarios. This points toward a broader roadmap where models do not translate between modalities, but think and act across them in a native manner. Multimodal AI is no longer about connecting separate systems, but about building a unified one and trusting the necessary capabilities to emerge from within.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces SenseNova-U1, a native unified multimodal paradigm built on the NEO-unify architecture, in which understanding and generation are treated as synergistic views of a single underlying process rather than separate problems. It presents two variants (SenseNova-U1-8B-MoT on a dense 8B baseline and SenseNova-U1-A3B-MoT on a 30B-A3B MoE baseline) that claim to match or rival top-tier understanding-only VLMs on text understanding, vision-language perception, knowledge reasoning, agentic decision-making, and spatial intelligence tasks, while also delivering strong performance on any-to-image synthesis, text-rich infographic generation, and interleaved vision-language generation. The manuscript further describes model design, data preprocessing, pre-/post-training, and inference strategies, and provides preliminary evidence of extension to vision-language-action (VLA) and world model (WM) scenarios.
Significance. If the empirical claims and architectural unification hold, the work would be significant for demonstrating that a single native architecture can eliminate the typical fragmentation between understanding and generation pipelines without incurring hidden trade-offs in representation alignment or task performance. The explicit release of design details, data strategies, and preliminary VLA/WM results would support reproducibility and further research toward models that think and act natively across modalities rather than translating between them.
major comments (2)
- [Abstract and §3 (Architecture)] The central unification claim—that the understanding-generation dichotomy is structural rather than an engineering choice and that NEO-unify enables synergistic emergence without trade-offs—requires explicit supporting evidence. The abstract asserts rivaling performance on understanding benchmarks while adding generation capabilities, but no quantitative comparisons, ablation tables, or alignment metrics are referenced to demonstrate that the unified model avoids the expected degradation in either task family.
- [§4 (Model Design) and §5 (Training)] The variants are described as built on 'dense (8B) and mixture-of-experts (30B-A3B) understanding baselines,' yet no details are given on how the NEO-unify modifications (e.g., shared representation spaces or joint training objectives) are implemented or evaluated for stability. Without these, it is impossible to assess whether the reported performance stems from the unification or from scaling the underlying baselines.
minor comments (3)
- [Abstract] The abstract uses the term 'MoT' without expansion; clarify whether this denotes Mixture-of-Transformers, Mixture-of-Experts with specific routing, or another mechanism.
- [§6 (Experiments)] Claims of 'strong semantic consistency and visual fidelity' in generation tasks would benefit from reference to specific metrics (e.g., FID, CLIPScore, or human preference scores) and comparison baselines in the results section.
- [§7 (Extensions)] The preliminary VLA and WM results are described only at high level; adding even summary tables or qualitative examples would strengthen the broader roadmap argument.
Simulated Author's Rebuttal
We thank the referee for the constructive review and for recognizing the potential significance of demonstrating a native unified multimodal architecture. We address the major comments point by point below, providing clarifications from the manuscript and committing to targeted revisions that strengthen the evidence for the unification claims without misrepresenting our results.
read point-by-point responses
-
Referee: [Abstract and §3 (Architecture)] The central unification claim—that the understanding-generation dichotomy is structural rather than an engineering choice and that NEO-unify enables synergistic emergence without trade-offs—requires explicit supporting evidence. The abstract asserts rivaling performance on understanding benchmarks while adding generation capabilities, but no quantitative comparisons, ablation tables, or alignment metrics are referenced to demonstrate that the unified model avoids the expected degradation in either task family.
Authors: We acknowledge that while the manuscript reports competitive results on understanding benchmarks (text, perception, reasoning, agentic, and spatial tasks) alongside strong generation performance (X2I, infographics, interleaved), the initial version did not include dedicated ablations or alignment metrics to isolate the absence of trade-offs. In the revised manuscript we will add a new subsection under §3 that presents: (i) side-by-side benchmark tables comparing SenseNova-U1 variants to their unmodified understanding baselines and to separately trained generation models, (ii) an ablation on the joint training objective versus sequential training, and (iii) quantitative representation alignment metrics (e.g., cross-modal cosine similarity and mutual information scores) before and after unification. These additions will directly support the synergistic-emergence claim. revision: yes
-
Referee: [§4 (Model Design) and §5 (Training)] The variants are described as built on 'dense (8B) and mixture-of-experts (30B-A3B) understanding baselines,' yet no details are given on how the NEO-unify modifications (e.g., shared representation spaces or joint training objectives) are implemented or evaluated for stability. Without these, it is impossible to assess whether the reported performance stems from the unification or from scaling the underlying baselines.
Authors: We agree that greater implementation transparency is required. Although §4 and §5 outline the overall design and training pipeline, they lack the granular specifications needed for full reproducibility. In the revision we will expand both sections with: (1) pseudocode and diagrams detailing the construction of the shared representation space via the MoT layers, (2) the exact formulation of the joint loss combining understanding and generation objectives, and (3) stability measures (gradient norms, learning-rate schedules, and curriculum strategies) together with ablation results that quantify the incremental contribution of each NEO-unify modification over the unmodified baselines. This will allow readers to distinguish unification effects from baseline scaling. revision: yes
Circularity Check
No significant circularity detected
full rationale
The paper introduces SenseNova-U1 as a new unified architecture (NEO-unify) and supports its claims through empirical performance on understanding and generation tasks. No equations, fitted parameters, or derivations are presented in the abstract that reduce to inputs by construction. The central argument—that the understanding-generation divide is structural and that unification enables synergistic emergence—is framed as a design premise validated by results, not a self-referential tautology or self-citation chain. Full text verification would be needed for any hidden steps, but none are detectable from the provided material.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption The divide between understanding and generation is a structural limitation rather than an engineering artifact.
invented entities (1)
-
NEO-unify architecture
no independent evidence
Reference graph
Works this paper leans on
-
[1]
Flamingo: a visual language model for few-shot learning
Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning. Advances in Neural Information Processing Systems, 35:23716–23736, 2022
work page 2022
-
[2]
Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. Qwen3-vl technical report. arXiv preprint arXiv:2511.21631, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
- [3]
-
[4]
Introducing our multimodal models, 2023
Rohan Bavishi, Erich Elsen, Curtis Hawthorne, Maxwell Nye, Augustus Odena, Arushi Somani, and Sa ˘gnak Ta¸ sırlar. Introducing our multimodal models, 2023. URLhttps://www.adept.ai/blog/fuyu-8b
work page 2023
-
[5]
Improving image generation with better captions, 2023
James Betker, Gabriel Goh, Li Jing, Tim Brooks, Jianfeng Wang, Linjie Li, Long Ouyang, Juntang Zhuang, Joyce Lee, Yufei Guo, Wesam Manassra, Prafulla Dhariwal, Casey Chu, Yunxin Jiao, and Aditya Ramesh. Improving image generation with better captions, 2023. URLhttps://cdn.openai.com/papers/dall-e-3.pdf
work page 2023
-
[6]
ByteDance. Seedream 4.5, 2025. URLhttps://seed.bytedance.com/en/seedream4_5
work page 2025
-
[7]
Z-Image: An Efficient Image Generation Foundation Model with Single-Stream Diffusion Transformer
Huanqia Cai, Sihan Cao, Ruoyi Du, Peng Gao, Steven Hoi, Shijie Huang, Zhaohui Hou, Dengyang Jiang, Xin Jin, Liangchen Li, et al. Z-image: An efficient image generation foundation model with single-stream diffusion transformer. arXiv preprint arXiv:2511.22699, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[8]
arXiv preprint arXiv:2505.22705 (2025)
Qi Cai, Jingwen Chen, Yang Chen, Yehao Li, Fuchen Long, Yingwei Pan, Zhaofan Qiu, Yiheng Zhang, Fengbin Gao, Peihan Xu, et al. Hidream-i1: An open-source high-efficient image generative foundation model. arXiv preprint arXiv:2505.22705, 2025
-
[9]
Holistic evaluation of multimodal llms on spatial intelligence
Zhongang Cai, Yubo Wang, Qingping Sun, Ruisi Wang, Chenyang Gu, Wanqi Yin, Zhiqian Lin, Zhitao Yang, Chen Wei, Xuanke Shi, Kewang Deng, Xiaoyang Han, Zukai Chen, Jiaqi Li, Xiangyu Fan, Hanming Deng, Lewei Lu, Bo Li, Ziwei Liu, Quan Wang, Dahua Lin, and Lei Yang. Holistic evaluation of multimodal llms on spatial intelligence. arXiv preprint arXiv:2508.13142, 2025
-
[10]
Scaling spatial intelligence with multimodal foundation models
Zhongang Cai, Ruisi Wang, Chenyang Gu, Fanyi Pu, Junxiang Xu, Yubo Wang, Wanqi Yin, Zhitao Yang, Chen Wei, Qingping Sun, Tongxi Zhou, Jiaqi Li, Hui En Pang, Oscar Qian, Yukun Wei, Zhiqian Lin, Xuanke Shi, Kewang Deng, Xiaoyang Han, Zukai Chen, Xiangyu Fan, Hanming Deng, Lewei Lu, Liang Pan, Bo Li, Ziwei Liu, Quan Wang, Dahua Lin, and Lei Yang. Scaling spa...
work page 2026
-
[11]
Hunyuanimage 3.0 technical report.arXiv preprint arXiv:2509.23951, 2025
Siyu Cao, Hangting Chen, Peng Chen, Yiji Cheng, Yutao Cui, Xinchi Deng, Ying Dong, Kipper Gong, Tianpeng Gu, Xiusen Gu, et al. Hunyuanimage 3.0 technical report. arXiv preprint arXiv:2509.23951, 2025
-
[12]
Oneig-bench: Omni-dimensional nuanced evaluation for image generation
Jingjing Chang, Yixiao Fang, Peng Xing, Shuhan Wu, Wei Cheng, Rui Wang, Xianfang Zeng, Gang Yu, and Hai-Bao Chen. Oneig-bench: Omni-dimensional nuanced evaluation for image generation. arXiv preprint arXiv:2506.07977, 2025
-
[13]
Aligning visual foundation encoders to tokenizers for diffusion models
Bowei Chen, Sai Bi, Hao Tan, He Zhang, Tianyuan Zhang, Zhengqi Li, Yuanjun Xiong, Jianming Zhang, and Kai Zhang. Aligning visual foundation encoders to tokenizers for diffusion models. arXiv preprint arXiv:2509.25162, 2025
-
[14]
BLIP3-o: A Family of Fully Open Unified Multimodal Models-Architecture, Training and Dataset
Jiuhai Chen, Zhiyang Xu, Xichen Pan, Yushi Hu, Can Qin, Tom Goldstein, Lifu Huang, Tianyi Zhou, Saining Xie, Silvio Savarese, Le Xue, Caiming Xiong, and Ran Xu. Blip3-o: A family of fully open unified multimodal models-architecture, training and dataset. arXiv preprint arXiv:2505.09568, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[15]
Francois Chollet, Mike Knoop, Gregory Kamradt, Bryan Landers, and Henry Pinkard
Liang Chen, Weichu Xie, Yiyan Liang, Hongfeng He, Hans Zhao, Zhibo Yang, Zhiqi Huang, Haoning Wu, Haoyu Lu, Yiping Bao, et al. Babyvision: Visual reasoning beyond language. arXiv preprint arXiv:2601.06521, 2026
-
[16]
Lin Chen, Jinsong Li, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Zehui Chen, Haodong Duan, Jiaqi Wang, Yu Qiao, Dahua Lin, et al. Are we on the right way for evaluating large vision-language models? Advances in Neural Information Processing Systems, 37:27056–27087, 2024
work page 2024
-
[17]
Pixelflow: Pixel-space generative models with flow
Shoufa Chen, Chongjian Ge, Shilong Zhang, Peize Sun, and Ping Luo. Pixelflow: Pixel-space generative models with flow. arXiv preprint arXiv:2504.07963, 2025
-
[18]
Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling
Xiaokang Chen, Zhiyu Wu, Xingchao Liu, Zizheng Pan, Wen Liu, Zhenda Xie, Xingkai Yu, and Chong Ruan. Janus-pro: Unified multimodal understanding and generation with data and model scaling. arXiv preprint arXiv:2501.17811, 2025. 35
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[19]
Yangyi Chen, Xingyao Wang, Hao Peng, and Heng Ji. A single transformer for scalable vision-language modeling.Transactions on Machine Learning Research, 2024
work page 2024
-
[20]
Dip: Taming diffusion models in pixel space
Zhennan Chen, Junwei Zhu, Xu Chen, Jiangning Zhang, Xiaobin Hu, Hanzhen Zhao, Chengjie Wang, Jian Yang, and Ying Tai. Dip: Taming diffusion models in pixel space. arXiv preprint arXiv:2511.18822, 2025
-
[21]
Anole: An open, autoregressive, native large multimodal models for interleaved image-text generation
Ethan Chern, Jiadi Su, Yan Ma, and Pengfei Liu. Anole: An open, autoregressive, native large multimodal models for interleaved image-text generation. arXiv preprint arXiv:2407.06135, 2024
-
[22]
Paddleocr 3.0 technical report
Cheng Cui, Ting Sun, Manhui Lin, Tingquan Gao, Yubo Zhang, Jiaxuan Liu, Xueqing Wang, Zelun Zhang, Changda Zhou, Hongen Liu, et al. Paddleocr 3.0 technical report. arXiv preprint arXiv:2507.05595, 2025
- [23]
-
[24]
Gemini 3 pro image model card, Nov 2025
Google DeepMind. Gemini 3 pro image model card, Nov 2025. URL https://storage.googleapis.com/ deepmind-media/Model-Cards/Gemini-3-Pro-Image-Model-Card.pdf
work page 2025
-
[25]
Google DeepMind. Gemini 2.0 flash, 2025. URL https://developers.googleblog.com/en/ experiment-with-gemini-20-flash-native-image-generation
work page 2025
-
[26]
Gemini 2.5 flash & gemini 2.5 flash image model card, 2025
Google DeepMind. Gemini 2.5 flash & gemini 2.5 flash image model card, 2025. URL https://storage.googleapis. com/deepmind-media/Model-Cards/Gemini-2-5-Flash-Model-Card.pdf
work page 2025
-
[27]
Gemma 4: Byte for byte, the most capable open models, 2026
Google DeepMind. Gemma 4: Byte for byte, the most capable open models, 2026. URL https://blog.google/ innovation-and-ai/technology/developers-tools/gemma-4/
work page 2026
-
[28]
Emerging Properties in Unified Multimodal Pretraining
Chaorui Deng, Deyao Zhu, Kunchang Li, Chenhui Gou, Feng Li, Zeyu Wang, Shu Zhong, Weihao Yu, Xiaonan Nie, Ziang Song, Guang Shi, and Haoqi Fan. Emerging properties in unified multimodal pretraining. arXiv preprint arXiv:2505.14683, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[29]
Unveiling encoder-free vision-language models
Haiwen Diao, Yufeng Cui, Xiaotong Li, Yueze Wang, Huchuan Lu, and Xinlong Wang. Unveiling encoder-free vision-language models. Advances in Neural Information Processing Systems, 37:52545–52567, 2024
work page 2024
-
[30]
From pixels to words–towards native vision-language primitives at scale
Haiwen Diao, Mingxuan Li, Silei Wu, Linjun Dai, Xiaohua Wang, Hanming Deng, Lewei Lu, Dahua Lin, and Ziwei Liu. From pixels to words–towards native vision-language primitives at scale. arXiv preprint arXiv:2510.14979, 2025
-
[31]
Evev2: Improved baselines for encoder-free vision-language models
Haiwen Diao, Xiaotong Li, Yufeng Cui, Yueze Wang, Haoge Deng, Ting Pan, Wenxuan Wang, Huchuan Lu, and Xinlong Wang. Evev2: Improved baselines for encoder-free vision-language models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 21014–21025, 2025
work page 2025
-
[32]
arXiv preprint arXiv:2503.23461 (2025)
Nikai Du, Zhennan Chen, Shan Gao, Zhizhou Chen, Xi Chen, Zhengkai Jiang, Jian Yang, and Ying Tai. Textcrafter: Accurately rendering multiple texts in complex visual scenes. arXiv preprint arXiv:2503.23461, 2025
-
[33]
Supergpqa: Scaling llm evaluation across 285 graduate disciplines, 2025
Xinrun Du, Yifan Yao, Kaijing Ma, Bingli Wang, Tianyu Zheng, King Zhu, Minghao Liu, Yiming Liang, Xiaolong Jin, Zhenlin Wei, et al. Supergpqa: Scaling llm evaluation across 285 graduate disciplines. arXiv preprint arXiv:2502.14739, 2025
-
[34]
Scaling rectified flow transformers for high-resolution image synthesis
Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow transformers for high-resolution image synthesis. In International Conference on Machine Learning, 2024
work page 2024
-
[35]
The prism hypothesis: Harmonizing semantic and pixel representations via unified autoencoding
Weichen Fan, Haiwen Diao, Quan Wang, Dahua Lin, and Ziwei Liu. The prism hypothesis: Harmonizing semantic and pixel representations via unified autoencoding. arXiv preprint arXiv:2512.19693, 2025
-
[36]
Phased dmd: Few-step distribution matching distillation via score matching within subintervals
Xiangyu Fan, Zesong Qiu, Zhuguanyu Wu, Fanzhou Wang, Zhiqian Lin, Tianxiang Ren, Dahua Lin, Ruihao Gong, and Lei Yang. Phased dmd: Few-step distribution matching distillation via score matching within subintervals. arXiv preprint arXiv:2510.27684, 2025
-
[37]
Ling Fu, Zhebin Kuang, Jiajun Song, Mingxin Huang, Biao Yang, Yuzhe Li, Linghao Zhu, Qidi Luo, Xinyu Wang, Hao Lu, et al. Ocrbench v2: An improved benchmark for evaluating large multimodal models on visual text localization and reasoning. arXiv preprint arXiv:2501.00321, 2024
-
[38]
Peiyuan Gao, Le Zhuo, Ziyi Lin, Chris Liu, Junsong Chen, Ruoyi Du, Enze Luo, Long Zhang, Guo Chen, Shengyu Zhang, et al. Seedream 3.0 technical report. arXiv preprint arXiv:2504.11346, 2025
work page internal anchor Pith review arXiv 2025
-
[39]
Making llama see and draw with seed tokenizer
Yuying Ge, Sijie Zhao, Ziyun Zeng, Yixiao Ge, Chen Li, Xintao Wang, and Ying Shan. Making llama see and draw with seed tokenizer. arXiv preprint arXiv:2310.01218, 2023. 36
-
[40]
Seed-x: Multimodal models with unified multi-granularity comprehension and generation
Yuying Ge, Sijie Zhao, Jinguo Zhu, Yixiao Ge, Kun Yi, Lin Song, Chen Li, Xiaohan Ding, and Ying Shan. Seed-x: Multimodal models with unified multi-granularity comprehension and generation. arXiv preprint arXiv:2404.14396, 2024
-
[41]
Aryo Pradipta Gema, Joshua Ong Jun Leang, Giwon Hong, Alessio Devoto, Alberto Carlo Maria Mancino, Rohit Saxena, Xuanli He, Yu Zhao, Xiaotang Du, Mohammad Reza Ghasemi Madani, et al. Are we done with mmlu? In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technolog...
work page 2025
-
[42]
arXiv preprint arXiv:2507.22058 (2025)
Zigang Geng, Yibing Wang, Yeyao Ma, Chen Li, Yongming Rao, Shuyang Gu, Zhao Zhong, Qinglin Lu, Han Hu, Xiaosong Zhang, et al. X-omni: Reinforcement learning makes discrete autoregressive image generative models great again. arXiv preprint arXiv:2507.22058, 2025
-
[43]
Geneval: An object-focused framework for evaluating text-to-image alignment
Dhruba Ghosh, Hannaneh Hajishirzi, and Ludwig Schmidt. Geneval: An object-focused framework for evaluating text-to-image alignment. Advances in Neural Information Processing Systems, 36:52132–52152, 2023
work page 2023
-
[44]
Past-future scheduler for llm serving under sla guarantees
Ruihao Gong, Shihao Bai, Siyu Wu, Yunqian Fan, Zaijun Wang, Xiuhong Li, Hailong Yang, and Xianglong Liu. Past-future scheduler for llm serving under sla guarantees. In Proceedings of the 30th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, V olume2, pages 798–813, 2025. ISBN 9798400710797
work page 2025
-
[45]
Google. Imagen 4 model card, 2025. URL https://storage.googleapis.com/deepmind-media/Model-Cards/ Imagen-4-Model-Card.pdf
work page 2025
-
[46]
Nano banana 2: Combining pro capabilities with lightning-fast speed, 2026
Google. Nano banana 2: Combining pro capabilities with lightning-fast speed, 2026. URL https://blog.google/ innovation-and-ai/technology/ai/nano-banana-2/
work page 2026
-
[47]
Gemini 3 Pro Model Card, November 2025
Google DeepMind. Gemini 3 Pro Model Card, November 2025. URL https://storage.googleapis.com/ deepmind-media/Model-Cards/Gemini-3-Pro-Model-Card.pdf
work page 2025
-
[48]
Thinkmorph: Emergent properties in multimodal interleaved chain-of-thought reasoning
Jiawei Gu, Yunzhuo Hao, Huichen Will Wang, Linjie Li, Michael Qizhe Shieh, Yejin Choi, Ranjay Krishna, and Yu Cheng. Thinkmorph: Emergent properties in multimodal interleaved chain-of-thought reasoning. arXiv preprint arXiv:2510.27492, 2025
-
[49]
Tianrui Guan, Fuxiao Liu, Xiyang Wu, Ruiqi Xian, Zongxia Li, Xiaoyu Liu, Xijun Wang, Lichang Chen, Furong Huang, Yaser Yacoob, et al. Hallusionbench: an advanced diagnostic suite for entangled language hallucination and visual illusion in large vision-language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pa...
work page 2024
-
[50]
Infinity: Scaling bitwise autoregressive modeling for high-resolution image synthesis
Jian Han, Jinlai Liu, Yi Jiang, Bin Yan, Yuqi Zhang, Zehuan Yuan, Bingyue Peng, and Xiaobing Liu. Infinity: Scaling bitwise autoregressive modeling for high-resolution image synthesis. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 15733–15744, 2025
work page 2025
- [51]
-
[52]
Wenyi Hong, Wenmeng Yu, Xiaotao Gu, Guo Wang, Guobing Gan, Haomiao Tang, Jiale Cheng, Ji Qi, Junhui Ji, Lihang Pan, et al. Glm-4.5 v and glm-4.1 v-thinking: Towards versatile multimodal reasoning with scalable reinforcement learning. arXiv preprint arXiv:2507.01006, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[53]
ELLA: Equip Diffusion Models with LLM for Enhanced Semantic Alignment
Xiwei Hu, Rui Wang, Yixiao Fang, Bin Fu, Pei Cheng, and Gang Yu. Ella: Equip diffusion models with llm for enhanced semantic alignment. arXiv preprint arXiv:2403.05135, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[54]
Illume+: Illuminating unified mllm with dual visual tokenization and diffusion refinement
Runhui Huang, Chunwei Wang, Junwei Yang, Guansong Lu, Yunlong Yuan, Jianhua Han, Lu Hou, Wei Zhang, Lanqing Hong, Hengshuang Zhao, et al. Illume+: Illuminating unified mllm with dual visual tokenization and diffusion refinement. arXiv preprint arXiv:2504.01934, 2025
-
[55]
C-eval: A multi-level multi-discipline chinese evaluation suite for foundation models
Yuzhen Huang, Yuzhuo Bai, Zhihao Zhu, Junlei Zhang, Jinghan Zhang, Tangjun Su, Junteng Liu, Chuancheng Lv, Yikai Zhang, Yao Fu, et al. C-eval: A multi-level multi-discipline chinese evaluation suite for foundation models. Advances in Neural Information Processing Systems, 36:62991–63010, 2023
work page 2023
-
[56]
Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card. arXiv preprint arXiv:2410.21276, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[57]
Kimi K2: Open Agentic Intelligence
Kimi Team. Kimi K2: Open agentic intelligence. arXiv preprint arXiv:2507.20534, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[58]
Auto-Encoding Variational Bayes
Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013
work page internal anchor Pith review Pith/arXiv arXiv 2013
-
[59]
Kuaishou Kolors Team. Kolors 2.0, 2025. URLhttps://kolors.kuaishou.com/. 37
work page 2025
-
[60]
Black Forest Labs. Flux, 2024. URLhttps://github.com/black-forest-labs/flux
work page 2024
-
[61]
FLUX.2: Frontier Visual Intelligence, 2025
Black Forest Labs. FLUX.2: Frontier Visual Intelligence, 2025. URLhttps://bfl.ai/blog/flux-2
work page 2025
-
[62]
Black Forest Labs, Stephen Batifol, Andreas Blattmann, Frederic Boesel, Saksham Consul, Cyril Diagne, Tim Dockhorn, Jack English, Zion English, Patrick Esser, et al. Flux. 1 kontext: Flow matching for in-context image generation and editing in latent space. arXiv preprint arXiv:2506.15742, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[63]
Weixian Lei, Jiacong Wang, Haochen Wang, Xiangtai Li, Jun Hao Liew, Jiashi Feng, and Zilong Huang. The scalability of simplicity: Empirical analysis of vision-language learning with a single transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 20758–20769, 2025
work page 2025
-
[64]
Repa-e: Unlocking vae for end-to-end tuning of latent diffusion transformers
Xingjian Leng, Jaskirat Singh, Yunzhong Hou, Zhenchang Xing, Saining Xie, and Liang Zheng. Repa-e: Unlocking vae for end-to-end tuning of latent diffusion transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 18262–18272, 2025
work page 2025
-
[65]
Dingming Li, Hongxing Li, Zixuan Wang, Yuchen Yan, Hang Zhang, Siqi Chen, Guiyang Hou, Shengpei Jiang, Wenqi Zhang, Yongliang Shen, et al. Viewspatial-bench: Evaluating multi-perspective spatial localization in vision-language models. arXiv preprint arXiv:2505.21500, 2025
-
[66]
Onecat: Decoder-only auto-regressive model for unified understanding and generation
Han Li, Xinyu Peng, Yaoming Wang, Zelin Peng, Xin Chen, Rongxiang Weng, Jingang Wang, Xunliang Cai, Wenrui Dai, and Hongkai Xiong. Onecat: Decoder-only auto-regressive model for unified understanding and generation. arXiv preprint arXiv:2509.03498, 2025
-
[67]
Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In International conference on machine learning, pages 19730–19742. PMLR, 2023
work page 2023
-
[68]
arXiv preprint arXiv:2511.01833 (2025)
Ming Li, Jike Zhong, Shitian Zhao, Haoquan Zhang, Shaoheng Lin, Yuxiang Lai, Chen Wei, Konstantinos Psounis, and Kaipeng Zhang. Tir-bench: A comprehensive benchmark for agentic thinking-with-images reasoning. arXiv preprint arXiv:2511.01833, 2025
-
[69]
Back to Basics: Let Denoising Generative Models Denoise
Tianhong Li and Kaiming He. Back to basics: Let denoising generative models denoise. arXiv preprint arXiv:2511.13720, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[70]
Breen: bridge data-efficient encoder-free multimodal learning with learnable queries
Tianle Li, Yongming Rao, Winston Hu, and Yu Cheng. Breen: bridge data-efficient encoder-free multimodal learning with learnable queries. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 5384–5395, 2026
work page 2026
-
[71]
Bizgeneval: A systematic benchmark for commercial visual content generation
Yan Li, Zezi Zeng, Ziwei Zhou, Xin Gao, Muzhao Tian, Yifan Yang, Mingxi Cheng, Qi Dai, Yuqing Yang, Lili Qiu, et al. Bizgeneval: A systematic benchmark for commercial visual content generation. arXiv preprint arXiv:2603.25732, 2026
-
[72]
Zongjian Li, Zheyuan Liu, Qihui Zhang, Bin Lin, Feize Wu, Shenghai Yuan, Zhiyuan Yan, Yang Ye, Wangbo Yu, Yuwei Niu, et al. Uniworld-v2: Reinforce image editing with diffusion negative-aware finetuning and mllm implicit feedback. arXiv preprint arXiv:2510.16888, 2025
-
[73]
Mixture-of-transformers: A sparse and scalable architecture for multi-modal foundation models
Weixin Liang, Lili Yu, Liang Luo, Srinivasan Iyer, Ning Dong, Chunting Zhou, Gargi Ghosh, Mike Lewis, Wen-tau Yih, Luke Zettlemoyer, et al. Mixture-of-transformers: A sparse and scalable architecture for multi-modal foundation models. arXiv preprint arXiv:2411.04996, 2024
-
[74]
Chao Liao, Liyang Liu, Xun Wang, Zhengxiong Luo, Xinyu Zhang, Wenliang Zhao, Jie Wu, Liang Li, Zhi Tian, and Weilin Huang. Mogao: An omni foundation model for interleaved multi-modal generation. arXiv preprint arXiv:2505.05472, 2025
-
[75]
UniWorld-V1: High-Resolution Semantic Encoders for Unified Visual Understanding and Generation
Bin Lin, Zongjian Li, Xinhua Cheng, Yuwei Niu, Yang Ye, Xianyi He, Shenghai Yuan, Wangbo Yu, Shaodong Wang, Yunyang Ge, et al. Uniworld: High-resolution semantic encoders for unified visual understanding and generation. arXiv preprint arXiv:2506.03147, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[76]
Toklip: Marry visual tokens to clip for multimodal comprehension and generation
Haokun Lin, Teng Wang, Yixiao Ge, Yuying Ge, Zhichao Lu, Ying Wei, Qingfu Zhang, Zhenan Sun, and Ying Shan. Toklip: Marry visual tokens to clip for multimodal comprehension and generation. arXiv preprint arXiv:2505.05422, 2025
-
[77]
Microsoft coco: Common objects in context
Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In European conference on computer vision, pages 740–755, 2014
work page 2014
-
[78]
Moma: Efficient early-fusion pre-training with mixture of modality-aware experts
Xi Victoria Lin, Akshat Shrivastava, Liang Luo, Srinivasan Iyer, Mike Lewis, Gargi Ghosh, Luke Zettlemoyer, and Armen Agha- janyan. Moma: Efficient early-fusion pre-training with mixture of modality-aware experts. arXiv preprint arXiv:2407.21770, 2024. 38
-
[79]
Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. Advances in Neural Information Processing Systems, 36:34892–34916, 2023
work page 2023
-
[80]
Flow-GRPO: Training Flow Matching Models via Online RL
Jie Liu, Gongye Liu, Jiajun Liang, Yangguang Li, Jiaheng Liu, Xintao Wang, Pengfei Wan, Di Zhang, and Wanli Ouyang. Flow-grpo: Training flow matching models via online rl. arXiv preprint arXiv:2505.05470, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.