Awaking Spatial Intelligence in Unified Multimodal Understanding and Generation
Pith reviewed 2026-05-08 17:51 UTC · model grok-4.3
The pith
A bidirectional loop between understanding, spatial editing, and novel-view reasoning strengthens spatial intelligence beyond general visual tasks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The bidirectional loop between enhanced understanding, controllable spatial editing, and novel-view-assisted reasoning enables the model to move beyond general visual competence toward stronger spatial intelligence.
What carries the argument
The bidirectional loop between enhanced understanding, controllable spatial editing, and novel-view-assisted reasoning, which lets perception and generation interact through a shared multimodal interface.
Load-bearing premise
That the integrated architecture and training recipe actually deliver measurable improvements in spatial intelligence that separately trained understanding and generation models cannot achieve.
What would settle it
A direct comparison of spatial reasoning benchmarks between the unified model and separate understanding plus generation models, with ablations removing the bidirectional interactions or spatial data.
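As an illustration of that decisive test, a minimal comparison harness might look like the sketch below. Every variant name, benchmark name, and score is a hypothetical placeholder, not a result from the paper; the point is only the structure of the comparison.

```python
# Hedged sketch: compare the full unified model against ablations on a
# spatial-reasoning benchmark. All names and numbers are made up for
# illustration; they are not results from the paper.
from dataclasses import dataclass

@dataclass
class Result:
    variant: str     # which model configuration was evaluated
    benchmark: str   # which benchmark it was scored on
    score: float     # higher is better

def loop_gap(results: list[Result], benchmark: str) -> float:
    """Gap between the full unified model and the best ablated variant.

    A positive gap on spatial benchmarks (and not on general ones) would
    support the bidirectional-loop claim; a near-zero gap would suggest the
    gains come from the spatially grounded data alone.
    """
    by_variant = {r.variant: r.score for r in results if r.benchmark == benchmark}
    full = by_variant.pop("full_unified_model")
    return full - max(by_variant.values())

# Hypothetical scores for illustration only.
results = [
    Result("full_unified_model",    "spatial_reasoning_bench", 71.2),
    Result("no_bidirectional_loop", "spatial_reasoning_bench", 64.8),
    Result("separate_und_plus_gen", "spatial_reasoning_bench", 63.5),
    Result("no_spatial_data",       "spatial_reasoning_bench", 62.9),
]
print(f"gap attributable to the loop: {loop_gap(results, 'spatial_reasoning_bench'):+.1f}")
```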
read the original abstract
We present JoyAI-Image, a unified multimodal foundation model for visual understanding, text-to-image generation, and instruction-guided image editing. JoyAI-Image couples a spatially enhanced Multimodal Large Language Model (MLLM) with a Multimodal Diffusion Transformer (MMDiT), allowing perception and generation to interact through a shared multimodal interface. Around this architecture, we build a scalable training recipe that combines unified instruction tuning, long-text rendering supervision, spatially grounded data, and both general and spatial editing signals. This design gives the model broad multimodal capability while strengthening geometry-aware reasoning and controllable visual synthesis. Experiments across understanding, generation, long-text rendering, and editing benchmarks show that JoyAI-Image achieves state-of-the-art or highly competitive performance. More importantly, the bidirectional loop between enhanced understanding, controllable spatial editing, and novel-view-assisted reasoning enables the model to move beyond general visual competence toward stronger spatial intelligence. These results suggest a promising path for unified visual models in downstream applications such as vision-language-action systems and world models.
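To make the described coupling concrete, the following is a minimal sketch of one plausible reading of the architecture: the MLLM's hidden states are projected into the MMDiT's conditioning space and consumed via cross-attention. The module names, dimensions, and wiring here are illustrative assumptions, not the paper's implementation.

```python
# Hedged sketch of the MLLM -> shared interface -> MMDiT coupling described
# in the abstract. Names, sizes, and layers are assumptions for illustration.
import torch
import torch.nn as nn

class SharedInterface(nn.Module):
    """Project MLLM hidden states into the diffusion model's conditioning space."""
    def __init__(self, mllm_dim: int = 4096, dit_dim: int = 2048):
        super().__init__()
        self.proj = nn.Linear(mllm_dim, dit_dim)

    def forward(self, mllm_hidden: torch.Tensor) -> torch.Tensor:
        return self.proj(mllm_hidden)  # (batch, seq, dit_dim)

class ToyMMDiT(nn.Module):
    """Stand-in denoiser: image latents cross-attend to the shared condition."""
    def __init__(self, dit_dim: int = 2048, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dit_dim, heads, batch_first=True)
        self.out = nn.Linear(dit_dim, dit_dim)

    def forward(self, latents: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        mixed, _ = self.attn(latents, cond, cond)  # perception conditions generation
        return self.out(mixed)                     # predicted denoising update

# This covers only the perception-to-generation direction; the claimed loop
# also feeds generated/edited views back into the MLLM for novel-view reasoning.
mllm_hidden = torch.randn(1, 77, 4096)   # pretend MLLM output states
latents = torch.randn(1, 256, 2048)      # pretend noisy image latents
cond = SharedInterface()(mllm_hidden)
update = ToyMMDiT()(latents, cond)
print(update.shape)  # torch.Size([1, 256, 2048])
```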
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents JoyAI-Image, a unified multimodal foundation model coupling a spatially enhanced MLLM with an MMDiT via a shared interface for visual understanding, text-to-image generation, and instruction-guided editing. It describes a training recipe combining unified instruction tuning, long-text rendering supervision, spatially grounded data, and general/spatial editing signals. The manuscript reports SOTA or competitive performance across understanding, generation, long-text rendering, and editing benchmarks, with the central claim that the bidirectional loop between enhanced understanding, controllable spatial editing, and novel-view-assisted reasoning produces stronger spatial intelligence beyond general visual competence.
Significance. If the central claims hold with proper validation, this would mark a meaningful advance in unified multimodal models by demonstrating how bidirectional perception-generation loops can strengthen spatial reasoning, with direct relevance to downstream uses such as vision-language-action systems and world models. The shared MLLM/MMDiT interface and the scalable training recipe that incorporates spatially grounded signals constitute a clear engineering strength.
major comments (2)
- [Abstract and Experiments] The claim that the bidirectional loop (via the shared interface, editing signals, and novel-view reasoning) enables stronger spatial intelligence is not supported by any ablation studies or controlled comparisons (e.g., the full model versus a version lacking the editing-to-understanding feedback or the shared multimodal interface). The observed benchmark gains could therefore be attributable entirely to the spatially grounded training corpus rather than to the asserted loop.
- [Abstract] While SOTA or competitive performance is asserted on understanding, generation, and editing benchmarks, no specific metrics, baselines, error bars, or ablation tables are provided to ground these claims or to quantify the spatial-intelligence improvements.
minor comments (2)
- [Abstract] The abstract is lengthy and would benefit from including one or two concrete quantitative results to illustrate the performance claims.
- Notation for the shared multimodal interface and the precise definition of 'spatial intelligence' should be introduced earlier and used consistently throughout the text.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on JoyAI-Image. The comments highlight important areas for strengthening the evidence behind our claims on spatial intelligence and for making quantitative results more explicit. We will revise the manuscript to address these points directly.
read point-by-point responses
- Referee: [Abstract and Experiments] The claim that the bidirectional loop (via the shared interface, editing signals, and novel-view reasoning) enables stronger spatial intelligence is not supported by any ablation studies or controlled comparisons (e.g., the full model versus a version lacking the editing-to-understanding feedback or the shared multimodal interface). The observed benchmark gains could therefore be attributable entirely to the spatially grounded training corpus rather than to the asserted loop.
Authors: We agree that the manuscript would be strengthened by explicit ablation studies isolating the bidirectional loop. The current version relies on the integrated training recipe (unified instruction tuning, spatially grounded data, and editing signals) to support the claim, but does not present controlled variants that remove the shared MLLM-MMDiT interface or the editing-to-understanding feedback. In the revision we will add such ablations, including comparisons of the full model against versions trained without the shared interface or without the novel-view-assisted reasoning signals, to quantify the loop's contribution beyond the data alone (a sketch of such an ablation grid follows these responses). revision: yes
- Referee: [Abstract] While SOTA or competitive performance is asserted on understanding, generation, and editing benchmarks, no specific metrics, baselines, error bars, or ablation tables are provided to ground these claims or to quantify the spatial-intelligence improvements.
Authors: We acknowledge that the abstract currently summarizes results qualitatively. The full experiments section contains detailed tables with benchmark scores, baselines, and comparisons across understanding, generation, long-text rendering, and editing tasks. In the revised manuscript we will update the abstract to include key quantitative metrics (e.g., specific scores on relevant benchmarks) and references to the corresponding tables and baselines. We will also ensure any reported variance or error bars from our experiments are noted where applicable. revision: yes
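The ablation grid promised in the first response could be organized along the following lines. The flag names are hypothetical (not the authors' configuration), and the intent is that only the component under test varies between otherwise identical training runs.

```python
# Hedged sketch of an ablation grid isolating each link in the claimed loop.
# Flag names are hypothetical placeholders, not the paper's configuration.
ABLATIONS = {
    "full_model":           dict(shared_interface=True,  edit_to_und_feedback=True,  novel_view_signal=True),
    "no_shared_interface":  dict(shared_interface=False, edit_to_und_feedback=True,  novel_view_signal=True),
    "no_edit_feedback":     dict(shared_interface=True,  edit_to_und_feedback=False, novel_view_signal=True),
    "no_novel_view_signal": dict(shared_interface=True,  edit_to_und_feedback=True,  novel_view_signal=False),
}

for name, flags in ABLATIONS.items():
    disabled = [k for k, v in flags.items() if not v]
    # Data, compute, and schedule stay fixed; only the listed component changes.
    print(f"{name}: disable {disabled or 'nothing'}")
```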
Circularity Check
No circular derivation; architectural claims are self-contained empirical description
full rationale
The paper presents a unified model (MLLM + MMDiT) and training recipe (unified instruction tuning, spatially grounded data, editing signals) whose claimed outcome is enhanced spatial intelligence via bidirectional interaction. No equations, first-principles derivations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the abstract or described content. The central statement that the loop 'enables the model to move beyond general visual competence' is an empirical assertion tied to benchmark performance, not a reduction of the result to its inputs by construction. The derivation chain is therefore self-contained and does not exhibit any of the enumerated circularity patterns.