{"work":{"id":"d3153e5f-b6e2-4ab3-9f41-e24e24d64496","openalex_id":null,"doi":null,"arxiv_id":"2506.18871","raw_key":null,"title":"OmniGen2: Towards Instruction-Aligned Multimodal Generation","authors":null,"authors_text":null,"year":2025,"venue":"cs.CV","abstract":"In this work, we introduce OmniGen2, a versatile and open-source generative model designed to provide a unified solution for diverse generation tasks, including text-to-image, image editing, and in-context generation. Unlike OmniGen v1, OmniGen2 features two distinct decoding pathways for text and image modalities, utilizing unshared parameters and a decoupled image tokenizer. This design enables OmniGen2 to build upon existing multimodal understanding models without the need to re-adapt VAE inputs, thereby preserving the original text generation capabilities. To facilitate the training of OmniGen2, we developed comprehensive data construction pipelines, encompassing image editing and in-context generation data. Additionally, we introduce a reflection mechanism tailored for image generation tasks and curate a dedicated reflection dataset based on OmniGen2. Despite its relatively modest parameter size, OmniGen2 achieves competitive results on multiple task benchmarks, including text-to-image and image editing. To further evaluate in-context generation, also referred to as subject-driven tasks, we introduce a new benchmark named OmniContext. OmniGen2 achieves state-of-the-art performance among open-source models in terms of consistency. We will release our models, training code, datasets, and data construction pipeline to support future research in this field. Project Page: https://vectorspacelab.github.io/OmniGen2; GitHub Link: https://github.com/VectorSpaceLab/OmniGen2","external_url":"https://arxiv.org/abs/2506.18871","cited_by_count":null,"metadata_source":"pith","metadata_fetched_at":"2026-05-15T03:24:56.077405+00:00","pith_arxiv_id":"2506.18871","created_at":"2026-05-09T06:55:43.677127+00:00","updated_at":"2026-05-15T03:24:56.077405+00:00","title_quality_ok":true,"display_title":"OmniGen2: Towards Instruction-Aligned Multimodal Generation","render_title":"OmniGen2: Towards Instruction-Aligned Multimodal Generation"},"hub":{"state":{"work_id":"d3153e5f-b6e2-4ab3-9f41-e24e24d64496","tier":"hub","tier_reason":"10+ Pith inbound or 1,000+ external citations","pith_inbound_count":50,"external_cited_by_count":null,"distinct_field_count":4,"first_pith_cited_at":"2025-06-18T15:39:15+00:00","last_pith_cited_at":"2026-05-14T17:58:19+00:00","author_build_status":"not_needed","summary_status":"needed","contexts_status":"needed","graph_status":"needed","ask_index_status":"not_needed","reader_status":"not_needed","recognition_status":"not_needed","updated_at":"2026-05-15T05:37:29.637816+00:00","tier_text":"hub"},"tier":"hub","role_counts":[{"context_role":"background","n":2},{"context_role":"baseline","n":2}],"polarity_counts":[{"context_polarity":"baseline","n":2},{"context_polarity":"background","n":1},{"context_polarity":"support","n":1}],"runs":{"context_extract":{"job_type":"context_extract","status":"succeeded","result":{"enqueued_papers":25},"error":null,"updated_at":"2026-05-14T13:51:16.699044+00:00"},"graph_features":{"job_type":"graph_features","status":"succeeded","result":{"co_cited":[{"title":"Emerging Properties in Unified Multimodal Pretraining","work_id":"e0cfd82c-f5d4-44fd-b531-ec73ab0a805b","shared_citers":32},{"title":"Qwen-Image Technical Report","work_id":"d06d7ecc-7579-4f89-a60b-4278a0f3c562","shared_citers":28},{"title":"Step1X-Edit: A Practical Framework for General Image Editing","work_id":"3392f2c8-a1cb-4d6c-8c82-2cdccffa33f9","shared_citers":26},{"title":"UniWorld-V1: High-Resolution Semantic Encoders for Unified Visual Understanding and Generation","work_id":"488a273e-95d8-46f1-87c7-2244068d00d0","shared_citers":23},{"title":"BLIP3-o: A Family of Fully Open Unified Multimodal Models-Architecture, Training and Dataset","work_id":"86d896d2-592f-4d9b-938e-dfeb11f9388f","shared_citers":20},{"title":"Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling","work_id":"67d9e391-26d1-459e-ab56-07e60511c886","shared_citers":20},{"title":"FLUX.1 Kontext: Flow Matching for In-Context Image Generation and Editing in Latent Space","work_id":"5dfe19d5-3541-4803-8fe9-3c8b9e29b281","shared_citers":19},{"title":"Emu3: Next-Token Prediction is All You Need","work_id":"720d288e-fac0-464c-9929-19efd9a52afc","shared_citers":18},{"title":"Qwen2.5-VL Technical Report","work_id":"69dffacb-bfe8-442d-be86-48624c60426f","shared_citers":16},{"title":"Show-o2: Improved Native Unified Multimodal Models","work_id":"77f00563-1ce6-4fba-9d4e-c8ce83f716ac","shared_citers":15},{"title":"ImgEdit: A Unified Image Editing Dataset and Benchmark","work_id":"059b5c3a-404c-4d30-a631-68c1d88a08a7","shared_citers":14},{"title":"SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis","work_id":"8034c587-fba6-4941-87ba-c98f2ac962cb","shared_citers":12},{"title":"Qwen3 Technical Report","work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e","shared_citers":11},{"title":"Seedream 4.0: Toward Next-generation Multimodal Image Generation","work_id":"15c839a0-48a3-4218-82b6-cac5b7f66e13","shared_citers":11},{"title":"Show-o: One Single Transformer to Unify Multimodal Understanding and Generation","work_id":"1393dc24-a6b2-44e1-b5d7-7009d1fa4811","shared_citers":11},{"title":"ELLA: Equip Diffusion Models with LLM for Enhanced Semantic Alignment","work_id":"94248955-4bc5-4517-98a0-66224a36d865","shared_citers":10},{"title":"Wan: Open and Advanced Large-Scale Video Generative Models","work_id":"ad3ebc3b-4224-46c9-b61d-bcf135da0a7c","shared_citers":10},{"title":"arXiv preprint arXiv:2510.06679 (2025)","work_id":"caa956fb-4bc6-4cba-ab91-329773bba8a1","shared_citers":9},{"title":"Prompt-to-Prompt Image Editing with Cross Attention Control","work_id":"196f7eef-d65a-47e4-b815-9a188f6aedcf","shared_citers":9},{"title":"Qwen3-VL Technical Report","work_id":"1fe243aa-e3c0-4da6-b391-4cbcfc88d5c0","shared_citers":9},{"title":"arXiv preprint arXiv:2505.22705 (2025)","work_id":"68d4c0f7-3dfd-438d-a823-6a93fd0a835d","shared_citers":8},{"title":"Chameleon: Mixed-Modal Early-Fusion Foundation Models","work_id":"2661b9a6-25cc-41a1-8100-612d2b801289","shared_citers":8},{"title":"Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities","work_id":"008df105-2fdd-45d8-857a-8e35868aecb6","shared_citers":8},{"title":"GPT-4 Technical Report","work_id":"b928e041-6991-4c08-8c81-0359e4097c7b","shared_citers":8}],"time_series":[{"n":3,"year":2025},{"n":44,"year":2026}],"dependency_candidates":[]},"error":null,"updated_at":"2026-05-14T13:51:08.711288+00:00"},"identity_refresh":{"job_type":"identity_refresh","status":"succeeded","result":{"items":[{"title":"Qwen3 Technical Report","outcome":"unchanged","work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e","resolver":"local_arxiv","confidence":0.98,"old_work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e"}],"counts":{"fixed":0,"merged":0,"unchanged":1,"quarantined":0,"needs_external_resolution":0},"errors":[],"attempted":1},"error":null,"updated_at":"2026-05-14T13:51:11.572611+00:00"},"summary_claims":{"job_type":"summary_claims","status":"succeeded","result":{"title":"OmniGen2: Towards Instruction-Aligned Multimodal Generation","claims":[{"claim_text":"In this work, we introduce OmniGen2, a versatile and open-source generative model designed to provide a unified solution for diverse generation tasks, including text-to-image, image editing, and in-context generation. Unlike OmniGen v1, OmniGen2 features two distinct decoding pathways for text and image modalities, utilizing unshared parameters and a decoupled image tokenizer. This design enables OmniGen2 to build upon existing multimodal understanding models without the need to re-adapt VAE inputs, thereby preserving the original text generation capabilities. To facilitate the training of Omn","claim_type":"abstract","evidence_strength":"source_metadata"}],"why_cited":"Pith tracks OmniGen2: Towards Instruction-Aligned Multimodal Generation because it crossed a citation-hub threshold.","role_counts":[]},"error":null,"updated_at":"2026-05-14T13:51:21.508645+00:00"}},"summary":{"title":"OmniGen2: Towards Instruction-Aligned Multimodal Generation","claims":[{"claim_text":"In this work, we introduce OmniGen2, a versatile and open-source generative model designed to provide a unified solution for diverse generation tasks, including text-to-image, image editing, and in-context generation. Unlike OmniGen v1, OmniGen2 features two distinct decoding pathways for text and image modalities, utilizing unshared parameters and a decoupled image tokenizer. This design enables OmniGen2 to build upon existing multimodal understanding models without the need to re-adapt VAE inputs, thereby preserving the original text generation capabilities. To facilitate the training of Omn","claim_type":"abstract","evidence_strength":"source_metadata"}],"why_cited":"Pith tracks OmniGen2: Towards Instruction-Aligned Multimodal Generation because it crossed a citation-hub threshold.","role_counts":[]},"graph":{"co_cited":[{"title":"Emerging Properties in Unified Multimodal Pretraining","work_id":"e0cfd82c-f5d4-44fd-b531-ec73ab0a805b","shared_citers":32},{"title":"Qwen-Image Technical Report","work_id":"d06d7ecc-7579-4f89-a60b-4278a0f3c562","shared_citers":28},{"title":"Step1X-Edit: A Practical Framework for General Image Editing","work_id":"3392f2c8-a1cb-4d6c-8c82-2cdccffa33f9","shared_citers":26},{"title":"UniWorld-V1: High-Resolution Semantic Encoders for Unified Visual Understanding and Generation","work_id":"488a273e-95d8-46f1-87c7-2244068d00d0","shared_citers":23},{"title":"BLIP3-o: A Family of Fully Open Unified Multimodal Models-Architecture, Training and Dataset","work_id":"86d896d2-592f-4d9b-938e-dfeb11f9388f","shared_citers":20},{"title":"Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling","work_id":"67d9e391-26d1-459e-ab56-07e60511c886","shared_citers":20},{"title":"FLUX.1 Kontext: Flow Matching for In-Context Image Generation and Editing in Latent Space","work_id":"5dfe19d5-3541-4803-8fe9-3c8b9e29b281","shared_citers":19},{"title":"Emu3: Next-Token Prediction is All You Need","work_id":"720d288e-fac0-464c-9929-19efd9a52afc","shared_citers":18},{"title":"Qwen2.5-VL Technical Report","work_id":"69dffacb-bfe8-442d-be86-48624c60426f","shared_citers":16},{"title":"Show-o2: Improved Native Unified Multimodal Models","work_id":"77f00563-1ce6-4fba-9d4e-c8ce83f716ac","shared_citers":15},{"title":"ImgEdit: A Unified Image Editing Dataset and Benchmark","work_id":"059b5c3a-404c-4d30-a631-68c1d88a08a7","shared_citers":14},{"title":"SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis","work_id":"8034c587-fba6-4941-87ba-c98f2ac962cb","shared_citers":12},{"title":"Qwen3 Technical Report","work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e","shared_citers":11},{"title":"Seedream 4.0: Toward Next-generation Multimodal Image Generation","work_id":"15c839a0-48a3-4218-82b6-cac5b7f66e13","shared_citers":11},{"title":"Show-o: One Single Transformer to Unify Multimodal Understanding and Generation","work_id":"1393dc24-a6b2-44e1-b5d7-7009d1fa4811","shared_citers":11},{"title":"ELLA: Equip Diffusion Models with LLM for Enhanced Semantic Alignment","work_id":"94248955-4bc5-4517-98a0-66224a36d865","shared_citers":10},{"title":"Wan: Open and Advanced Large-Scale Video Generative Models","work_id":"ad3ebc3b-4224-46c9-b61d-bcf135da0a7c","shared_citers":10},{"title":"arXiv preprint arXiv:2510.06679 (2025)","work_id":"caa956fb-4bc6-4cba-ab91-329773bba8a1","shared_citers":9},{"title":"Prompt-to-Prompt Image Editing with Cross Attention Control","work_id":"196f7eef-d65a-47e4-b815-9a188f6aedcf","shared_citers":9},{"title":"Qwen3-VL Technical Report","work_id":"1fe243aa-e3c0-4da6-b391-4cbcfc88d5c0","shared_citers":9},{"title":"arXiv preprint arXiv:2505.22705 (2025)","work_id":"68d4c0f7-3dfd-438d-a823-6a93fd0a835d","shared_citers":8},{"title":"Chameleon: Mixed-Modal Early-Fusion Foundation Models","work_id":"2661b9a6-25cc-41a1-8100-612d2b801289","shared_citers":8},{"title":"Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities","work_id":"008df105-2fdd-45d8-857a-8e35868aecb6","shared_citers":8},{"title":"GPT-4 Technical Report","work_id":"b928e041-6991-4c08-8c81-0359e4097c7b","shared_citers":8}],"time_series":[{"n":3,"year":2025},{"n":44,"year":2026}],"dependency_candidates":[]},"authors":[]}}