HunyuanImage 3.0 Technical Report
17 Pith papers cite this work. Polarity classification is still indexing.
Citation summary: fields cs.CV (17); verdicts UNVERDICTED (17); role/polarity baseline (1 citing paper).
Citing papers
-
Towards Realistic 3D Emission Materials: Dataset, Baseline, and Evaluation for Emission Texture Generation
The work creates the first dataset and baseline for generating emission textures on 3D objects to reproduce glowing materials from input images.
-
UniEditBench: A Unified and Cost-Effective Benchmark for Image and Video Editing via Distilled MLLMs
UniEditBench unifies image and video editing evaluation with a nine-plus-eight operation taxonomy and cost-effective 4B/8B distilled MLLM evaluators that align with human judgments.
-
Banana100: Breaking NR-IQA Metrics by 100 Iterative Image Replications with Nano Banana Pro
Banana100 dataset shows that none of 21 popular NR-IQA metrics consistently rate images degraded by 100 iterative edits lower than clean originals.
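To make the failure mode concrete, here is a minimal sketch of the kind of check the benchmark implies: re-edit an image many times, score each copy with a no-reference IQA metric, and test whether every degraded copy is rated below the clean original. The `edit_model` and `nr_iqa` callables are hypothetical placeholders, not the Banana100 code.

```python
# Illustrative sketch only: iteratively re-edit an image and check whether a
# no-reference IQA metric still ranks every degraded copy below the original.
# `edit_model` and `nr_iqa` are hypothetical stand-ins, not the paper's API.
from PIL import Image

def metric_survives_iterative_edits(original: Image.Image, edit_model, nr_iqa,
                                    prompt: str = "replicate this image",
                                    n_iters: int = 100) -> bool:
    """True if the metric rates every re-edited copy strictly below the clean original."""
    clean_score = nr_iqa(original)              # assume higher score = better quality
    current = original
    for _ in range(n_iters):
        current = edit_model(current, prompt)   # one replication/editing round
        if nr_iqa(current) >= clean_score:      # degraded copy rated as good or better
            return False                        # the metric fails the consistency check
    return True
```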
-
Qwen-Image-VAE-2.0 Technical Report
Qwen-Image-VAE-2.0 achieves state-of-the-art high-compression image reconstruction and superior diffusability for diffusion models, with a new text-rich document benchmark.
-
D-OPSD: On-Policy Self-Distillation for Continuously Tuning Step-Distilled Diffusion Models
D-OPSD enables continuous supervised fine-tuning of few-step diffusion models via on-policy self-distillation where the model acts as both teacher (multimodal context) and student (text-only context) on its own roll-outs.
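As a rough illustration of the teacher/student split described above, the sketch below scores the same network under two conditioning contexts on its own roll-outs; the method names (`sample`, `encode_context`, `denoise`) are assumptions for illustration, and the real objective and noise schedule will differ.

```python
# Illustrative sketch of on-policy self-distillation: the same few-step model acts
# as teacher (multimodal context: text plus its own roll-out) and as student
# (text-only context), and the student is pulled toward the teacher's prediction.
# All method names are hypothetical placeholders, not the paper's implementation.
import torch
import torch.nn.functional as F

def opsd_step(model, optimizer, prompts, num_steps=4):
    with torch.no_grad():
        rollout = model.sample(prompts, num_steps=num_steps)         # on-policy roll-out
        teacher_ctx = model.encode_context(prompts, images=rollout)  # multimodal context
        teacher_pred = model.denoise(rollout, context=teacher_ctx)   # teacher target

    student_ctx = model.encode_context(prompts, images=None)         # text-only context
    student_pred = model.denoise(rollout, context=student_ctx)       # student prediction

    loss = F.mse_loss(student_pred, teacher_pred)                    # distillation loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```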
-
Refinement via Regeneration: Enlarging Modification Space Boosts Image Refinement in Unified Multimodal Models
Refinement via Regeneration (RvR) reformulates image refinement in unified multimodal models as conditional regeneration using prompt and semantic tokens from the initial image, yielding higher alignment scores than editing-based methods.
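The regeneration idea can be sketched as conditioning a fresh generation on the prompt plus semantic tokens extracted from the first-pass image, rather than editing pixels in place; the `unified_model` interface below is an assumed placeholder, not RvR's actual API.

```python
# Illustrative sketch of refinement via regeneration: compress the initial image to
# semantic tokens, then regenerate from scratch conditioned on [prompt; semantics],
# which enlarges the modification space compared with local editing.
# `unified_model` and its methods are hypothetical placeholders.
import torch

def refine_via_regeneration(unified_model, prompt: str, init_image: torch.Tensor) -> torch.Tensor:
    prompt_emb = unified_model.embed_text(prompt)              # [B, L_text, D]
    semantic_emb = unified_model.encode_semantics(init_image)  # [B, L_img, D], coarse content
    condition = torch.cat([prompt_emb, semantic_emb], dim=1)   # joint conditioning sequence
    return unified_model.generate(condition)                   # full regeneration, not a local edit
```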
-
Extending One-Step Image Generation from Class Labels to Text via Discriminative Text Representation
The work finds that highly discriminative LLM text features are required, and uses them to enable the first effective one-step text-conditioned image generation with MeanFlow.
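One-step sampling in a MeanFlow-style model reduces to a single evaluation of the average-velocity network over the whole time interval; the sketch below conditions that step on an LLM text embedding. `mean_velocity_net` and `text_encoder` are assumed placeholders.

```python
# Illustrative sketch of one-step text-conditioned sampling with a MeanFlow-style
# average-velocity network: x0 = z1 - u_theta(z1, r=0, t=1, c), with c a
# discriminative LLM text feature. Names are hypothetical placeholders.
import torch

@torch.no_grad()
def one_step_sample(mean_velocity_net, text_encoder, prompt: str,
                    shape=(1, 3, 256, 256)) -> torch.Tensor:
    c = text_encoder(prompt)                 # discriminative text embedding
    z1 = torch.randn(shape)                  # sample from the Gaussian prior
    r = torch.zeros(shape[0])                # interval start r = 0
    t = torch.ones(shape[0])                 # interval end t = 1
    u = mean_velocity_net(z1, r, t, c)       # average velocity over [0, 1]
    return z1 - u                            # single-step transport to the image
```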
-
SpatialEdit: Benchmarking Fine-Grained Image Spatial Editing
SpatialEdit provides a benchmark, large synthetic dataset, and baseline model for precise object and camera spatial manipulations in images, with the model outperforming prior methods on spatial editing.
-
Gen-Searcher: Reinforcing Agentic Search for Image Generation
Gen-Searcher is the first search-augmented image generation agent trained with SFT followed by agentic RL using dual text and image rewards on custom datasets and the KnowGen benchmark.
-
SenseNova-U1: Unifying Multimodal Understanding and Generation with NEO-unify Architecture
SenseNova-U1 presents native unified multimodal models that match top understanding VLMs while delivering strong performance in image generation, infographics, and interleaved tasks via the NEO-unify architecture.
-
InsHuman: Towards Natural and Identity-Preserving Human Insertion
InsHuman proposes Human-Background Adaptive Fusion, Face-to-Face ID-Preserving, and Bidirectional Data Pairing to enable natural human insertion in images without altering identity.
-
Visual Generation in the New Era: An Evolution from Atomic Mapping to Agentic World Modeling
Visual generation models are evolving from passive renderers to interactive agentic world modelers, but current systems lack spatial reasoning, temporal consistency, and causal understanding, with evaluations overemphasizing perceptual quality.
-
Z-Image: An Efficient Image Generation Foundation Model with Single-Stream Diffusion Transformer
Z-Image is an efficient 6B-parameter foundation model for image generation that rivals larger commercial systems in photorealism and bilingual text rendering through a new single-stream diffusion transformer and streamlined training.
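"Single-stream" is commonly read as text and image tokens sharing one sequence and one set of transformer weights instead of parallel streams with cross-attention; the block below sketches that reading under assumed dimensions, omits timestep conditioning, and is not Z-Image's actual architecture.

```python
# Illustrative single-stream DiT-style block: text and image tokens are concatenated
# into one sequence and processed by shared self-attention and MLP weights.
# This reflects a common reading of "single-stream", not Z-Image's released design;
# timestep/AdaLN conditioning is omitted for brevity.
import torch
import torch.nn as nn

class SingleStreamBlock(nn.Module):
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, text_tokens: torch.Tensor, image_tokens: torch.Tensor):
        x = torch.cat([text_tokens, image_tokens], dim=1)   # one joint token sequence
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]   # shared self-attention
        x = x + self.mlp(self.norm2(x))
        n_text = text_tokens.shape[1]
        return x[:, :n_text], x[:, n_text:]                 # split back per modality
```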
-
Qwen-Image-2.0 Technical Report
Qwen-Image-2.0 unifies high-fidelity image generation and precise editing by coupling Qwen3-VL with a Multimodal Diffusion Transformer, improving text rendering, photorealism, and complex prompt following over prior versions.
-
Tstars-Tryon 1.0: Robust and Realistic Virtual Try-On for Diverse Fashion Items
Tstars-Tryon 1.0 is a deployed virtual try-on system claiming high robustness, photorealism, multi-reference flexibility, and near real-time speed for diverse fashion items.
-
Can Nano Banana 2 Replace Traditional Image Restoration Models? An Evaluation of Its Performance on Image Restoration Tasks
Nano Banana 2 delivers competitive perceptual quality on image restoration but produces over-enhanced results that diverge from input fidelity in ways standard metrics miss.
-
Wan-Image: Pushing the Boundaries of Generative Visual Intelligence
Wan-Image is a unified multi-modal system that integrates LLMs and diffusion transformers to deliver professional-grade image generation features including complex typography, multi-subject consistency, and precise editing, outperforming several prior models in human tests.