hub Baseline reference

LongCat-Image Technical Report

Meituan LongCat Team: Hanghang Ma, Haoxian Tan, Jiale Huang, Junqiang Wu, Jun-Yan He, Lishuai Gao · 2025 · cs.CV · arXiv 2512.07584

Baseline reference. 83% of citing Pith papers use this work as a benchmark or comparison.

36 Pith papers citing it

Baseline 83% of classified citations

open full Pith review browse 36 citing papers arXiv PDF

abstract

We introduce LongCat-Image, a pioneering open-source and bilingual (Chinese-English) foundation model for image generation, designed to address core challenges in multilingual text rendering, photorealism, deployment efficiency, and developer accessibility prevalent in current leading models. 1) We achieve this through rigorous data curation strategies across the pre-training, mid-training, and SFT stages, complemented by the coordinated use of curated reward models during the RL phase. This strategy establishes the model as a new state-of-the-art (SOTA), delivering superior text-rendering capabilities and remarkable photorealism, and significantly enhancing aesthetic quality. 2) Notably, it sets a new industry standard for Chinese character rendering. By supporting even complex and rare characters, it outperforms both major open-source and commercial solutions in coverage, while also achieving superior accuracy. 3) The model achieves remarkable efficiency through its compact design. With a core diffusion model of only 6B parameters, it is significantly smaller than the nearly 20B or larger Mixture-of-Experts (MoE) architectures common in the field. This ensures minimal VRAM usage and rapid inference, significantly reducing deployment costs. Beyond generation, LongCat-Image also excels in image editing, achieving SOTA results on standard benchmarks with superior editing consistency compared to other open-source works. 4) To fully empower the community, we have established the most comprehensive open-source ecosystem to date. We are releasing not only multiple model versions for text-to-image and image editing, including checkpoints after mid-training and post-training stages, but also the entire toolchain of training procedure. We believe that the openness of LongCat-Image will provide robust support for developers and researchers, pushing the frontiers of visual content creation.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

baseline 5 background 1

citation-polarity summary

baseline 5 background 1

representative citing papers

Continuous-Time Distribution Matching for Few-Step Diffusion Distillation

cs.CV · 2026-05-07 · unverdicted · novelty 8.0

CDM migrates distribution matching distillation to continuous time via dynamic random-length schedules and active off-trajectory latent alignment, yielding competitive few-step image fidelity on SD3 and Longcat-Image.

OrthoTryOn: Geometric Orthogonalization for Conflict-Free Unified Fashion Generation

cs.CV · 2026-06-26 · unverdicted · novelty 7.0

OrthoTryOn uses Orthogonal Subspace Projection on shared LoRA and Fisher-guided Negative Guidance to enable conflict-free unified fashion generation, outperforming task-specific models on benchmarks.

InterleaveThinker: Reinforcing Agentic Interleaved Generation

cs.CV · 2026-06-11 · unverdicted · novelty 7.0

InterleaveThinker is the first multi-agent pipeline enabling interleaved generation in any image generator through planner-critic agents, SFT on custom datasets, and GRPO RL with accuracy and step-wise rewards.

Beyond Absolute Scores: Relative Edit-induced Difference for Generalizable Image Aesthetic Assessment

cs.CV · 2026-06-04 · unverdicted · novelty 7.0 · 2 refs

RED-Aes learns aesthetic changes from edit-induced image pairs and a new RED-20k dataset via three-stage relative ranking training, claiming SOTA generalization over absolute MOS regression.

Edit-Compass & EditReward-Compass: A Unified Benchmark for Image Editing and Reward Modeling

cs.CV · 2026-05-13 · unverdicted · novelty 7.0

Edit-Compass and EditReward-Compass are new unified benchmarks for fine-grained image editing evaluation and realistic reward modeling in reinforcement learning optimization.

RevealLayer: Disentangling Hidden and Visible Layers via Occlusion-Aware Image Decomposition

cs.CV · 2026-05-12 · unverdicted · novelty 7.0

RevealLayer decomposes natural images into multiple RGBA layers using diffusion models with region-aware attention, occlusion-guided adaptation, and a composite loss, outperforming prior methods on a new benchmark dataset.

Gen-Searcher: Reinforcing Agentic Search for Image Generation

cs.CV · 2026-03-30 · unverdicted · novelty 7.0 · 2 refs

Gen-Searcher is the first trained search-augmented image generation agent using SFT followed by GRPO reinforcement learning with dual text-image rewards, delivering 15-16 point gains on knowledge-intensive benchmarks.

COMFYCLAW: Self-Evolving Skill Harnesses for Image Generation Workflows

cs.AI · 2026-07-02 · unverdicted · novelty 6.0

COMFYCLAW introduces skill evolution via graph editing, automatic reversion, VLM verification, and distillation of runs into reusable Agent Skills, achieving higher average scores than a verifier-only baseline across benchmarks.

MirrorPPR: Exemplar-Based Portrait Photo Retouching

cs.CV · 2026-06-28 · unverdicted · novelty 6.0

MirrorPPR extracts retouching operations from exemplar pairs via a dedicated extractor and transfers them to query images through a LoRA-adapted Diffusion Transformer, enabled by a new 47-million-pair dataset and self-augmentation for alignment.

Monocular Avatar Reconstruction via Cascaded Diffusion Priors and UV-Space Differentiable Shading

cs.CV · 2026-06-26 · unverdicted · novelty 6.0

A cascaded LoRA diffusion pipeline in UV space with cross-intrinsic attention and differentiable BRDF shading produces 4K PBR avatar assets from single images after training on under 100 scans.

TextWand: A Unified Framework for Scene Text Editing

cs.CV · 2026-06-04 · unverdicted · novelty 6.0

TextWand unifies scene text removal, generation and replacement via rendering/erasure decomposition, ORPE for layout fidelity, RAS for clean erasure, and the new TextWand-Bench dataset, claiming superior accuracy and quality over prior models.

TextSculptor: Training and Benchmarking Scene Text Editing

cs.CV · 2026-05-20 · unverdicted · novelty 6.0

TextSculptor supplies an automated data synthesis pipeline yielding 3.2M samples plus a four-task benchmark that raises open-source scene text editing performance.

OScaR: The Occam's Razor for Extreme KV Cache Quantization in LLMs and Beyond

cs.LG · 2026-05-19 · unverdicted · novelty 6.0

OScaR mitigates token norm imbalance via canalized rotation and omni-token scaling to enable near-lossless INT2 KV cache quantization with up to 3x decoding speedup and 5.3x memory reduction.

Sketch Then Paint: Hierarchical Reinforcement Learning for Diffusion Multi-Modal Large Language Models

cs.AI · 2026-05-16 · unverdicted · novelty 6.0

Proposes HT-GRPO with sketch-then-paint staged updates, prompt-conditioned importance ratios, and hierarchical credit assignment for dMLLMs, reporting gains on GenEval and DPG plus quality metrics.

Self-Prompting Diffusion Transformer for Open-Vocabulary Scene Text Editing via In-Context Learning

cs.CV · 2026-05-15 · unverdicted · novelty 6.0 · 2 refs

Self-prompting diffusion transformer uses in-context learning on self-generated prompts from the image to achieve open-vocabulary scene text editing with style consistency.

Unlocking Complex Visual Generation via Closed-Loop Verified Reasoning

cs.CV · 2026-05-14 · unverdicted · novelty 6.0

CLVR framework adds closed-loop visual verification, proxy prompt reinforcement learning, and delta-space weight merge to improve complex text-to-image generation over single-step or unverified multi-step baselines.

Early Semantic Grounding in Image Editing Models for Zero-Shot Referring Image Segmentation

cs.CV · 2026-05-13 · unverdicted · novelty 6.0

Pretrained instruction-based image editing models exhibit early foreground-background separability that enables a training-free framework for zero-shot referring image segmentation using a single denoising step.

UniCustom: Unified Visual Conditioning for Multi-Reference Image Generation

cs.CV · 2026-05-12 · unverdicted · novelty 6.0 · 2 refs

A unified visual conditioning approach fuses semantic and appearance features before VLM processing, with two-stage training and slot-wise regularization, to improve consistency in multi-reference image generation.

DynT2I-Eval: A Dynamic Evaluation Framework for Text-to-Image Models

cs.CV · 2026-05-07 · unverdicted · novelty 6.0

DynT2I-Eval creates fresh prompts via dimension decomposition and dynamic sampling to evaluate text-to-image models on text alignment, quality, and aesthetics while maintaining a stable leaderboard.

Open-Source Image Editing Models Are Zero-Shot Vision Learners

cs.CV · 2026-05-06 · unverdicted · novelty 6.0

Open-source image-editing models show competitive zero-shot performance on monocular depth, surface normals, and semantic segmentation, sometimes matching tuned models.

DDA-Thinker: Decoupled Dual-Atomic Reinforcement Learning for Reasoning-Driven Image Editing

cs.CV · 2026-04-28 · unverdicted · novelty 6.0

DDA-Thinker decouples planning from generation and applies dual-atomic RL with checklist-based rewards to boost reasoning in image editing, yielding competitive results on RISE-Bench and KRIS-Bench.

Tuna-2: Pixel Embeddings Beat Vision Encoders for Multimodal Understanding and Generation

cs.CV · 2026-04-27 · unverdicted · novelty 6.0 · 2 refs

Tuna-2 shows that direct pixel embeddings can replace vision encoders in unified multimodal models, achieving competitive generation and stronger understanding at scale.

LLaDA2.0-Uni: Unifying Multimodal Understanding and Generation with Diffusion Large Language Model

cs.CV · 2026-04-22 · unverdicted · novelty 6.0

LLaDA2.0-Uni unifies multimodal understanding and generation inside one discrete diffusion large language model with a semantic tokenizer, MoE backbone, and diffusion decoder.

FashionStylist: An Expert Knowledge-enhanced Multimodal Dataset for Fashion Understanding

cs.CV · 2026-04-10 · unverdicted · novelty 6.0

FashionStylist is an expert-annotated benchmark dataset that unifies outfit-to-item grounding, completion, and evaluation tasks for multimodal large language models in fashion.

citing papers explorer

Showing 2 of 2 citing papers after filters.

COMFYCLAW: Self-Evolving Skill Harnesses for Image Generation Workflows cs.AI · 2026-07-02 · unverdicted · none · ref 61 · internal anchor
COMFYCLAW introduces skill evolution via graph editing, automatic reversion, VLM verification, and distillation of runs into reusable Agent Skills, achieving higher average scores than a verifier-only baseline across benchmarks.
Sketch Then Paint: Hierarchical Reinforcement Learning for Diffusion Multi-Modal Large Language Models cs.AI · 2026-05-16 · unverdicted · none · ref 4 · internal anchor
Proposes HT-GRPO with sketch-then-paint staged updates, prompt-conditioned importance ratios, and hierarchical credit assignment for dMLLMs, reporting gains on GenEval and DPG plus quality metrics.

LongCat-Image Technical Report

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer