Vision encoders on single 2D molecular images with a chemistry-informed curriculum achieve top or near-top results on 10 property prediction tasks at 80x lower FLOPs than multi-modal competitors.
hub
Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling
57 Pith papers cite this work. Polarity classification is still indexing.
abstract
In this work, we introduce Janus-Pro, an advanced version of the previous work Janus. Specifically, Janus-Pro incorporates (1) an optimized training strategy, (2) expanded training data, and (3) scaling to larger model size. With these improvements, Janus-Pro achieves significant advancements in both multimodal understanding and text-to-image instruction-following capabilities, while also enhancing the stability of text-to-image generation. We hope this work will inspire further exploration in the field. Code and models are publicly available.
hub tools
citation-role summary
citation-polarity summary
claims ledger
- abstract In this work, we introduce Janus-Pro, an advanced version of the previous work Janus. Specifically, Janus-Pro incorporates (1) an optimized training strategy, (2) expanded training data, and (3) scaling to larger model size. With these improvements, Janus-Pro achieves significant advancements in both multimodal understanding and text-to-image instruction-following capabilities, while also enhancing the stability of text-to-image generation. We hope this work will inspire further exploration in the field. Code and models are publicly available.
co-cited works
roles
background 2polarities
background 2representative citing papers
Flow-GRPO is the first online RL method for flow matching models, raising GenEval accuracy from 63% to 95% and text-rendering accuracy from 59% to 92% with little reward hacking.
ImageAttributionBench is a benchmark dataset demonstrating that state-of-the-art image attribution methods lack robustness to image degradation and fail to generalize to semantically disjoint domains.
G²TR reduces visual tokens and prefill computation by 1.94x in separate-encoder UMMs via generation-guided importance from VAE latent consistency while preserving reasoning accuracy and editing quality.
INSET embeds images as native tokens in interleaved instructions, outperforming prior methods on multi-image consistency and text alignment as complexity grows.
Uni-AdGen uses a unified autoregressive framework with foreground perception, instruction tuning, and coarse-to-fine preference modules to generate personalized image-text ads from noisy user behaviors, outperforming baselines on a new PAd1M dataset.
UniPath adaptively models coordination-path diversity in unified multimodal models by training a path-conditioned executor and using a lightweight planner for input-dependent selection, improving performance over fixed strategies.
XTC-Bench reveals that strong performance on generation or understanding tasks in unified multimodal models does not guarantee cross-task semantic consistency, which instead depends on how tightly coupled the learning objectives are across modalities.
Image editing models fail zero-shot visual planning on abstract mazes and queen puzzles but generalize after finetuning, yet still cannot match human zero-shot efficiency.
Fine-tuning multimodal models on a new synthetic spatial benchmark improves generative spatial compliance on real and synthetic tasks and transfers to better spatial understanding.
StepSTEM benchmark and dynamic-programming step alignment show top MLLMs achieve only 38.29% accuracy on graduate STEM tasks requiring interleaved cross-modal reasoning.
3D-ARD+ unifies autoregressive token prediction with diffusion-based 3D latent generation to co-produce indoor scene layouts and object geometries that follow complex text-specified spatial and semantic constraints.
Unified multimodal models exhibit pseudo-unification due to modality-asymmetric entropy encoding and pattern-split responses between text and image generation.
VLA-World improves autonomous driving by using action-guided future image generation followed by reflective reasoning over the imagined scene to refine trajectories.
V2V-Zero adapts frozen VLMs for visual conditioning via hidden states from specification pages, scoring 0.85 on GenEval and 32.7 on a new seven-task benchmark while revealing capability hierarchies in attribute binding and structural control.
Policy entropy remains constant in flow-matching models during RLHF due to fixed noise schedules while perceptual diversity collapses from mode-seeking policy gradients, so perceptual entropy constraints are introduced to preserve diversity and improve quality.
A pixel-space Diffusion Transformer with Unified Transformer architecture unifies image generation, editing, and personalization in an end-to-end model that maps all inputs to a shared token space and scales from 8B to over 200B parameters.
Auto-Rubric as Reward externalizes VLM preferences into structured rubrics and applies Rubric Policy Optimization to create more reliable binary rewards for multimodal generation, outperforming pairwise models on text-to-image and editing benchmarks.
SCOPE maintains semantic commitments via structured specifications and conditional skill orchestration, achieving 0.60 EGIP on the new Gen-Arena benchmark while outperforming baselines on WISE-V and MindBench.
STARFlow2 presents an autoregressive flow-based architecture for unified multimodal text-image generation by interleaving a VLM stream with a TarFlow stream via residual skips and a unified latent space.
CASCADE formalizes semantic interchangeability and convergence in target model representations to enable context-aware acceptance relaxation in tree-based speculative decoding, delivering up to 3.6x speedup on text-to-image models without quality loss.
MUSE decouples reconstruction and semantic learning in visual tokenization via topological orthogonality, yielding SOTA generation quality and improved semantic performance over its teacher model.
Edit-R1 trains a CoT-based reasoning reward model with GCPO and uses it to boost image editing performance over VLMs and models like FLUX.1-kontext via GRPO.
SpatialFusion internalizes 3D geometric awareness into unified image generation models by pairing an MLLM with a spatial transformer that produces depth maps to constrain diffusion generation.
citing papers explorer
-
Hallucination of Multimodal Large Language Models: A Survey
The survey organizes causes of hallucinations in MLLMs, reviews evaluation benchmarks and metrics, and outlines mitigation approaches plus open questions.