Anole: An open, autoregressive, native large multimodal models for interleaved image-text generation
11 Pith papers cite this work.
citing papers explorer
- DeepLatent: Think with Images via Parallel Latent Visual Reasoning
  DeepLatent introduces a parallel latent visual reasoning framework with learnable 2D tokens and continuous RL, trained via distillation then RL, plus a new 180K dataset, claiming SOTA benchmark results.
- ETCHR: Editing To Clarify and Harness Reasoning
  A decoupled question-conditioned image editor trained via supervised imitation then VLM-reward enhancement improves MLLM visual reasoning Pass@1 by 4.6-5.5 points across models and tasks.
- ATLAS: Agentic or Latent Visual Reasoning? One Word is Enough for Both
  ATLAS uses a single functional token to unify agentic and latent visual reasoning without image generation or external execution.
- Omni-R1: Towards the Unified Generative Paradigm for Multimodal Reasoning
  Omni-R1 unifies multimodal reasoning by generating intermediate images during the process in an SFT-plus-RL framework, with an Omni-R1-Zero variant that matches or exceeds it using only text data.
- VVS: Accelerating Speculative Decoding for Visual Autoregressive Generation via Partial Verification Skipping
  VVS accelerates visual AR image generation by partially skipping verifications in speculative decoding, achieving 2.8x fewer target forward passes while preserving competitive quality.
- CASCADE: Context-Aware Relaxation for Speculative Image Decoding
  CASCADE formalizes semantic interchangeability and convergence in target model representations to enable context-aware acceptance relaxation in tree-based speculative decoding, delivering up to 3.6x speedup on text-to-image models without quality loss.
- Mull-Tokens: Modality-Agnostic Latent Thinking
  Mull-Tokens are modality-agnostic latent tokens that enable free-form multimodal thinking and deliver up to 16% gains on spatial reasoning benchmarks.
- GoViG: Goal-Conditioned Visual Navigation Instruction Generation via Multimodal Reasoning
  GoViG decomposes goal-conditioned navigation instruction generation into visual state prediction and instruction synthesis using an autoregressive multimodal LLM with one-pass and interleaved reasoning, showing gains on a new R2R-Goal dataset.
- Imagine while Reasoning in Space: Multimodal Visualization-of-Thought
  MVoT lets multimodal models create coherent images during chain-of-thought reasoning via a token discrepancy loss, yielding competitive or better results than text-only CoT on dynamic spatial tasks.
- SenseNova-U1: Unifying Multimodal Understanding and Generation with NEO-unify Architecture
  SenseNova-U1 presents native unified multimodal models that match top understanding VLMs while delivering strong performance in image generation, infographics, and interleaved tasks via the NEO-unify architecture.
- Illuminating Unified Multimodal Model for Free-form Interleaved Text-Image Generation
  ILLUME-X is a unified multimodal model that generates free-form interleaved text-image sequences via an expanded data pipeline, progressive self-adaptive training, and ILScore evaluation, claiming outperformance over prior unified models on style transfer, image decomposition, and storytelling.