Anole: An open, autoregressive, native large multimodal models for interleaved image-text generation

Chern, E · 2024 · arXiv 2407.06135

11 Pith papers cite this work. Polarity classification is still indexing.

11 Pith papers citing it

read on arXiv browse 11 citing papers

citation-role summary

background 1 baseline 1

citation-polarity summary

background 1 baseline 1

representative citing papers

DeepLatent: Think with Images via Parallel Latent Visual Reasoning

cs.CV · 2026-05-30 · unverdicted · novelty 7.0

DeepLatent introduces a parallel latent visual reasoning framework with learnable 2D tokens and continuous RL, trained via distillation then RL, plus a new 180K dataset, claiming SOTA benchmark results.

ETCHR: Editing To Clarify and Harness Reasoning

cs.CV · 2026-05-22 · unverdicted · novelty 7.0

A decoupled question-conditioned image editor trained via supervised imitation then VLM-reward enhancement improves MLLM visual reasoning Pass@1 by 4.6-5.5 points across models and tasks.

ATLAS: Agentic or Latent Visual Reasoning? One Word is Enough for Both

cs.CV · 2026-05-14 · unverdicted · novelty 7.0

ATLAS uses a single functional token to unify agentic and latent visual reasoning without image generation or external execution.

Omni-R1: Towards the Unified Generative Paradigm for Multimodal Reasoning

cs.AI · 2026-01-14 · unverdicted · novelty 7.0

Omni-R1 unifies multimodal reasoning by generating intermediate images during the process in a SFT-plus-RL framework, with an Omni-R1-Zero variant that matches or exceeds it using only text data.

VVS: Accelerating Speculative Decoding for Visual Autoregressive Generation via Partial Verification Skipping

cs.CV · 2025-11-17 · conditional · novelty 7.0

VVS accelerates visual AR image generation by partially skipping verifications in speculative decoding, achieving 2.8x fewer target forward passes while preserving competitive quality.

CASCADE: Context-Aware Relaxation for Speculative Image Decoding

cs.CV · 2026-05-08 · unverdicted · novelty 6.0

CASCADE formalizes semantic interchangeability and convergence in target model representations to enable context-aware acceptance relaxation in tree-based speculative decoding, delivering up to 3.6x speedup on text-to-image models without quality loss.

Mull-Tokens: Modality-Agnostic Latent Thinking

cs.CV · 2025-12-11 · unverdicted · novelty 6.0

Mull-Tokens are modality-agnostic latent tokens that enable free-form multimodal thinking and deliver up to 16% gains on spatial reasoning benchmarks.

GoViG: Goal-Conditioned Visual Navigation Instruction Generation via Multimodal Reasoning

cs.CV · 2025-08-13 · unverdicted · novelty 6.0

GoViG decomposes goal-conditioned navigation instruction generation into visual state prediction and instruction synthesis using an autoregressive multimodal LLM with one-pass and interleaved reasoning, showing gains on a new R2R-Goal dataset.

Imagine while Reasoning in Space: Multimodal Visualization-of-Thought

cs.CL · 2025-01-13 · unverdicted · novelty 6.0

MVoT lets multimodal models create coherent images during chain-of-thought reasoning via a token discrepancy loss, yielding competitive or better results than text-only CoT on dynamic spatial tasks.

SenseNova-U1: Unifying Multimodal Understanding and Generation with NEO-unify Architecture

cs.CV · 2026-05-12 · unverdicted · novelty 5.0

SenseNova-U1 presents native unified multimodal models that match top understanding VLMs while delivering strong performance in image generation, infographics, and interleaved tasks via the NEO-unify architecture.

Illuminating Unified Multimodal Model for Free-form Interleaved Text-Image Generation

cs.CV · 2026-06-29 · unverdicted · novelty 4.0

ILLUME-X is a unified multimodal model that generates free-form interleaved text-image sequences via an expanded data pipeline, progressive self-adaptive training, and ILScore evaluation, claiming outperformance over prior unified models on style transfer, image decomposition, and storytelling.

citing papers explorer

Showing 11 of 11 citing papers.

DeepLatent: Think with Images via Parallel Latent Visual Reasoning cs.CV · 2026-05-30 · unverdicted · none · ref 89
DeepLatent introduces a parallel latent visual reasoning framework with learnable 2D tokens and continuous RL, trained via distillation then RL, plus a new 180K dataset, claiming SOTA benchmark results.
ETCHR: Editing To Clarify and Harness Reasoning cs.CV · 2026-05-22 · unverdicted · none · ref 3
A decoupled question-conditioned image editor trained via supervised imitation then VLM-reward enhancement improves MLLM visual reasoning Pass@1 by 4.6-5.5 points across models and tasks.
ATLAS: Agentic or Latent Visual Reasoning? One Word is Enough for Both cs.CV · 2026-05-14 · unverdicted · none · ref 3
ATLAS uses a single functional token to unify agentic and latent visual reasoning without image generation or external execution.
Omni-R1: Towards the Unified Generative Paradigm for Multimodal Reasoning cs.AI · 2026-01-14 · unverdicted · none · ref 8
Omni-R1 unifies multimodal reasoning by generating intermediate images during the process in a SFT-plus-RL framework, with an Omni-R1-Zero variant that matches or exceeds it using only text data.
VVS: Accelerating Speculative Decoding for Visual Autoregressive Generation via Partial Verification Skipping cs.CV · 2025-11-17 · conditional · none · ref 6
VVS accelerates visual AR image generation by partially skipping verifications in speculative decoding, achieving 2.8x fewer target forward passes while preserving competitive quality.
CASCADE: Context-Aware Relaxation for Speculative Image Decoding cs.CV · 2026-05-08 · unverdicted · none · ref 10
CASCADE formalizes semantic interchangeability and convergence in target model representations to enable context-aware acceptance relaxation in tree-based speculative decoding, delivering up to 3.6x speedup on text-to-image models without quality loss.
Mull-Tokens: Modality-Agnostic Latent Thinking cs.CV · 2025-12-11 · unverdicted · none · ref 13
Mull-Tokens are modality-agnostic latent tokens that enable free-form multimodal thinking and deliver up to 16% gains on spatial reasoning benchmarks.
GoViG: Goal-Conditioned Visual Navigation Instruction Generation via Multimodal Reasoning cs.CV · 2025-08-13 · unverdicted · none · ref 3
GoViG decomposes goal-conditioned navigation instruction generation into visual state prediction and instruction synthesis using an autoregressive multimodal LLM with one-pass and interleaved reasoning, showing gains on a new R2R-Goal dataset.
Imagine while Reasoning in Space: Multimodal Visualization-of-Thought cs.CL · 2025-01-13 · unverdicted · none · ref 5
MVoT lets multimodal models create coherent images during chain-of-thought reasoning via a token discrepancy loss, yielding competitive or better results than text-only CoT on dynamic spatial tasks.
SenseNova-U1: Unifying Multimodal Understanding and Generation with NEO-unify Architecture cs.CV · 2026-05-12 · unverdicted · none · ref 21
SenseNova-U1 presents native unified multimodal models that match top understanding VLMs while delivering strong performance in image generation, infographics, and interleaved tasks via the NEO-unify architecture.
Illuminating Unified Multimodal Model for Free-form Interleaved Text-Image Generation cs.CV · 2026-06-29 · unverdicted · none · ref 9
ILLUME-X is a unified multimodal model that generates free-form interleaved text-image sequences via an expanded data pipeline, progressive self-adaptive training, and ILScore evaluation, claiming outperformance over prior unified models on style transfer, image decomposition, and storytelling.

Anole: An open, autoregressive, native large multimodal models for interleaved image-text generation

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer