Orthus: Autoregressive interleaved image-text generation with modality-specific heads

2024 · arXiv 2412.00127

6 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background 2 baseline 1

citation-polarity summary

background 2 baseline 1

representative citing papers

EduIllustrate: Towards Scalable Automated Generation Of Multimodal Educational Content

cs.CY · 2026-04-06 · unverdicted · novelty 7.0

EduIllustrate is a benchmark of 230 K-12 STEM problems that evaluates LLMs on interleaved text-diagram generation using sequential anchoring and an 8-dimension rubric, with Gemini 3.0 Pro Preview scoring highest at 87.8%.

MAR-GRPO: Stabilized GRPO for AR-diffusion Hybrid Image Generation

cs.CV · 2026-04-08 · unverdicted · novelty 6.0

MAR-GRPO stabilizes GRPO for AR-diffusion hybrids via multi-trajectory expectation and uncertainty-based token selection, yielding better visual quality, stability, and spatial understanding than baselines.

LLaDA-V: Large Language Diffusion Models with Visual Instruction Tuning

cs.LG · 2025-05-22 · conditional · novelty 6.0

LLaDA-V is a diffusion-based multimodal large language model that reaches competitive or state-of-the-art results on visual instruction tasks while using a non-autoregressive architecture.

LongCat-Image Technical Report

cs.CV · 2025-12-08 · unverdicted · novelty 5.0

LongCat-Image delivers a compact 6B-parameter bilingual image generation model that sets new standards for Chinese character rendering accuracy and photorealism while remaining efficient and fully open-source.

Show-o2: Improved Native Unified Multimodal Models

cs.CV · 2025-06-18 · unverdicted · novelty 4.0

Show-o2 unifies text, image, and video understanding and generation in a single autoregressive-plus-flow-matching model built on 3D causal VAE representations.

WISE: A World Knowledge-Informed Semantic Evaluation for Text-to-Image Generation

cs.CV · 2025-03-10

citing papers explorer

EduIllustrate: Towards Scalable Automated Generation Of Multimodal Educational Content · cs.CY · 2026-04-06 · unverdicted · none · ref 1
MAR-GRPO: Stabilized GRPO for AR-diffusion Hybrid Image Generation · cs.CV · 2026-04-08 · unverdicted · none · ref 15
LLaDA-V: Large Language Diffusion Models with Visual Instruction Tuning · cs.LG · 2025-05-22 · conditional · none · ref 32
LongCat-Image Technical Report · cs.CV · 2025-12-08 · unverdicted · none · ref 24
Show-o2: Improved Native Unified Multimodal Models · cs.CV · 2025-06-18 · unverdicted · none · ref 53
WISE: A World Knowledge-Informed Semantic Evaluation for Text-to-Image Generation · cs.CV · 2025-03-10 · unreviewed · ref 20
