Orthus: Autoregressive interleaved image-text generation with modality-specific heads
6 Pith papers cite this work.
citing papers explorer
- EduIllustrate: Towards Scalable Automated Generation Of Multimodal Educational Content
  EduIllustrate is a benchmark of 230 K-12 STEM problems that evaluates LLMs on interleaved text-diagram generation using sequential anchoring and an 8-dimension rubric, with Gemini 3.0 Pro Preview scoring highest at 87.8%.
- MAR-GRPO: Stabilized GRPO for AR-diffusion Hybrid Image Generation
  MAR-GRPO stabilizes GRPO for AR-diffusion hybrids via multi-trajectory expectation and uncertainty-based token selection, yielding better visual quality, stability, and spatial understanding than baselines.
- LLaDA-V: Large Language Diffusion Models with Visual Instruction Tuning
  LLaDA-V is a diffusion-based multimodal large language model that reaches competitive or state-of-the-art results on visual instruction tasks while using a non-autoregressive architecture.
- LongCat-Image Technical Report
  LongCat-Image delivers a compact 6B-parameter bilingual image generation model that sets new standards for Chinese character rendering accuracy and photorealism while remaining efficient and fully open-source.
- Show-o2: Improved Native Unified Multimodal Models
  Show-o2 unifies text, image, and video understanding and generation in a single autoregressive-plus-flow-matching model built on 3D causal VAE representations.
- WISE: A World Knowledge-Informed Semantic Evaluation for Text-to-Image Generation