Pith · machine review for the scientific record

arxiv: 2405.09818 · v2 · submitted 2024-05-16 · 💻 cs.CL

Recognition: 3 theorem links

Chameleon: Mixed-Modal Early-Fusion Foundation Models

Chameleon Team

Authors on Pith: no claims yet

Pith reviewed 2026-05-11 09:57 UTC · model grok-4.3

classification 💻 cs.CL
keywords early-fusion · mixed-modal models · token-based architecture · multimodal foundation models · image captioning · image generation · unified multimodal modeling · arbitrary modality sequences

The pith

A single early-fusion token model processes and generates arbitrary sequences of text and images at competitive levels.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Chameleon as a family of models that fuse image and text tokens early in a shared sequence, allowing one architecture to handle understanding and generation across any mix of the two modalities. This setup includes a stable training recipe from the start plus an alignment process tailored to mixed inputs. If the approach holds, it removes the need for separate modality encoders, decoders, or post-training patches that current multimodal systems often require. A reader would care because it points toward foundation models that treat full documents as unified token streams rather than switching between text-only and vision-only subsystems.

Core claim

Chameleon is a family of early-fusion token-based mixed-modal models that understand and generate images and text in any arbitrary sequence. A stable training approach from inception, combined with a dedicated alignment recipe and architectural parameterization for the mixed-modal case, enables the models to reach state-of-the-art image captioning, outperform Llama-2 on text-only tasks while remaining competitive with Mixtral 8x7B and Gemini-Pro, and perform non-trivial image generation, all inside one model. On a new long-form mixed-modal generation benchmark, the models match or exceed the performance of much larger systems such as Gemini Pro and GPT-4V according to human judgments.

What carries the argument

Early-fusion token-based architecture that converts both images and text into a single shared token vocabulary and processes them as one sequence without modality-specific components.
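To make the shared-vocabulary idea concrete, here is a minimal sketch (our illustration, with invented vocabulary sizes and sentinel ids, not Chameleon's actual tokenizer): text BPE ids and discrete image codes occupy disjoint ranges of one vocabulary, so an interleaved document flattens into a single token stream.

```python
# Hypothetical early-fusion tokenization: text ids and image VQ codes
# share one vocabulary. All constants below are invented for illustration.
TEXT_VOCAB = 32_000          # text BPE ids occupy [0, TEXT_VOCAB)
IMAGE_CODEBOOK = 8_192       # image VQ codes are shifted past the text range
BOI = TEXT_VOCAB + IMAGE_CODEBOOK       # begin-of-image sentinel
EOI = TEXT_VOCAB + IMAGE_CODEBOOK + 1   # end-of-image sentinel

def fuse(segments):
    """Flatten interleaved (kind, ids) segments into one token stream."""
    seq = []
    for kind, ids in segments:
        if kind == "text":
            seq.extend(ids)                          # text ids used as-is
        elif kind == "image":
            seq.append(BOI)
            seq.extend(TEXT_VOCAB + c for c in ids)  # offset VQ codes
            seq.append(EOI)
    return seq

doc = [("text", [17, 942, 5]), ("image", [3, 8191, 40]), ("text", [88])]
tokens = fuse(doc)
```

The transformer then consumes `tokens` as an ordinary sequence; nothing downstream needs to know which positions came from an image.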

If this is right

  • Visual question answering, image captioning, text generation, and image generation become interchangeable tasks within one model.
  • Performance on mixed-modal long-form generation reaches or surpasses that of larger specialized models according to human raters.
  • Unified modeling of complete multimodal documents becomes practical without stitching together separate systems.
  • A single set of parameters scales across text-only, vision-only, and interleaved cases without additional engineering.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same token-sequence approach could extend directly to video or audio by adding more token types, treating them as additional elements in the same stream.
  • Training stability from inception may reduce the engineering overhead currently spent on modality alignment stages in other multimodal systems.
  • If the architecture generalizes, future scaling laws could be measured on mixed sequences rather than on text or images in isolation.
  • Human preference for the model's mixed outputs suggests it could support more natural document-level interactions than pipelines that alternate between separate models.

Load-bearing premise

An early-fusion token-based architecture can be trained stably from the beginning and aligned to handle arbitrary mixed image-text sequences without any modality-specific parts or later fixes.
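That premise can be rendered as a toy objective (our sketch, not the paper's code): one next-token cross-entropy applied uniformly over the fused sequence, with no branch on whether the target id is a text or image token.

```python
import math

def next_token_loss(logits, tokens):
    """Mean cross-entropy of predicting tokens[t+1] from logits[t].

    The loss is identical at every position; modality never enters.
    """
    total = 0.0
    for t in range(len(tokens) - 1):
        row = logits[t]
        z = max(row)                                     # numerical stability
        log_norm = z + math.log(sum(math.exp(x - z) for x in row))
        total += log_norm - row[tokens[t + 1]]           # -log p(next token)
    return total / (len(tokens) - 1)

# Toy shared vocab of 4 ids; pretend 0-1 are text and 2-3 are image codes.
logits = [[2.0, 0.0, 0.0, 0.0],   # position 0 predicts token 1
          [0.0, 0.0, 3.0, 0.0]]   # position 1 predicts token 2
tokens = [0, 0, 2]                # interleaved text/image targets
loss = next_token_loss(logits, tokens)
```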

What would settle it

Training runs that require separate pre-training stages for images and text, or that produce incoherent outputs on novel interleaved sequences, would show the unified early-fusion claim does not hold.

Original abstract

We present Chameleon, a family of early-fusion token-based mixed-modal models capable of understanding and generating images and text in any arbitrary sequence. We outline a stable training approach from inception, an alignment recipe, and an architectural parameterization tailored for the early-fusion, token-based, mixed-modal setting. The models are evaluated on a comprehensive range of tasks, including visual question answering, image captioning, text generation, image generation, and long-form mixed modal generation. Chameleon demonstrates broad and general capabilities, including state-of-the-art performance in image captioning tasks, outperforms Llama-2 in text-only tasks while being competitive with models such as Mixtral 8x7B and Gemini-Pro, and performs non-trivial image generation, all in a single model. It also matches or exceeds the performance of much larger models, including Gemini Pro and GPT-4V, according to human judgments on a new long-form mixed-modal generation evaluation, where either the prompt or outputs contain mixed sequences of both images and text. Chameleon marks a significant step forward in a unified modeling of full multimodal documents.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 3 minor

Summary. The manuscript introduces Chameleon, a family of early-fusion token-based mixed-modal foundation models that process and generate arbitrary sequences of images and text within a single unified transformer. It details an architectural parameterization using a shared vocabulary and transformer stack, a stable training recipe based on next-token prediction over interleaved multimodal sequences from inception, and a subsequent alignment stage. Evaluations span visual question answering, image captioning, text-only generation, image generation, and a new human preference evaluation on long-form mixed-modal outputs, with reported results including state-of-the-art image captioning, outperformance of Llama-2 on text tasks while remaining competitive with Mixtral 8x7B and Gemini-Pro, non-trivial image generation, and human judgments matching or exceeding those for Gemini Pro and GPT-4V.

Significance. If the results hold under scrutiny, the work provides concrete evidence that a single early-fusion token-based architecture can be trained stably at scale to handle mixed-modal sequences without modality-specific encoders, decoders, or post-hoc alignment modules. This could simplify multimodal system design and improve coherence in tasks requiring interleaved image-text reasoning and generation. The provision of a described training recipe, quantitative benchmark numbers, and a new human evaluation protocol on mixed sequences constitutes a useful contribution to the literature on unified multimodal models.

major comments (3)
  1. §5.2 (Image Captioning Results): The state-of-the-art claim on captioning benchmarks is load-bearing for the broad-capabilities argument, yet the manuscript provides no ablation isolating the contribution of early-fusion training data versus scale or architecture; without this, it remains unclear whether the gains derive from the unified design or from data advantages.
  2. Human Evaluation subsection (long-form mixed-modal): The preference results over GPT-4V and Gemini Pro are central to the claim of matching larger models on complex interleaved tasks; however, the section does not report inter-annotator agreement, prompt sampling methodology, or statistical significance tests, weakening the interpretability of the human judgments.
  3. §4 (Training and Alignment): The stable training recipe is presented as addressing the weakest assumption of training without modality-specific fixes, but the manuscript lacks quantitative diagnostics (e.g., loss curves per modality or gradient norms during early training) that would allow readers to verify stability at the reported scales.
minor comments (3)
  1. §3.1 (Architecture): The description of the unified vocabulary and image tokenization would benefit from an explicit example showing an interleaved sequence and its tokenization to clarify how early fusion is realized in practice.
  2. Figure 3 (Model scaling): Axis labels and legend entries are difficult to read at the published resolution; increasing font size or adding a supplementary table of exact parameter counts and training tokens would improve clarity.
  3. Related Work: The discussion of prior early-fusion approaches (e.g., references to Flamingo or other token-based multimodal models) could be expanded with a direct comparison table of architectural differences to better situate the contribution.
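Where major comment 2 asks for inter-annotator agreement, the standard statistic is Cohen's kappa; a minimal self-contained sketch over hypothetical paired preference labels (the data here is invented for illustration):

```python
from collections import Counter

def cohens_kappa(a, b):
    """Cohen's kappa: agreement between two raters, corrected for chance."""
    n = len(a)
    po = sum(x == y for x, y in zip(a, b)) / n        # observed agreement
    ca, cb = Counter(a), Counter(b)
    pe = sum(ca[k] * cb[k] for k in ca) / (n * n)     # chance agreement
    return (po - pe) / (1 - pe)

# Two raters judging which system's output they prefer on 8 prompts.
r1 = ["A", "A", "B", "A", "B", "B", "A", "A"]
r2 = ["A", "B", "B", "A", "B", "B", "A", "A"]
kappa = cohens_kappa(r1, r2)
```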

Simulated Author's Rebuttal

3 responses · 0 unresolved

Thank you for the constructive feedback and recommendation for minor revision. We address each major comment below, providing the strongest honest defense of the manuscript while noting where revisions or clarifications are warranted.

Point-by-point responses
  1. Referee: §5.2 (Image Captioning Results): The state-of-the-art claim on captioning benchmarks is load-bearing for the broad-capabilities argument, yet the manuscript provides no ablation isolating the contribution of early-fusion training data versus scale or architecture; without this, it remains unclear whether the gains derive from the unified design or from data advantages.

    Authors: We acknowledge that a controlled ablation isolating early-fusion data effects from scale and architecture would strengthen attribution of the captioning gains. However, the scale of these models makes such experiments computationally prohibitive. The early-fusion design is what enables training from inception on interleaved multimodal sequences without modality-specific encoders or post-hoc alignment; the data curation is a direct consequence of this architecture rather than an independent advantage. We will revise §5.2 to explicitly discuss this interdependence and add a limitations paragraph noting the lack of full ablations. revision: partial

  2. Referee: Human Evaluation subsection (long-form mixed-modal): The preference results over GPT-4V and Gemini Pro are central to the claim of matching larger models on complex interleaved tasks; however, the section does not report inter-annotator agreement, prompt sampling methodology, or statistical significance tests, weakening the interpretability of the human judgments.

    Authors: We agree that these details are necessary for full interpretability. In the revised manuscript we will report inter-annotator agreement (e.g., Cohen's kappa), provide a precise description of the prompt and output sampling procedure, and include statistical significance tests (p-values and confidence intervals) on the preference scores. revision: yes

  3. Referee: §4 (Training and Alignment): The stable training recipe is presented as addressing the weakest assumption of training without modality-specific fixes, but the manuscript lacks quantitative diagnostics (e.g., loss curves per modality or gradient norms during early training) that would allow readers to verify stability at the reported scales.

    Authors: We will add the requested diagnostics. The revised version will include per-modality loss curves and early-training gradient norm statistics (either in the main text or a new appendix) to allow readers to verify the claimed stability of the next-token-prediction recipe on mixed sequences. revision: yes
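The diagnostic promised in response 3 is cheap to produce: given one per-token loss vector and a boolean image-token mask, per-modality curves are just masked means. A minimal sketch with made-up numbers:

```python
def per_modality_means(token_losses, is_image):
    """Split one per-token loss vector into image and text means."""
    def mean(xs):
        return sum(xs) / len(xs) if xs else float("nan")
    img = [l for l, m in zip(token_losses, is_image) if m]
    txt = [l for l, m in zip(token_losses, is_image) if not m]
    return {"image": mean(img), "text": mean(txt)}

# One training step's per-token losses and modality mask (invented values).
losses = [2.1, 1.8, 4.0, 3.6, 1.9]
mask   = [False, False, True, True, False]
stats = per_modality_means(losses, mask)
```

Logged every step, the two means give exactly the per-modality loss curves the referee requested.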

Circularity Check

0 steps flagged

No significant circularity identified

Full rationale

The paper describes an early-fusion token-based architecture, a stable next-token prediction training recipe on unified sequences, an alignment stage, and reports empirical results on standard benchmarks plus human evaluations. No load-bearing step reduces by construction to fitted parameters, self-definitions, or self-citation chains; performance claims rest on direct external comparisons (Llama-2, Mixtral, Gemini-Pro, GPT-4V) that are independently verifiable. The central premise of unified mixed-modal capability is presented as a consequence of the architectural choices and training procedure, without renaming known results or smuggling ansatzes via self-citation.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 0 invented entities

The central claim rests on standard transformer assumptions plus empirical training choices; no new physical entities are postulated.

free parameters (2)
  • model size variants
    The family includes multiple parameter counts chosen to balance capability and compute.
  • training hyperparameters
    Specific learning rates, batch sizes, and alignment parameters required for stable mixed-modal training.
axioms (1)
  • domain assumption: Images and text can be represented as a shared token vocabulary without loss of essential information
    Core premise of the early-fusion token approach stated in the abstract.

pith-pipeline@v0.9.0 · 5481 in / 1246 out tokens · 48154 ms · 2026-05-11T09:57:11.737205+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Forward citations

Cited by 58 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Cross-Modal Backdoors in Multimodal Large Language Models

    cs.CR 2026-05 unverdicted novelty 8.0

    Poisoning a single connector in MLLMs establishes a reusable latent backdoor pathway that transfers across modalities with over 95% attack success rate under bounded perturbations.

  2. G$^2$TR: Generation-Guided Visual Token Reduction for Separate-Encoder Unified Multimodal Models

    cs.CV 2026-05 unverdicted novelty 7.0

    G²TR reduces visual tokens and prefill computation by 1.94x in separate-encoder UMMs via generation-guided importance from VAE latent consistency while preserving reasoning accuracy and editing quality.

  3. OA-WAM: Object-Addressable World Action Model for Robust Robot Manipulation

    cs.RO 2026-05 unverdicted novelty 7.0

    OA-WAM uses persistent address vectors and dynamic content vectors in object slots to enable addressable world-action prediction, improving robustness on manipulation benchmarks under scene changes.

  4. SIGMA-ASL: Sensor-Integrated Multimodal Dataset for Sign Language Recognition

    cs.HC 2026-05 unverdicted novelty 7.0

    SIGMA-ASL is a multimodal dataset with 93,545 word-level ASL clips from Kinect RGB-D, mmWave radar, and dual IMUs, plus benchmarking protocols for single- and multi-modal recognition.

  5. Thinking in Text and Images: Interleaved Vision--Language Reasoning Traces for Long-Horizon Robot Manipulation

    cs.AI 2026-05 unverdicted novelty 7.0

    A multimodal transformer generates and caches interleaved text-image traces to guide closed-loop actions, achieving 92.4% success on LIBERO-Long and 95.5% average on LIBERO.

  6. Beyond Accuracy: Benchmarking Cross-Task Consistency in Unified Multimodal Models

    cs.CV 2026-04 unverdicted novelty 7.0

    XTC-Bench reveals that strong performance on generation or understanding tasks in unified multimodal models does not guarantee cross-task semantic consistency, which instead depends on how tightly coupled the learning...

  7. SketchVLM: Vision language models can annotate images to explain thoughts and guide users

    cs.CV 2026-04 unverdicted novelty 7.0

    SketchVLM lets VLMs generate non-destructive SVG annotations on input images to visually explain answers, raising visual reasoning accuracy by up to 28.5 points and annotation quality by 1.48x over baselines.

  8. Probing Visual Planning in Image Editing Models

    cs.CV 2026-04 unverdicted novelty 7.0

    Image editing models fail zero-shot visual planning on abstract mazes and queen puzzles but generalize after finetuning, yet still cannot match human zero-shot efficiency.

  9. Co-generation of Layout and Shape from Text via Autoregressive 3D Diffusion

    cs.CV 2026-04 unverdicted novelty 7.0

    3D-ARD+ unifies autoregressive token prediction with diffusion-based 3D latent generation to co-produce indoor scene layouts and object geometries that follow complex text-specified spatial and semantic constraints.

  10. Pseudo-Unification: Entropy Probing Reveals Divergent Information Patterns in Unified Multimodal Models

    cs.CV 2026-04 unverdicted novelty 7.0

    Unified multimodal models exhibit pseudo-unification due to modality-asymmetric entropy encoding and pattern-split responses between text and image generation.

  11. TableVision: A Large-Scale Benchmark for Spatially Grounded Reasoning over Complex Hierarchical Tables

    cs.AI 2026-04 conditional novelty 7.0

    TableVision benchmark shows explicit spatial grounding recovers MLLM reasoning on hierarchical tables, delivering 12.3% accuracy improvement through a decoupled perception-reasoning framework.

  12. Transfer between Modalities with MetaQueries

    cs.CV 2025-04 unverdicted novelty 7.0

    MetaQueries act as an efficient bridge allowing multimodal LLMs to augment diffusion-based image generation and editing without complex training or unfreezing the LLM backbone.

  13. Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality

    cs.LG 2024-05 unverdicted novelty 7.0

    Transformers and SSMs are unified through structured state space duality, producing a 2-8X faster Mamba-2 model that remains competitive with Transformers.

  14. Yeti: A compact protein structure tokenizer for reconstruction and multi-modal generation

    q-bio.BM 2026-05 unverdicted novelty 6.0

    Yeti is a compact tokenizer for protein structures that delivers strong codebook use, token diversity, and reconstruction while enabling from-scratch multimodal generation of plausible sequences and structures with 10...

  15. TextLDM: Language Modeling with Continuous Latent Diffusion

    cs.CL 2026-05 unverdicted novelty 6.0

    TextLDM applies DiT-style latent diffusion with flow matching to language modeling via a REPA-aligned VAE, outperforming prior diffusion LMs and matching GPT-2 when trained from scratch on OpenWebText2.

  16. Learning Discrete Autoregressive Priors with Wasserstein Gradient Flow

    cs.CV 2026-05 unverdicted novelty 6.0

    wAR-Tok adds a Wasserstein-gradient-flow prior-matching term to tokenizer training so that discrete tokens become easier for autoregressive priors to model, cutting AR loss and raising generation FID on CIFAR-10 and I...

  17. MUSE: Resolving Manifold Misalignment in Visual Tokenization via Topological Orthogonality

    cs.CV 2026-05 unverdicted novelty 6.0

    MUSE decouples reconstruction and semantic learning in visual tokenization via topological orthogonality, yielding SOTA generation quality and improved semantic performance over its teacher model.

  18. Refinement via Regeneration: Enlarging Modification Space Boosts Image Refinement in Unified Multimodal Models

    cs.CV 2026-04 unverdicted novelty 6.0

    Refinement via Regeneration (RvR) reformulates image refinement in unified multimodal models as conditional regeneration using prompt and semantic tokens from the initial image, yielding higher alignment scores than e...

  19. Meta-CoT: Enhancing Granularity and Generalization in Image Editing

    cs.CV 2026-04 unverdicted novelty 6.0

    Meta-CoT uses two-level decomposition of editing operations into meta-tasks and a CoT consistency reward to improve granularity and generalization, reporting 15.8% gains across 21 tasks.

  20. CheXmix: Unified Generative Pretraining for Vision Language Models in Medical Imaging

    cs.CV 2026-04 unverdicted novelty 6.0

    CheXmix combines masked autoencoder pretraining with early-fusion generative modeling to outperform prior models on chest X-ray classification by up to 8.6% AUROC, inpainting by 51%, and report generation by 45% on GREEN.

  21. Seeing Without Eyes: 4D Human-Scene Understanding from Wearable IMUs

    cs.CV 2026-04 unverdicted novelty 6.0

    IMU-to-4D uses wearable IMU data and repurposed LLMs to predict coherent 4D human motion plus coarse scene structure, outperforming cascaded state-of-the-art pipelines in temporal stability.

  22. How Far Are Video Models from True Multimodal Reasoning?

    cs.CV 2026-04 unverdicted novelty 6.0

    Current video models succeed on basic understanding but achieve under 25% success on logically grounded generation and near 0% on interactive generation, exposing gaps in multimodal reasoning.

  23. Transformer Architecture with Minimal Inference Latency for Multi-Modal Wireless Networks

    eess.SY 2026-04 unverdicted novelty 6.0

    A token-routing multi-modal transformer reduces inference latency by 86.2%, GPU memory by 35%, and FLOPs by 80% for beamforming tasks with negligible accuracy loss while enabling proactive handover on a real testbed dataset.

  24. OmniMouse: Scaling properties of multi-modal, multi-task Brain Models on 150B Neural Tokens

    q-bio.NC 2026-04 unverdicted novelty 6.0

    OmniMouse demonstrates data-driven scaling in multi-task brain models on a 150B-token neural dataset, achieving SOTA across prediction, decoding, and forecasting while model size gains saturate.

  25. On the Robustness of Watermarking for Autoregressive Image Generation

    cs.CV 2026-04 unverdicted novelty 6.0

    Watermarking schemes for autoregressive image generation fail against removal and forgery attacks, enabling false detections and undermining synthetic content filtering.

  26. Visual Enhanced Depth Scaling for Multimodal Latent Reasoning

    cs.CV 2026-04 unverdicted novelty 6.0

    Visual replay module and adaptive depth scaling improve multimodal latent reasoning, reaching SOTA benchmarks with faster inference than explicit chain-of-thought methods.

  27. Uni-ViGU: Towards Unified Video Generation and Understanding via A Diffusion-Based Video Generator

    cs.CV 2026-04 unverdicted novelty 6.0

    Uni-ViGU unifies video generation and understanding by extending a diffusion video generator with unified continuous-discrete flow matching, modality-driven MoE layers, and bidirectional training stages that repurpose...

  28. Symbiotic-MoE: Unlocking the Synergy between Generation and Understanding

    cs.CV 2026-04 unverdicted novelty 6.0

    Symbiotic-MoE introduces modality-aware expert disentanglement and progressive training in a multimodal MoE to achieve synergistic generation and understanding without task interference or extra parameters.

  29. CLEAR: Unlocking Generative Potential for Degraded Image Understanding in Unified Multimodal Models

    cs.CV 2026-04 unverdicted novelty 6.0

    CLEAR uses degradation-aware fine-tuning, a latent representation bridge, and interleaved reinforcement learning to connect generative and reasoning capabilities in multimodal models for better degraded image understanding.

  30. SmolVLA: A Vision-Language-Action Model for Affordable and Efficient Robotics

    cs.LG 2025-06 unverdicted novelty 6.0

    SmolVLA is a small efficient VLA model that achieves performance comparable to 10x larger models while training on one GPU and deploying on consumer hardware via community data and chunked asynchronous action prediction.

  31. Emu3: Next-Token Prediction is All You Need

    cs.CV 2024-09 unverdicted novelty 6.0

    Emu3 shows that next-token prediction on a unified discrete token space for text, images, and video lets a single transformer outperform task-specific models such as SDXL and LLaVA-1.6 in multimodal generation and perception.

  32. Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model

    cs.AI 2024-08 unverdicted novelty 6.0

    A single transformer combines language modeling loss and diffusion loss on mixed-modality data, scaling to 7B parameters and 2T tokens while matching specialized language and diffusion models.

  33. SenseNova-U1: Unifying Multimodal Understanding and Generation with NEO-unify Architecture

    cs.CV 2026-05 unverdicted novelty 5.0

    SenseNova-U1 presents native unified multimodal models that match top understanding VLMs while delivering strong performance in image generation, infographics, and interleaved tasks via the NEO-unify architecture.

  34. SynerMedGen: Synergizing Medical Multimodal Understanding with Generation via Task Alignment

    cs.CV 2026-05 unverdicted novelty 5.0

    SynerMedGen introduces generation-aligned understanding tasks and a two-stage training strategy that enables strong zero-shot medical image synthesis performance and outperforms specialized models when generation trai...

  35. Steering Visual Generation in Unified Multimodal Models with Understanding Supervision

    cs.CV 2026-05 unverdicted novelty 5.0

    Using understanding tasks as direct supervision during post-training improves image generation and editing in unified multimodal models.

  36. Let ViT Speak: Generative Language-Image Pre-training

    cs.CV 2026-05 unverdicted novelty 5.0

    GenLIP pretrains ViTs to generate language tokens from visual tokens via autoregressive language modeling, matching strong baselines on multimodal tasks with less data.

  37. Visual Generation in the New Era: An Evolution from Atomic Mapping to Agentic World Modeling

    cs.CV 2026-04 unverdicted novelty 5.0

    Visual generation models are evolving from passive renderers to interactive agentic world modelers, but current systems lack spatial reasoning, temporal consistency, and causal understanding, with evaluations overemph...

  38. Tuna-2: Pixel Embeddings Beat Vision Encoders for Multimodal Understanding and Generation

    cs.CV 2026-04 unverdicted novelty 5.0

    Tuna-2 shows pixel embeddings can replace vision encoders in unified multimodal models, achieving competitive or superior results on understanding and generation benchmarks.

  39. UniGenDet: A Unified Generative-Discriminative Framework for Co-Evolutionary Image Generation and Generated Image Detection

    cs.CV 2026-04 unverdicted novelty 5.0

    UniGenDet unifies generative and discriminative models through symbiotic self-attention and detector-guided alignment to co-evolve image generation and authenticity detection.

  40. Sema: Semantic Transport for Real-Time Multimodal Agents

    cs.MM 2026-04 unverdicted novelty 5.0

    Sema reduces uplink bandwidth by 64x for audio and 130-210x for screenshots while keeping multimodal agent task accuracy within 0.7 percentage points of raw baselines in WAN simulations.

  41. SpatialImaginer: Towards Adaptive Visual Imagination for Spatial Reasoning

    cs.CV 2026-04 unverdicted novelty 5.0

    SpatialImaginer integrates visual imagination with textual chain-of-thought to improve spatial reasoning robustness in multimodal large language models.

  42. Free Lunch for Unified Multimodal Models: Enhancing Generation via Reflective Rectification with Inherent Understanding

    cs.CV 2026-04 unverdicted novelty 5.0

    UniRect-CoT is a training-free rectification chain-of-thought framework that treats diffusion denoising as visual reasoning and uses the model's inherent understanding to align and correct intermediate generation results.

  43. Visual Enhanced Depth Scaling for Multimodal Latent Reasoning

    cs.CV 2026-04 unverdicted novelty 5.0

    A visual replay module combined with adaptive depth scaling improves multimodal latent reasoning, delivering state-of-the-art benchmark results and faster inference than explicit chain-of-thought methods.

  44. Visual Enhanced Depth Scaling for Multimodal Latent Reasoning

    cs.CV 2026-04 unverdicted novelty 5.0

    Visual replay and depth scaling in latent reasoning produce state-of-the-art multimodal results with faster inference than explicit CoT.

  45. MedLVR: Latent Visual Reasoning for Reliable Medical Visual Question Answering

    cs.CV 2026-04 unverdicted novelty 5.0

    MedLVR interleaves latent visual reasoning segments in autoregressive decoding and uses two-stage training to raise average medical VQA accuracy from 48.3% to 53.4% over a Qwen2.5-VL-7B backbone on OmniMedVQA and five...

  46. Firebolt-VL: Efficient Vision-Language Understanding with Cross-Modality Modulation

    cs.CV 2026-04 unverdicted novelty 5.0

    Firebolt-VL introduces an LFM-based decoder and token-grid correlation to achieve linear-time vision-language inference with improved fine-grained grounding.

  47. Motus: A Unified Latent Action World Model

    cs.CV 2025-12 unverdicted novelty 5.0

    Motus unifies understanding, video generation, and action in one latent world model via MoT experts and optical-flow latent actions, reporting gains over prior methods in simulation and real robots.

  48. WorldVLA: Towards Autoregressive Action World Model

    cs.RO 2025-06 unverdicted novelty 5.0

    WorldVLA unifies VLA and world models in one autoregressive system, shows they boost each other, and adds an attention mask to stop error buildup when generating action chunks.

  49. Emerging Properties in Unified Multimodal Pretraining

    cs.CV 2025-05 unverdicted novelty 5.0

    BAGEL is a unified decoder-only model that develops emerging complex multimodal reasoning abilities after pretraining on large-scale interleaved data and outperforms prior open-source unified models.

  50. BLIP3-o: A Family of Fully Open Unified Multimodal Models-Architecture, Training and Dataset

    cs.CV 2025-05 conditional novelty 5.0

    BLIP3-o uses a diffusion transformer to generate CLIP image features and a sequential pretraining strategy to build open models that perform strongly on both image understanding and generation benchmarks.

  51. Show-o: One Single Transformer to Unify Multimodal Understanding and Generation

    cs.CV 2024-08 unverdicted novelty 5.0

    Show-o unifies autoregressive and discrete diffusion modeling inside one transformer to support multimodal understanding and generation tasks with competitive benchmark performance.

  52. MMCORE: MultiModal COnnection with Representation Aligned Latent Embeddings

    cs.CV 2026-04 unverdicted novelty 4.0

    MMCORE transfers VLM reasoning into diffusion-based image generation and editing via aligned latent embeddings from learnable queries, outperforming baselines on text-to-image and editing tasks.

  53. Identifying Topological Invariants of Non-Hermitian Systems via Domain-Adaptive Multimodal Model for Mathematics

    cond-mat.other 2026-04 unverdicted novelty 4.0

    A multimodal model with Qwen Math backbone identifies topological invariants of non-Hermitian systems from eigenvalues and eigenvectors in momentum space.

  54. Show-o2: Improved Native Unified Multimodal Models

    cs.CV 2025-06 unverdicted novelty 4.0

    Show-o2 unifies text, image, and video understanding and generation in a single autoregressive-plus-flow-matching model built on 3D causal VAE representations.

  55. Seed1.5-VL Technical Report

    cs.CV 2025-05 unverdicted novelty 4.0

    Seed1.5-VL is a compact multimodal model that sets new records on dozens of vision-language benchmarks and outperforms prior systems on agent-style tasks.

  56. PaliGemma: A versatile 3B VLM for transfer

    cs.CV 2024-07 unverdicted novelty 4.0

    PaliGemma is an open 3B VLM based on SigLIP and Gemma that achieves strong performance on nearly 40 diverse open-world tasks including benchmarks, remote-sensing, and segmentation.

  57. Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling

    cs.AI 2025-01 conditional novelty 3.0

    Scaling data, model size, and training optimization on the Janus architecture yields better multimodal understanding and more stable, instruction-following text-to-image generation.

  58. Cosmos World Foundation Model Platform for Physical AI

    cs.CV 2025-01 unverdicted novelty 3.0

    The Cosmos platform supplies open-source pre-trained world models and supporting tools for building fine-tunable digital world simulations to train Physical AI.

Reference graph

Works this paper leans on

36 extracted references · 36 canonical work pages · cited by 56 Pith papers · 21 internal anchors

  1. [1]

    CM3: A Causal Masked Multimodal Model of the Internet

    Armen Aghajanyan, Bernie Huang, Candace Ross, Vladimir Karpukhin, Hu Xu, Naman Goyal, Dmytro Okhonko, Mandar Joshi, Gargi Ghosh, Mike Lewis, et al. CM3: A causal masked multimodal model of the internet. arXiv preprint arXiv:2201.07520.

  2. [2]

    Scaling Laws for Generative Mixed-Modal Language Models

    Armen Aghajanyan, Lili Yu, Alexis Conneau, Wei-Ning Hsu, Karen Hambardzumyan, Susan Zhang, Stephen Roller, Naman Goyal, Omer Levy, and Luke Zettlemoyer. Scaling laws for generative mixed-modal language models. arXiv preprint arXiv:2301.03728.

  3. [3]

    Miniatuurpaardjes prijskamp (miniature horse competition) - Agriflanders 2009

    Agriflanders. Miniatuurpaardjes prijskamp (miniature horse competition) - Agriflanders 2009.

  4. [4]

    BEiT: BERT Pre-Training of Image Transformers

    Hangbo Bao, Li Dong, Songhao Piao, and Furu Wei. BEiT: BERT pre-training of image transformers. arXiv preprint arXiv:2106.08254.

  5. [5]

    PaLM: Scaling Language Modeling with Pathways

    Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. PaLM: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311.

  6. [6]

    BoolQ: Exploring the Surprising Difficulty of Natural Yes/No Questions

    Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, and Kristina Toutanova. BoolQ: Exploring the surprising difficulty of natural yes/no questions. arXiv preprint arXiv:1905.10044.

  7. [7]

    Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge

    Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? Try ARC, the AI2 Reasoning Challenge. arXiv preprint arXiv:1803.05457.

  8. [8]

    Training Verifiers to Solve Math Word Problems

    Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168.

  9. [9]

    Make-a-scene: Scene-based text-to-image generation with human priors

    Oran Gafni, Adam Polyak, Oron Ashual, Shelly Sheynin, Devi Parikh, and Yaniv Taigman. Make-A-Scene: Scene-based text-to-image generation with human priors. arXiv preprint arXiv:2203.13131.

  10. [10]

    Gemini: A Family of Highly Capable Multimodal Models

    Gemini Team, Rohan Anil, Sebastian Borgeaud, Yonghui Wu, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, et al. Gemini: A family of highly capable multimodal models. arXiv preprint arXiv:2312.11805.

  11. [11]

    Measuring Mathematical Problem Solving With the MATH Dataset

    Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the MATH dataset. arXiv preprint arXiv:2103.03874.

  12. [12]

    Perceiver IO: A General Architecture for Structured Inputs & Outputs

    Andrew Jaegle, Sebastian Borgeaud, Jean-Baptiste Alayrac, Carl Doersch, Catalin Ionescu, David Ding, Skanda Koppula, Daniel Zoran, Andrew Brock, Evan Shelhamer, et al. Perceiver IO: A general architecture for structured inputs & outputs. arXiv preprint arXiv:2107.14795.

  13. [13]

    Mistral 7B

    Albert Q Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, et al. Mistral 7B. arXiv preprint arXiv:2310.06825.

  14. [14]

    Mixtral of Experts

    Albert Q Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand, et al. Mixtral of experts. arXiv preprint arXiv:2401.04088.

  15. [15]

    Unified Language-Vision Pretraining with Dynamic Discrete Visual Tokenization

    Yang Jin, Kun Xu, Liwei Chen, Chao Liao, Jianchao Tan, Bin Chen, Chenyi Lei, An Liu, Chengru Song, Xiaoqiang Lei, et al. Unified language-vision pretraining with dynamic discrete visual tokenization. arXiv preprint arXiv:2309.04669.

  16. [16]

    SentencePiece: A Simple and Language Independent Subword Tokenizer and Detokenizer for Neural Text Processing

    Taku Kudo and John Richardson. SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. arXiv preprint arXiv:1808.06226.

  17. [17]

    OBELISC: An Open Web-Scale Filtered Dataset of Interleaved Image-Text Documents

    Hugo Laurençon, Lucile Saulnier, Léo Tronchon, Stas Bekman, Amanpreet Singh, Anton Lozhkov, Thomas Wang, Siddharth Karamcheti, Alexander M Rush, Douwe Kiela, et al. OBELISC: An open web-scale filtered dataset of interleaved image-text documents. arXiv preprint arXiv:2306.16527.

  18. [18]

    Improved Baselines with Visual Instruction Tuning

    Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. arXiv preprint arXiv:2310.03744, 2023. Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. arXiv preprint arXiv:2304.08485, 2023.

  19. [19]

    Decoupled Weight Decay Regularization

    Ze Liu, Han Hu, Yutong Lin, Zhuliang Yao, Zhenda Xie, Yixuan Wei, Jia Ning, Yue Cao, Zheng Zhang, Li Dong, Furu Wei, and Baining Guo. Swin Transformer V2: Scaling up capacity and resolution, 2022. https://arxiv.org/abs/2111.09883. Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101.

  20. [20]

    Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision, Language, Audio, and Action

    Unified-IO 2: Scaling autoregressive multimodal models with vision, language, audio, and action. arXiv preprint arXiv:2312.17172, 2023. https://arxiv.org/abs/2312.17172.

  21. [21]

    K-Alpha Calculator – Krippendorff's Alpha Calculator

    Giacomo Marzi, Marco Balzano, and Davide Marchiori. K-alpha calculator – Krippendorff's alpha calculator: A user-friendly tool for computing Krippendorff's alpha inter-rater reliability coefficient. MethodsX, 12:102545. ISSN 2215-0161. doi: 10.1016/j.mex.2023.102545. https://www.sciencedirect.com/science/article/pii/S2215016123005411. Todor Mihaylov, Peter Clark, Tushar Khot, and Ashish Sabharwal. Can a suit of armor conduct electricity? A new dataset for open book question answering. arXiv preprint arXiv:1809.02789.

  22. [22]

    Zero-Shot Text-to-Image Generation

    Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever. Zero-shot text-to-image generation. arXiv preprint arXiv:2102.12092.

  23. [23]

    Hierarchical Text-Conditional Image Generation with CLIP Latents

    Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with CLIP latents. arXiv preprint arXiv:2204.06125.

  24. [24]

    Code Llama: Open Foundation Models for Code

    Baptiste Roziere, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi, Jingyu Liu, Tal Remez, Jérémy Rapin, et al. Code Llama: Open foundation models for code. arXiv preprint arXiv:2308.12950.

  25. [25]

    SocialIQA: Commonsense Reasoning about Social Interactions

    Maarten Sap, Hannah Rashkin, Derek Chen, Ronan LeBras, and Yejin Choi. SocialIQA: Commonsense reasoning about social interactions. arXiv preprint arXiv:1904.09728.

  26. [26]

    Pretraining on the Test Set Is All You Need

    Rylan Schaeffer. Pretraining on the test set is all you need. arXiv preprint arXiv:2309.08632.

  27. [27]

    LAION-5B: An open large-scale dataset for training next generation image-text models

    Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, et al. LAION-5B: An open large-scale dataset for training next generation image-text models. arXiv preprint arXiv:2210.08402.

  28. [28]

    Mille-feuille, 2010.https://en.wikipedia.org/wiki/File:Mille-feuille_20100916.jpg

    Georges Seguin. Mille-feuille, 2010. https://en.wikipedia.org/wiki/File:Mille-feuille_20100916.jpg. CC BY-SA 3.0, https://creativecommons.org/licenses/by-sa/3.0/deed.en. Rico Sennrich, Barry Haddow, and Alexandra Birch. Neural machine translation of rare words with subword units. In ACL, Berlin, Germany, 2016. https://aclanthology.org/P16-1162. ShareGPT. GP...

  29. [29]

    GLU Variants Improve Transformer

    Noam Shazeer. GLU variants improve transformer. arXiv preprint arXiv:2002.05202.

  30. [30]

    Sagrada Familia July 2022, 2022. https://en.wikipedia.org/wiki/File:Sagrada_Familia_%28July_2022%29_08.jpg

    Maksim Sokolov. Sagrada Familia July 2022, 2022. https://en.wikipedia.org/wiki/File:Sagrada_Familia_%28July_2022%29_08.jpg. CC BY-SA 4.0, https://creativecommons.org/licenses/by-sa/4.0/deed.en. Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. Th...

  31. [31]

    RoFormer: Enhanced Transformer with Rotary Position Embedding

    Jianlin Su, Yu Lu, Shengfeng Pan, Ahmed Murtadha, Bo Wen, and Yunfeng Liu. RoFormer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:2104.09864.

  32. [32]

    Llama 2: Open Foundation and Fine-Tuned Chat Models

    Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288.

  33. [33]

    Small-scale proxies for large-scale transformer training instabilities

    Mitchell Wortsman, Peter J Liu, Lechao Xiao, Katie Everett, Alex Alemi, Ben Adlam, John D Co-Reyes, Izzeddin Gur, Abhishek Kumar, Roman Novak, et al. Small-scale proxies for large-scale transformer training instabilities. arXiv preprint arXiv:2309.14322.

  34. [34]

    Scaling Autoregressive Multi-Modal Models: Pretraining and Instruction Tuning

    Lili Yu, Bowen Shi, Ramakanth Pasunuru, Benjamin Muller, Olga Golovneva, Tianlu Wang, Arun Babu, Binh Tang, Brian Karrer, Shelly Sheynin, et al. Scaling autoregressive multi-modal models: Pretraining and instruction tuning. arXiv preprint arXiv:2309.02591.

  35. [35]

    HellaSwag: Can a Machine Really Finish Your Sentence?

    Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. HellaSwag: Can a machine really finish your sentence? arXiv preprint arXiv:1905.07830.

  36. [36]

    LIMA: Less Is More for Alignment

    Chunting Zhou, Pengfei Liu, Puxin Xu, Srini Iyer, Jiao Sun, Yuning Mao, Xuezhe Ma, Avia Efrat, Ping Yu, Lili Yu, et al. LIMA: Less is more for alignment. arXiv preprint arXiv:2305.11206.