citation dossier

Gaia-2: A controllable multi-view generative world model for autonomous driving. arXiv preprint arXiv:2503.20523

L · 2025 · arXiv 2503.20523

17 Pith papers citing it

17 reference links

cs.CV top field · 10 papers

UNVERDICTED top verdict bucket · 16 papers

This arXiv-backed work is queued for full Pith review when it crosses the high-inbound sweep. That review runs reader · skeptic · desk-editor · referee · rebuttal · circularity · lean confirmation · RS check · pith extraction.

why this work matters in Pith

Pith has found this work in 17 reviewed papers. Its strongest current cluster is cs.CV (10 papers). The largest review-status bucket among citing papers is UNVERDICTED (16 papers). For highly cited works, this page shows a dossier first and a bounded explorer second; it never tries to render every citing paper at once.

representative citing papers

HorizonDrive: Self-Corrective Autoregressive World Model for Long-horizon Driving Simulation

cs.CV · 2026-05-12 · conditional · novelty 7.0

HorizonDrive enables stable long-horizon autoregressive driving simulation via anti-drifting teacher training with scheduled rollout recovery and teacher rollout distillation.

Is Your Driving World Model an All-Around Player?

cs.CV · 2026-05-11 · unverdicted · novelty 7.0

WorldLens benchmark reveals no driving world model dominates across visual, geometric, behavioral, and perceptual fidelity, with contributions of a 26K human-annotated dataset and a distilled vision-language evaluator.

Latent State Design for World Models under Sufficiency Constraints

cs.AI · 2026-05-03 · unverdicted · novelty 7.0

World models succeed when their latent states are built to meet task-specific sufficiency constraints rather than preserving the maximum amount of information.

VistaBot: View-Robust Robot Manipulation via Spatiotemporal-Aware View Synthesis

cs.RO · 2026-04-23 · unverdicted · novelty 7.0

VistaBot integrates 4D geometry estimation and spatiotemporal view synthesis into action policies to improve cross-view generalization by 2.6-2.8x on a new VGS metric in simulation and real tasks.

MultiWorld: Scalable Multi-Agent Multi-View Video World Models

cs.CV · 2026-04-20 · unverdicted · novelty 7.0

MultiWorld is a scalable framework for multi-agent multi-view video world models that improves controllability and consistency over single-agent baselines in game and robot tasks.

ScenarioControl: Vision-Language Controllable Vectorized Latent Scenario Generation

cs.CV · 2026-04-18 · unverdicted · novelty 7.0

ScenarioControl introduces the first vision-language controllable generator for realistic vectorized 3D driving scenarios with temporal consistency across actor views.

Divide and Conquer: Decoupled Representation Alignment for Multimodal World Models

cs.CV · 2026-05-03 · unverdicted · novelty 6.0

M²-REPA decouples modality-specific features inside a diffusion model and aligns each to its matching expert foundation model via an alignment loss plus a decoupling regularizer, yielding better visual quality and long-term consistency in multi-modal video generation.

LA-Pose: Latent Action Pretraining Meets Pose Estimation

cs.CV · 2026-04-30 · unverdicted · novelty 6.0

LA-Pose achieves over 10% higher pose accuracy than recent feed-forward methods on Waymo and PandaSet benchmarks by repurposing latent actions from self-supervised inverse-dynamics pretraining while using orders of magnitude less labeled 3D data.

Human Cognition in Machines: A Unified Perspective of World Models

cs.RO · 2026-04-17 · unverdicted · novelty 6.0

The paper introduces a unified framework for world models that fully incorporates all cognitive functions from Cognitive Architecture Theory, highlights under-researched areas in motivation and meta-cognition, and proposes Epistemic World Models as a new category for scientific discovery agents.

Representations Before Pixels: Semantics-Guided Hierarchical Video Prediction

cs.CV · 2026-04-13 · unverdicted · novelty 6.0

Re2Pix decomposes video prediction into semantic feature forecasting followed by representation-conditioned diffusion synthesis, with nested dropout and mixed supervision to handle prediction errors.

LMGenDrive: Bridging Multimodal Understanding and Generative World Modeling for End-to-End Driving

cs.CV · 2026-04-09 · unverdicted · novelty 6.0

LMGenDrive unifies LLM-based multimodal understanding with generative world models to output both future driving videos and control signals for end-to-end closed-loop autonomous driving.

Video Generation Models as World Models: Efficient Paradigms, Architectures and Algorithms

eess.IV · 2026-03-30 · unverdicted · novelty 6.0

Video generation models can function as world simulators if efficiency gaps in spatiotemporal modeling are bridged via organized paradigms, architectures, and algorithms.

V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning

cs.AI · 2025-06-11 · unverdicted · novelty 6.0

V-JEPA 2 pre-trained on massive unlabeled video achieves strong results on motion understanding and action anticipation, SOTA video QA at 8B scale, and enables zero-shot robotic planning on Franka arms using only 62 hours of unlabeled robot video.

Asset Harvester: Extracting 3D Assets from Autonomous Driving Logs for Simulation

cs.CV · 2026-04-20 · unverdicted · novelty 5.0

Asset Harvester converts sparse in-the-wild object observations from AV driving logs into complete simulation-ready 3D assets via data curation, geometry-aware preprocessing, and a SparseViewDiT model that couples sparse-view multiview generation with 3D Gaussian lifting.

Artificial Intelligence for Modeling and Simulation of Mixed Automated and Human Traffic

cs.AI · 2026-04-14 · unverdicted · novelty 5.0

This survey synthesizes AI techniques for mixed autonomy traffic simulation and introduces a taxonomy spanning agent-level behavior models, environment-level methods, and cognitive/physics-informed approaches.

Ozone: A Unified Platform for Transportation Research

cs.DB · 2026-04-13 · unverdicted · novelty 5.0

Ozone unifies four trajectory datasets into a canonical format with standardized schemas and provides CARLA-based benchmarking, claiming 85% faster experiment setup and 91% cross-city transfer efficiency.

OpenWorldLib: A Unified Codebase and Definition of Advanced World Models

cs.CV · 2026-04-06 · unverdicted · novelty 4.0

OpenWorldLib offers a standardized codebase and definition for world models that combine perception, interaction, and memory to understand and predict the world.

citing papers explorer

Showing 17 of 17 citing papers.

HorizonDrive: Self-Corrective Autoregressive World Model for Long-horizon Driving Simulation · cs.CV · 2026-05-12 · conditional · none · ref 18
Is Your Driving World Model an All-Around Player? · cs.CV · 2026-05-11 · unverdicted · none · ref 27
Latent State Design for World Models under Sufficiency Constraints · cs.AI · 2026-05-03 · unverdicted · none · ref 54
VistaBot: View-Robust Robot Manipulation via Spatiotemporal-Aware View Synthesis · cs.RO · 2026-04-23 · unverdicted · none · ref 20
MultiWorld: Scalable Multi-Agent Multi-View Video World Models · cs.CV · 2026-04-20 · unverdicted · none · ref 33
ScenarioControl: Vision-Language Controllable Vectorized Latent Scenario Generation · cs.CV · 2026-04-18 · unverdicted · none · ref 41
Divide and Conquer: Decoupled Representation Alignment for Multimodal World Models · cs.CV · 2026-05-03 · unverdicted · none · ref 37
LA-Pose: Latent Action Pretraining Meets Pose Estimation · cs.CV · 2026-04-30 · unverdicted · none · ref 24
Human Cognition in Machines: A Unified Perspective of World Models · cs.RO · 2026-04-17 · unverdicted · none · ref 143
Representations Before Pixels: Semantics-Guided Hierarchical Video Prediction · cs.CV · 2026-04-13 · unverdicted · none · ref 60
LMGenDrive: Bridging Multimodal Understanding and Generative World Modeling for End-to-End Driving · cs.CV · 2026-04-09 · unverdicted · none · ref 39
Video Generation Models as World Models: Efficient Paradigms, Architectures and Algorithms · eess.IV · 2026-03-30 · unverdicted · none · ref 172
V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning · cs.AI · 2025-06-11 · unverdicted · none · ref 46
Asset Harvester: Extracting 3D Assets from Autonomous Driving Logs for Simulation · cs.CV · 2026-04-20 · unverdicted · none · ref 3
Artificial Intelligence for Modeling and Simulation of Mixed Automated and Human Traffic · cs.AI · 2026-04-14 · unverdicted · none · ref 19
Ozone: A Unified Platform for Transportation Research · cs.DB · 2026-04-13 · unverdicted · none · ref 5
OpenWorldLib: A Unified Codebase and Definition of Advanced World Models · cs.CV · 2026-04-06 · unverdicted · none · ref 104
