arxiv: 2507.04447 · v3 · submitted 2025-07-06 · 💻 cs.CV · cs.RO

Recognition: 2 theorem links

· Lean Theorem

DreamVLA: A Vision-Language-Action Model Dreamed with Comprehensive World Knowledge

Wenyao Zhang , Hongsi Liu , Zekun Qi , Yunnan Wang , Xinqiang Yu , Jiazhao Zhang , Runpei Dong , Jiawei He

show 6 more authors

Fan Lu He Wang Zhizheng Zhang Li Yi Wenjun Zeng Xin Jin

Authors on Pith no claims yet

Pith reviewed 2026-05-16 15:38 UTC · model grok-4.3

classification 💻 cs.CV cs.RO

keywords vision-language-actionworld knowledge predictionrobot manipulationstructured attentiondiffusion transformerdynamic region guidanceinverse dynamics

0 comments

The pith

DreamVLA forecasts compact dynamic, spatial and semantic world knowledge to drive a perception-prediction-action loop that raises robot manipulation success.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces DreamVLA as a vision-language-action model that replaces full future-image generation with targeted prediction of dynamic regions, spatial relations and semantic cues. These compact forecasts supply the information needed for inverse-dynamics action planning and are kept disentangled by a block-wise structured attention mask that blocks cross-talk among the three knowledge streams. A diffusion transformer then samples actions from the resulting latent features. The design produces a 76.7 percent success rate on real-robot tasks and a 4.44 average length on the CALVIN ABC-D benchmark.

Core claim

DreamVLA establishes a perception-prediction-action loop by forecasting dynamic-region-guided world knowledge that is integrated with spatial and semantic cues, thereby supplying compact yet comprehensive representations for action planning. Block-wise structured attention masks mutual attention among the three knowledge types to prevent leakage and maintain clean, disentangled representations. A diffusion-based transformer models the conditional distribution over future actions from the shared latent features produced by the forecasts.

What carries the argument

Dynamic-region-guided world knowledge prediction combined with spatial and semantic cues, enforced by block-wise structured attention that masks cross-stream interactions.

If this is right

The model reaches 76.7 percent success on real-robot manipulation tasks.
It attains an average length of 4.44 on the CALVIN ABC-D benchmark.
Inverse-dynamics modeling becomes feasible once compact world-knowledge forecasts replace redundant image predictions.
Disentangled representations support more reliable conditional action sampling via the diffusion transformer.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Chaining multiple short-horizon knowledge forecasts could support longer task sequences without retraining the entire model.
The same block-wise masking pattern may reduce interference in other multimodal prediction settings that combine motion, layout and object semantics.
Compact forecasts lower the pixel-level reconstruction burden, potentially allowing smaller training corpora than full-image VLA baselines.

Load-bearing premise

The block-wise attention mask successfully isolates dynamic, spatial and semantic streams without removing the interactions required for coherent forecasts.

What would settle it

Replacing the world-knowledge prediction head with standard image-generation forecasting and measuring whether real-robot success falls below 76.7 percent on the same task set.

read the original abstract

Recent advances in vision-language-action (VLA) models have shown promise in integrating image generation with action prediction to improve generalization and reasoning in robot manipulation. However, existing methods are limited to challenging image-based forecasting, which suffers from redundant information and lacks comprehensive and critical world knowledge, including dynamic, spatial and semantic information. To address these limitations, we propose DreamVLA, a novel VLA framework that integrates comprehensive world knowledge forecasting to enable inverse dynamics modeling, thereby establishing a perception-prediction-action loop for manipulation tasks. Specifically, DreamVLA introduces a dynamic-region-guided world knowledge prediction, integrated with the spatial and semantic cues, which provide compact yet comprehensive representations for action planning. This design aligns with how humans interact with the world by first forming abstract multimodal reasoning chains before acting. To mitigate interference among the dynamic, spatial and semantic information during training, we adopt a block-wise structured attention mechanism that masks their mutual attention, preventing information leakage and keeping each representation clean and disentangled. Moreover, to model the conditional distribution over future actions, we employ a diffusion-based transformer that disentangles action representations from shared latent features. Extensive experiments on both real-world and simulation environments demonstrate that DreamVLA achieves 76.7% success rate on real robot tasks and 4.44 average length on the CALVIN ABC-D benchmarks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

DreamVLA adds dynamic-region world knowledge forecasting and block-wise attention masking to VLAs, reports solid benchmark numbers, but supplies no ablations to show the masking actually keeps representations disentangled.

read the letter

This paper's main contribution is a VLA model called DreamVLA that forecasts comprehensive world knowledge—dynamic, spatial, and semantic—before predicting actions. It uses dynamic-region guidance for the forecasting and block-wise structured attention to mask cross-block interactions, supposedly keeping the representations disentangled. A diffusion-based transformer then handles the action distribution. They report 76.7% success on real robot tasks and 4.44 average length on CALVIN ABC-D. The new part is the specific setup for injecting structured world knowledge via this forecasting loop and the attention masking to prevent leakage between cue types. It builds on the idea of image generation in VLAs but shifts to more abstract knowledge prediction, which aligns with the human-like reasoning they mention. The work does well in presenting a coherent architecture that ties perception to prediction to action. The numbers are solid for the benchmarks they use, and the motivation from limitations in existing image-based forecasting is clear. The soft spot is exactly what the stress-test highlights: there's no direct evidence shown that the block-wise masking actually produces cleaner, non-interfering representations or that it improves the forecasts in a measurable way. The abstract gives the design and the final scores but skips ablations on the masking, attention maps, or similarity metrics between representations with and without the mask. If the full paper has those, great, but from the description it seems the central mechanism lacks that validation. This makes it harder to credit the performance specifically to the disentanglement rather than other elements like the diffusion model. For readers, this is relevant to the robotics community working on VLA models, especially those focused on better generalization through world modeling. Someone looking for new architectural ideas in manipulation policies would find value here. It deserves a serious referee because the results are competitive and the framework is novel enough to warrant detailed review, though revisions would likely focus on strengthening the empirical support for the key components. I recommend putting it through peer review.

Referee Report

2 major / 1 minor

Summary. The paper proposes DreamVLA, a vision-language-action model that integrates dynamic-region-guided world knowledge prediction (combined with spatial and semantic cues) into a perception-prediction-action loop for robot manipulation. It employs block-wise structured attention to mask cross-block interactions and thereby disentangle dynamic, spatial, and semantic representations, followed by a diffusion transformer to model conditional action distributions. Reported results include a 76.7% success rate on real-robot tasks and 4.44 average length on the CALVIN ABC-D benchmark.

Significance. If the performance claims are reproducible and the architectural contributions are isolated, the work could advance VLA models by replacing redundant image-based forecasting with compact, knowledge-rich representations that align with human-like multimodal reasoning. The emphasis on preventing representation interference via structured masking is a potentially useful design principle for multi-cue prediction in robotics.

major comments (2)

[Abstract] Abstract: The headline results (76.7% real-robot success, 4.44 CALVIN length) are attributed to dynamic-region-guided world-knowledge prediction and block-wise attention, yet the manuscript supplies no ablations, error bars, or baseline comparisons that isolate these components from the diffusion transformer or standard VLA backbones.
[Abstract] Abstract: The central claim that block-wise structured attention 'prevents information leakage and keeps each representation clean and disentangled' is load-bearing for the method, but no supporting evidence—such as attention-map visualizations, cosine-similarity metrics between blocks, or ablation results showing degraded forecasts when the mask is removed—is referenced.

minor comments (1)

The abstract refers to 'extensive experiments' without specifying trial counts, robot hardware details, or statistical significance tests for the reported success rates.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We agree that stronger isolation of the proposed components is needed and will revise the manuscript to include the requested ablations, error bars, visualizations, and quantitative metrics. Below we respond point by point.

read point-by-point responses

Referee: [Abstract] Abstract: The headline results (76.7% real-robot success, 4.44 CALVIN length) are attributed to dynamic-region-guided world-knowledge prediction and block-wise attention, yet the manuscript supplies no ablations, error bars, or baseline comparisons that isolate these components from the diffusion transformer or standard VLA backbones.

Authors: We acknowledge the need for explicit isolation. The full manuscript already contains comparisons against several VLA baselines (e.g., RT-1, Octo, and diffusion-based variants), but these do not fully ablate the dynamic-region guidance or the block-wise mask. In the revision we will add (i) an ablation removing dynamic-region guidance while keeping the rest of the architecture fixed, (ii) an ablation replacing block-wise attention with standard cross-attention, and (iii) error bars computed over three random seeds for both real-robot and CALVIN results. These tables will be placed in the Experiments section and referenced from the abstract. revision: yes
Referee: [Abstract] Abstract: The central claim that block-wise structured attention 'prevents information leakage and keeps each representation clean and disentangled' is load-bearing for the method, but no supporting evidence—such as attention-map visualizations, cosine-similarity metrics between blocks, or ablation results showing degraded forecasts when the mask is removed—is referenced.

Authors: We agree that direct empirical support for the disentanglement claim is currently insufficient. In the revised manuscript we will add: (1) attention-map visualizations for the dynamic, spatial, and semantic blocks before and after masking, (2) cosine-similarity matrices computed between the three block outputs across multiple layers, and (3) a quantitative ablation that removes the block-wise mask and reports the resulting drop in world-knowledge prediction accuracy and downstream task success. These results will be presented in a new subsection of the Method or Experiments section. revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper's central claims rest on empirical success rates (76.7% real-robot, 4.44 CALVIN length) obtained from experiments rather than any self-referential derivation. No equations, fitted parameters, or predictions are shown that reduce by construction to inputs. Architectural elements such as block-wise structured attention and dynamic-region-guided forecasting are presented as design choices without load-bearing self-citations or uniqueness theorems imported from prior author work. The perception-prediction-action loop is described at a high level but does not collapse into tautology; performance is externally validated on benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Abstract-only review limits visibility into parameters and assumptions; the main unstated premises concern the effectiveness of the proposed disentanglement and forecasting modules.

axioms (1)

domain assumption Block-wise structured attention prevents information leakage between dynamic, spatial, and semantic streams
Invoked to keep representations clean and disentangled during training.

pith-pipeline@v0.9.0 · 5584 in / 1188 out tokens · 41658 ms · 2026-05-16T15:38:13.929583+00:00 · methodology

discussion (0)

Forward citations

Cited by 24 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Overcoming Dynamics-Blindness: Training-Free Pace-and-Path Correction for VLA Models
cs.RO 2026-05 unverdicted novelty 7.0

Pace-and-Path Correction decomposes a quadratic cost minimization into orthogonal pace and path channels to correct chunked actions in VLA models, raising success rates by up to 28.8% in dynamic settings.
ALAM: Algebraically Consistent Latent Action Model for Vision-Language-Action Models
cs.RO 2026-05 unverdicted novelty 7.0

ALAM creates algebraically consistent latent action transitions from videos to act as auxiliary generative targets, raising robot policy success rates from 47.9% to 85.0% on MetaWorld MT50 and 94.1% to 98.1% on LIBERO.
One Token Per Frame: Reconsidering Visual Bandwidth in World Models for VLA Policy
cs.CV 2026-05 conditional novelty 7.0

Reducing visual input to one token per frame in VLA world models maintains or improves long-horizon performance on MetaWorld, LIBERO, and real-robot tasks.
CF-VLA: Efficient Coarse-to-Fine Action Generation for Vision-Language-Action Policies
cs.CV 2026-04 unverdicted novelty 7.0

CF-VLA uses a coarse initialization over endpoint velocity followed by single-step refinement to achieve strong performance with low inference steps on CALVIN, LIBERO, and real-robot tasks.
Learning Vision-Language-Action World Models for Autonomous Driving
cs.CV 2026-04 unverdicted novelty 7.0

VLA-World improves autonomous driving by using action-guided future image generation followed by reflective reasoning over the imagined scene to refine trajectories.
VP-VLA: Visual Prompting as an Interface for Vision-Language-Action Models
cs.RO 2026-03 unverdicted novelty 7.0

VP-VLA decouples high-level reasoning from low-level control in VLA models by rendering spatial anchors as visual prompts directly in the RGB observation space, outperforming end-to-end baselines.
Towards Generalizable Robotic Manipulation in Dynamic Environments
cs.CV 2026-03 unverdicted novelty 7.0

DOMINO dataset and PUMA architecture enable better dynamic robotic manipulation by incorporating motion history, delivering 6.3% higher success rates than prior VLA models.
Learning Physics from Pretrained Video Models: A Multimodal Continuous and Sequential World Interaction Models for Robotic Manipulation
cs.RO 2026-02 unverdicted novelty 7.0

PhysGen uses video models to learn physics for robots, outperforming baselines by up to 13.8% on Libero and matching specialized models in real-world tasks.
Overcoming Dynamics-Blindness: Training-Free Pace-and-Path Correction for VLA Models
cs.RO 2026-05 unverdicted novelty 6.0

Pace-and-Path Correction is a closed-form inference-time operator that decomposes a quadratic cost minimization into orthogonal pace compression and path offset channels to correct dynamics-blindness in chunked-action...
PriorVLA: Prior-Preserving Adaptation for Vision-Language-Action Models
cs.RO 2026-05 unverdicted novelty 6.0

PriorVLA preserves pretrained priors in VLA models through a frozen Prior Expert and trained Adaptation Expert, delivering better robot manipulation performance than full fine-tuning with only 25% of the parameter updates.
ALAM: Algebraically Consistent Latent Action Model for Vision-Language-Action Models
cs.RO 2026-05 unverdicted novelty 6.0

ALAM introduces algebraic consistency regularization on latent action transitions from videos, raising VLA success rates from 47.9% to 85.0% on MetaWorld MT50 and 94.1% to 98.1% on LIBERO.
One Token Per Frame: Reconsidering Visual Bandwidth in World Models for VLA Policy
cs.CV 2026-05 unverdicted novelty 6.0

Reducing visual input to one token per frame in world models for vision-language-action policies maintains long-horizon performance while improving success rates on MetaWorld, LIBERO, and real-robot tasks.
One Token Per Frame: Reconsidering Visual Bandwidth in World Models for VLA Policy
cs.CV 2026-05 unverdicted novelty 6.0

Reducing visual input to one token per frame via adaptive attention pooling and a unified flow-matching objective improves long-horizon performance in VLA policies on MetaWorld, LIBERO, and real-robot tasks.
Escaping the Diversity Trap in Robotic Manipulation via Anchor-Centric Adaptation
cs.RO 2026-05 unverdicted novelty 6.0

Anchor-Centric Adaptation escapes the diversity trap by prioritizing repeated demonstrations at core anchors over broad coverage, yielding higher success rates under fixed data budgets in robotic manipulation.
Unified 4D World Action Modeling from Video Priors with Asynchronous Denoising
cs.RO 2026-04 unverdicted novelty 6.0

X-WAM unifies real-time robotic action execution with high-fidelity 4D world synthesis by adapting video diffusion priors through lightweight depth branches and asynchronous noise sampling, achieving 79-91% success on...
Unified 4D World Action Modeling from Video Priors with Asynchronous Denoising
cs.RO 2026-04 unverdicted novelty 6.0

X-WAM unifies robotic action execution and 4D world synthesis by adapting video diffusion priors with a lightweight depth branch and asynchronous noise sampling, achieving 79-91% success on robot benchmarks.
CorridorVLA: Explicit Spatial Constraints for Generative Action Heads via Sparse Anchors
cs.RO 2026-04 unverdicted novelty 6.0

CorridorVLA improves VLA models by using predicted sparse anchors to impose explicit spatial corridors on action trajectories, yielding 3.4-12.4% success rate gains on LIBERO-Plus with GR00T-Corr reaching 83.21%.
OFlow: Injecting Object-Aware Temporal Flow Matching for Robust Robotic Manipulation
cs.RO 2026-04 unverdicted novelty 6.0

OFlow unifies temporal foresight and object-aware reasoning inside a shared latent space via flow matching to improve VLA robustness in robotic manipulation under distribution shifts.
Robotic Manipulation is Vision-to-Geometry Mapping ($f(v) \rightarrow G$): Vision-Geometry Backbones over Language and Video Models
cs.RO 2026-04 unverdicted novelty 6.0

Vision-geometry backbones using pretrained 3D world models outperform vision-language and video models for robotic manipulation by enabling direct mapping from visual input to geometric actions.
Learning Long-term Motion Embeddings for Efficient Kinematics Generation
cs.CV 2026-04 unverdicted novelty 6.0

A 64x temporally compressed motion embedding learned from trackers enables efficient conditional flow-matching generation of long-term motions that outperform video models and task-specific methods.
Fast-WAM: Do World Action Models Need Test-time Future Imagination?
cs.CV 2026-03 unverdicted novelty 6.0

Fast-WAM shows that explicit future imagination at test time is not required for strong WAM performance; video modeling during training provides the main benefit.
PALM: Progress-Aware Policy Learning via Affordance Reasoning for Long-Horizon Robotic Manipulation
cs.RO 2026-01 unverdicted novelty 6.0

PALM improves long-horizon robotic manipulation success by distilling affordance representations for object interaction and predicting within-subtask progress in a VLA model.
Learning to Feel the Future: DreamTacVLA for Contact-Rich Manipulation
cs.RO 2025-12 unverdicted novelty 6.0

DreamTacVLA grounds VLA models in contact physics by aligning multi-scale vision-tactile inputs and predicting future tactile states, reaching up to 95% success on contact-rich tasks.
Can Explicit Physical Feasibility Benefit VLA Learning? An Empirical Study
cs.LG 2026-04 unverdicted novelty 5.0

Explicit geometry-based feasibility supervision added to diffusion VLA training leads to better physical reliability, task success, and faster learning with limited data in manipulation tasks.

Reference graph

Works this paper leans on

147 extracted references · 147 canonical work pages · cited by 19 Pith papers · 35 internal anchors

[1]

OpenVLA: An Open-Source Vision-Language-Action Model

Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan Foster, Grace Lam, Pannag Sanketi, et al. Openvla: An open-source vision-language-action model. arXiv preprint arXiv:2406.09246, 2024. 1, 3, 7, 8, 9, 28

work page internal anchor Pith review Pith/arXiv arXiv 2024
[2]

Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Joseph Dabis, Chelsea Finn, Keerthana Gopalakrishnan, Karol Hausman, Alexander Herzog, Jasmine Hsu, Julian Ibarz, Brian Ichter, Alex Irpan, Tomas Jackson, Sally Jesmonth, Nikhil J. Joshi, Ryan Julian, Dmitry Kalashnikov, Yuheng Kuang, Isabel Leal, Kuang-Huei Lee, Sergey Levine, Yao Lu, Utsav M...

work page 2023
[3]

Video language planning.arXiv preprint arXiv:2310.10625, 2023

Yilun Du, Mengjiao Yang, Pete Florence, Fei Xia, Ayzaan Wahid, Brian Ichter, Pierre Sermanet, Tianhe Yu, Pieter Abbeel, Joshua B Tenenbaum, et al. Video language planning.arXiv preprint arXiv:2310.10625, 2023

work page arXiv 2023
[4]

Embodiedgpt: Vision-language pre-training via embodied chain of thought

Yao Mu, Qinglong Zhang, Mengkang Hu, Wenhai Wang, Mingyu Ding, Jun Jin, Bin Wang, Jifeng Dai, Yu Qiao, and Ping Luo. Embodiedgpt: Vision-language pre-training via embodied chain of thought. Advances in Neural Information Processing Systems , 36, 2024

work page 2024
[5]

Robotic Control via Embodied Chain-of-Thought Reasoning

Zawalski Michał, Chen William, Pertsch Karl, Mees Oier, Finn Chelsea, and Levine Sergey. Robotic control via embodied chain-of-thought reasoning. arXiv preprint arXiv:2407.08693,

work page internal anchor Pith review Pith/arXiv arXiv
[6]

Learning manipulation skills through robot chain-of-thought with sparse failure guidance

Kaifeng Zhang, Zhao-Heng Yin, Weirui Ye, and Yang Gao. Learning manipulation skills through robot chain-of-thought with sparse failure guidance. arXiv preprint arXiv:2405.13573, 2024

work page arXiv 2024
[7]

Robotwin: Dual-arm robot benchmark with generative digital twins (early version)

Yao Mu, Tianxing Chen, Shijia Peng, Zanxin Chen, Zeyu Gao, Yude Zou, Lunkai Lin, Zhiqiang Xie, and Ping Luo. Robotwin: Dual-arm robot benchmark with generative digital twins (early version). In European Conference on Computer Vision, pages 264–273. Springer, 2025

work page 2025
[8]

Scissorbot: Learning generalizable scissor skill for paper cutting via simulation, imitation, and sim2real

Jiangran Lyu, Yuxing Chen, Tao Du, Feng Zhu, Huiquan Liu, Yizhou Wang, and He Wang. Scissorbot: Learning generalizable scissor skill for paper cutting via simulation, imitation, and sim2real. arXiv preprint arXiv:2409.13966, 2024

work page arXiv 2024
[9]

Gapartmanip: A large-scale part-centric dataset for material-agnostic articulated object manipulation

Wenbo Cui, Chengyang Zhao, Songlin Wei, Jiazhao Zhang, Haoran Geng, Yaran Chen, Haoran Li, and He Wang. Gapartmanip: A large-scale part-centric dataset for material-agnostic articulated object manipulation. arXiv preprint arXiv:2411.18276, 2024

work page arXiv 2024
[10]

Theia: Distilling diverse vision foundation models for robot learning

Jinghuan Shang, Karl Schmeckpeper, Brandon B May, Maria Vittoria Minniti, Tarik Keleste- mur, David Watkins, and Laura Herlant. Theia: Distilling diverse vision foundation models for robot learning. arXiv preprint arXiv:2407.20179, 2024

work page arXiv 2024
[11]

Dexvlg: Dexterous vision-language-grasp model at scale

Jiawei He, Danshi Li, Xinqiang Yu, Zekun Qi, Wenyao Zhang, Jiayi Chen, Zhaoxiang Zhang, Zhizheng Zhang, Li Yi, and He Wang. Dexvlg: Dexterous vision-language-grasp model at scale. arXiv preprint arXiv:2507.02747, 2025. 1

work page arXiv 2025
[12]

Open X-Embodiment: Robotic Learning Datasets and RT-X Models

Abby O’Neill, Abdul Rehman, Abhinav Gupta, Abhiram Maddukuri, Abhishek Gupta, Ab- hishek Padalkar, Abraham Lee, Acorn Pooley, Agrim Gupta, Ajay Mandlekar, et al. Open x-embodiment: Robotic learning datasets and rt-x models. arXiv preprint arXiv:2310.08864,

work page internal anchor Pith review Pith/arXiv arXiv
[13]

Octo: An Open-Source Generalist Robot Policy

Octo Model Team, Dibya Ghosh, Homer Walke, Karl Pertsch, Kevin Black, Oier Mees, Sudeep Dasari, Joey Hejna, Tobias Kreiman, Charles Xu, et al. Octo: An open-source generalist robot policy. arXiv preprint arXiv:2405.12213, 2024. 1, 3, 8, 9

work page internal anchor Pith review Pith/arXiv arXiv 2024
[14]

Unleashing large-scale video generative pre-training for visual robot manipulation

Hongtao Wu, Ya Jing, Chilam Cheang, Guangzeng Chen, Jiafeng Xu, Xinghang Li, Minghuan Liu, Hang Li, and Tao Kong. Unleashing large-scale video generative pre-training for visual robot manipulation. In The Twelfth International Conference on Learning Representations . 3, 7, 8, 25

work page
[15]

PaLM-E: An Embodied Multimodal Language Model

Danny Driess, Fei Xia, Mehdi SM Sajjadi, Corey Lynch, Aakanksha Chowdhery, Brian Ichter, Ayzaan Wahid, Jonathan Tompson, Quan Vuong, Tianhe Yu, et al. Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378, 2023. 1

work page internal anchor Pith review Pith/arXiv arXiv 2023
[16]

Cliport: What and where pathways for robotic manipulation

Mohit Shridhar, Lucas Manuelli, and Dieter Fox. Cliport: What and where pathways for robotic manipulation. In Conference on robot learning, pages 894–906. PMLR, 2022. 3

work page 2022
[17]

Data scaling laws in imitation learning for robotic manipulation

Fanqi Lin, Yingdong Hu, Pingyue Sheng, Chuan Wen, Jiacheng You, and Yang Gao. Data scaling laws in imitation learning for robotic manipulation. arXiv preprint arXiv:2410.18647, 2024

work page arXiv 2024
[18]

GR-2: A Generative Video-Language-Action Model with Web-Scale Knowledge for Robot Manipulation

Chi-Lam Cheang, Guangzeng Chen, Ya Jing, Tao Kong, Hang Li, Yifeng Li, Yuxiao Liu, Hongtao Wu, Jiafeng Xu, Yichu Yang, et al. Gr-2: A generative video-language-action model with web-scale knowledge for robot manipulation. arXiv preprint arXiv:2410.06158, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[19]

Gen2Act: Human Video Generation in Novel Scenarios enables Generalizable Robot Manipulation

Homanga Bharadhwaj, Debidatta Dwibedi, Abhinav Gupta, Shubham Tulsiani, Carl Doer- sch, Ted Xiao, Dhruv Shah, Fei Xia, Dorsa Sadigh, and Sean Kirmani. Gen2act: Human video generation in novel scenarios enables generalizable robot manipulation. arXiv preprint arXiv:2409.16283, 2024. 3

work page internal anchor Pith review Pith/arXiv arXiv 2024
[20]

Zhao, and Chelsea Finn

Zipeng Fu, Tony Z. Zhao, and Chelsea Finn. Mobile aloha: Learning bimanual mobile manipulation with low-cost whole-body teleoperation. In Conference on Robot Learning (CoRL), 2024. 12

work page 2024
[21]

Dywa: Dynamics-adaptive world action model for generalizable non-prehensile manipulation

Jiangran Lyu, Ziming Li, Xuesong Shi, Chaoyi Xu, Yizhou Wang, and He Wang. Dywa: Dynamics-adaptive world action model for generalizable non-prehensile manipulation. arXiv preprint arXiv:2503.16806, 2025. 3

work page arXiv 2025
[22]

Sofar: Language-grounded orientation bridges spatial reasoning and object manipulation

Zekun Qi, Wenyao Zhang, Yufei Ding, Runpei Dong, Xinqiang Yu, Jingwen Li, Lingyun Xu, Baoyu Li, Xialin He, Guofan Fan, et al. Sofar: Language-grounded orientation bridges spatial reasoning and object manipulation. arXiv preprint arXiv:2502.13143, 2025. 2, 11, 27, 28

work page arXiv 2025
[23]

Learning getting-up policies for real-world humanoid robots

Xialin He, Runpei Dong, Zixuan Chen, and Saurabh Gupta. Learning getting-up policies for real-world humanoid robots. arXiv preprint arXiv:2502.12152, 2025

work page arXiv 2025
[24]

Unified Video Action Model

Shuang Li, Yihuai Gao, Dorsa Sadigh, and Shuran Song. Unified video action model. arXiv preprint arXiv:2503.00200, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[25]

Gamma: Graspability-aware mobile manipulation policy learning based on online grasping pose fusion

Jiazhao Zhang, Nandiraju Gireesh, Jilong Wang, Xiaomeng Fang, Chaoyi Xu, Weiguang Chen, Liu Dai, and He Wang. Gamma: Graspability-aware mobile manipulation policy learning based on online grasping pose fusion. In 2024 IEEE International Conference on Robotics and Automation (ICRA), pages 1399–1405. IEEE, 2024. 1

work page 2024
[26]

Visual instruction tuning

Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. Advances in neural information processing systems , 36:34892–34916, 2023. 1, 3

work page 2023
[27]

Prismatic vlms: Investigating the design space of visually-conditioned language models

Siddharth Karamcheti, Suraj Nair, Ashwin Balakrishna, Percy Liang, Thomas Kollar, and Dorsa Sadigh. Prismatic vlms: Investigating the design space of visually-conditioned language models. arXiv preprint arXiv:2402.07865, 2024

work page arXiv 2024
[28]

Gpt-4v(ision) system card, 2023

OpenAI. Gpt-4v(ision) system card, 2023. URL https://openai.com/research/ gpt-4v-system-card . 3

work page 2023
[29]

PaliGemma: A versatile 3B VLM for transfer

Lucas Beyer, Andreas Steiner, André Susano Pinto, Alexander Kolesnikov, Xiao Wang, Daniel Salz, Maxim Neumann, Ibrahim Alabdulmohsin, Michael Tschannen, Emanuele Bugliarello, et al. Paligemma: A versatile 3b vlm for transfer. arXiv preprint arXiv:2407.07726, 2024. 1

work page internal anchor Pith review Pith/arXiv arXiv 2024
[30]

Vision-language foundation models as effective robot imitators

Xinghang Li, Minghuan Liu, Hanbo Zhang, Cunjun Yu, Jie Xu, Hongtao Wu, Chilam Cheang, Ya Jing, Weinan Zhang, Huaping Liu, et al. Vision-language foundation models as effective robot imitators. In The Twelfth International Conference on Learning Representations . 1, 3, 7, 8, 28

work page
[31]

Llarva: Vision-action instruction tuning enhances robot learning

Dantong Niu, Yuvan Sharma, Giscard Biamby, Jerome Quenum, Yutong Bai, Baifeng Shi, Trevor Darrell, and Roei Herzig. Llarva: Vision-action instruction tuning enhances robot learning. In 8th Annual Conference on Robot Learning , 2024

work page 2024
[32]

$\pi_0$: A Vision-Language-Action Flow Model for General Robot Control

Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, et al. pi0: A vision-language-action flow model for general robot control. arXiv preprint arXiv:2410.24164, 2024. 1, 3, 7

work page internal anchor Pith review Pith/arXiv arXiv 2024
[33]

CogACT: A Foundational Vision-Language-Action Model for Synergizing Cognition and Action in Robotic Manipulation

Qixiu Li, Yaobo Liang, Zeyu Wang, Lin Luo, Xi Chen, Mozheng Liao, Fangyun Wei, Yu Deng, Sicheng Xu, Yizhong Zhang, et al. Cogact: A foundational vision-language-action model for synergizing cognition and action in robotic manipulation. arXiv preprint arXiv:2411.19650,

work page internal anchor Pith review Pith/arXiv arXiv
[34]

Showui: One vision-language-action model for gui visual agent

Kevin Qinghong Lin, Linjie Li, Difei Gao, Zhengyuan Yang, Shiwei Wu, Zechen Bai, Weixian Lei, Lijuan Wang, and Mike Zheng Shou. Showui: One vision-language-action model for gui visual agent. arXiv preprint arXiv:2411.17465, 2024. 3

work page arXiv 2024
[35]

Tinyvla: Towards fast, data-efficient vision-language-action models for robotic manipulation

Junjie Wen, Yichen Zhu, Jinming Li, Minjie Zhu, Zhibin Tang, Kun Wu, Zhiyuan Xu, Ning Liu, Ran Cheng, Chaomin Shen, et al. Tinyvla: Towards fast, data-efficient vision-language-action models for robotic manipulation. IEEE Robotics and Automation Letters , 2025. 3

work page 2025
[36]

SpatialVLA: Exploring Spatial Representations for Visual-Language-Action Model

Delin Qu, Haoming Song, Qizhi Chen, Yuanqi Yao, Xinyi Ye, Yan Ding, Zhigang Wang, JiaYuan Gu, Bin Zhao, Dong Wang, and Xuelong Li. Spatialvla: Exploring spatial representa- tions for visual-language-action model, 2025. URL https://arxiv.org/abs/2501.15830. 8 13

work page internal anchor Pith review Pith/arXiv arXiv 2025
[37]

Towards generalist robot policies: What matters in building vision-language-action models

Xinghang Li, Peiyan Li, Minghuan Liu, Dong Wang, Jirong Liu, Bingyi Kang, Xiao Ma, Tao Kong, Hanbo Zhang, and Huaping Liu. Towards generalist robot policies: What matters in building vision-language-action models. arXiv preprint arXiv:2412.14058, 2024. 3, 7, 8, 29

work page arXiv 2024
[38]

Flower: Democratizing generalist robot policies with efficient vision-language- action flow policies

Moritz Reuss, Hongyi Zhou, Marcel Rühle, Ömer Erdinç Ya˘gmurlu, Fabian Otto, and Rudolf Lioutikov. Flower: Democratizing generalist robot policies with efficient vision-language- action flow policies. In 7th Robot Learning Workshop: Towards Robots with Human-Level Abilities. 3

work page
[39]

GR00T N1: An Open Foundation Model for Generalist Humanoid Robots

Johan Bjorck, Fernando Castañeda, Nikita Cherniadev, Xingye Da, Runyu Ding, Linxi Fan, Yu Fang, Dieter Fox, Fengyuan Hu, Spencer Huang, et al. Gr00t n1: An open foundation model for generalist humanoid robots. arXiv preprint arXiv:2503.14734, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[40]

SmolVLA: A Vision-Language-Action Model for Affordable and Efficient Robotics

Mustafa Shukor, Dana Aubakirova, Francesco Capuano, Pepijn Kooijmans, Steven Palma, Adil Zouitine, Michel Aractingi, Caroline Pascal, Martino Russi, Andres Marafioti, et al. Smolvla: A vision-language-action model for affordable and efficient robotics. arXiv preprint arXiv:2506.01844, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[41]

TraceVLA: Visual Trace Prompting Enhances Spatial-Temporal Awareness for Generalist Robotic Policies

Ruijie Zheng, Yongyuan Liang, Shuaiyi Huang, Jianfeng Gao, Hal Daumé III, Andrey Kolobov, Furong Huang, and Jianwei Yang. Tracevla: Visual trace prompting enhances spatial-temporal awareness for generalist robotic policies. arXiv preprint arXiv:2412.10345, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[42]

FAST: Efficient Action Tokenization for Vision-Language-Action Models

Karl Pertsch, Kyle Stachowicz, Brian Ichter, Danny Driess, Suraj Nair, Quan Vuong, Oier Mees, Chelsea Finn, and Sergey Levine. Fast: Efficient action tokenization for vision-language-action models. arXiv preprint arXiv:2501.09747, 2025. 1

work page internal anchor Pith review Pith/arXiv arXiv 2025
[43]

Learning universal policies via text-guided video generation

Yilun Du, Sherry Yang, Bo Dai, Hanjun Dai, Ofir Nachum, Josh Tenenbaum, Dale Schuurmans, and Pieter Abbeel. Learning universal policies via text-guided video generation. Advances in Neural Information Processing Systems, 36, 2024. 1, 3

work page 2024
[44]

3D-VLA: A 3D Vision-Language-Action Generative World Model

Haoyu Zhen, Xiaowen Qiu, Peihao Chen, Jincheng Yang, Xin Yan, Yilun Du, Yining Hong, and Chuang Gan. 3d-vla: A 3d vision-language-action generative world model. arXiv preprint arXiv:2403.09631, 2024. 1, 3, 28

work page internal anchor Pith review Pith/arXiv arXiv 2024
[45]

Pivot: Iterative visual prompting elicits actionable knowledge for vlms

Soroush Nasiriany, Fei Xia, Wenhao Yu, Ted Xiao, Jacky Liang, Ishita Dasgupta, Annie Xie, Danny Driess, Ayzaan Wahid, Zhuo Xu, et al. Pivot: Iterative visual prompting elicits actionable knowledge for vlms. In International Conference on Machine Learning , pages 37321–37341. PMLR, 2024

work page 2024
[46]

Rt-trajectory: Robotic task generalization via hindsight trajectory sketches

Jiayuan Gu, Sean Kirmani, Paul Wohlhart, Yao Lu, Montserrat Gonzalez Arenas, Kanishka Rao, Wenhao Yu, Chuyuan Fu, Keerthana Gopalakrishnan, Zhuo Xu, et al. Rt-trajectory: Robotic task generalization via hindsight trajectory sketches. In The Twelfth International Conference on Learning Representations

work page
[47]

Pivot-r: Primitive-driven waypoint-aware world model for robotic manipulation

Kaidong Zhang, Pengzhen Ren, Bingqian Lin, Junfan Lin, Shikui Ma, Hang Xu, and Xiaodan Liang. Pivot-r: Primitive-driven waypoint-aware world model for robotic manipulation. In The Thirty-eighth Annual Conference on Neural Information Processing Systems

work page
[48]

Any-point trajectory modeling for policy learning

Chuan Wen, Xingyu Lin, John So, Kai Chen, Qi Dou, Yang Gao, and Pieter Abbeel. Any-point trajectory modeling for policy learning. arXiv preprint arXiv:2401.00025, 2023. 3

work page arXiv 2023
[49]

Video Prediction Policy: A Generalist Robot Policy with Predictive Visual Representations

Yucheng Hu, Yanjiang Guo, Pengchao Wang, Xiaoyu Chen, Yen-Jen Wang, Jianke Zhang, Koushil Sreenath, Chaochao Lu, and Jianyu Chen. Video prediction policy: A generalist robot policy with predictive visual representations. arXiv preprint arXiv:2412.14803, 2024. 3, 7, 8

work page internal anchor Pith review Pith/arXiv arXiv 2024
[50]

Efficient robotic policy learning via latent space backward planning

Dongxiu Liu, Haoyi Niu, Zhihao Wang, Jinliang Zheng, Yinan Zheng, Zhonghong Ou, Jian- ming Hu, Jianxiong Li, and Xianyuan Zhan. Efficient robotic policy learning via latent space backward planning. arXiv preprint arXiv:2505.06861, 2025. 3

work page arXiv 2025
[51]

Pixel motion as universal representation for robot control

Kanchana Ranasinghe, Xiang Li, Cristina Mata, Jongwoo Park, and Michael S Ryoo. Pixel motion as universal representation for robot control. arXiv preprint arXiv:2505.07817, 2025. 3 14

work page arXiv 2025
[52]

Symbolically-guided visual plan inference from uncurated video data

Wenyan Yang, Ahmet Tikna, Yi Zhao, Yuying Zhang, Luigi Palopoli, Marco Roveri, and Joni Pajarinen. Symbolically-guided visual plan inference from uncurated video data. arXiv preprint arXiv:2505.08444, 2025

work page arXiv 2025
[53]

DreamGen: Unlocking Generalization in Robot Learning through Video World Models

Joel Jang, Seonghyeon Ye, Zongyu Lin, Jiannan Xiang, Johan Bjorck, Yu Fang, Fengyuan Hu, Spencer Huang, Kaushil Kundalia, Yen-Chen Lin, et al. Dreamgen: Unlocking generalization in robot learning through neural trajectories. arXiv preprint arXiv:2505.12705, 2025. 3

work page internal anchor Pith review Pith/arXiv arXiv 2025
[54]

Ladi- wm: A latent diffusion-based world model for predictive manipulation

Yuhang Huang, JIazhao Zhang, Shilong Zou, XInwang Liu, Ruizhen Hu, and Kai Xu. Ladi- wm: A latent diffusion-based world model for predictive manipulation. arXiv preprint arXiv:2505.11528, 2025

work page arXiv 2025
[55]

Tra-moe: Learning trajectory prediction model from multiple domains for adaptive policy conditioning

Jiange Yang, Haoyi Zhu, Yating Wang, Gangshan Wu, Tong He, and Limin Wang. Tra-moe: Learning trajectory prediction model from multiple domains for adaptive policy conditioning. ArXiv, abs/2411.14519, 2024. 1

work page arXiv 2024
[56]

Predictive inverse dynamics models are scalable learners for robotic manipulation

Yang Tian, Sizhe Yang, Jia Zeng, Ping Wang, Dahua Lin, Hao Dong, and Jiangmiao Pang. Predictive inverse dynamics models are scalable learners for robotic manipulation. Int. Conf. Learn. Represent. (ICLR), 2024. 1, 3, 7, 8, 9, 25

work page 2024
[57]

Up-vla: A unified understanding and prediction model for embodied agent

Jianke Zhang, Yanjiang Guo, Yucheng Hu, Xiaoyu Chen, Xiang Zhu, and Jianyu Chen. Up-vla: A unified understanding and prediction model for embodied agent. arXiv preprint arXiv:2501.18867, 2025. 7, 8, 25

work page arXiv 2025
[58]

CoT-VLA: Visual Chain-of-Thought Reasoning for Vision-Language-Action Models

Qingqing Zhao, Yao Lu, Moo Jin Kim, Zipeng Fu, Zhuoyang Zhang, Yecheng Wu, Zhaoshuo Li, Qianli Ma, Song Han, Chelsea Finn, et al. Cot-vla: Visual chain-of-thought reasoning for vision-language-action models. arXiv preprint arXiv:2503.22020, 2025. 2, 3, 8

work page internal anchor Pith review Pith/arXiv arXiv 2025
[59]

Gripper keypose and object pointflow as interfaces for bimanual robotic manipulation

Yuyin Yang, Zetao Cai, Yang Tian, Jia Zeng, and Jiangmiao Pang. Gripper keypose and object pointflow as interfaces for bimanual robotic manipulation. arXiv preprint arXiv:2504.17784, 2025

work page arXiv 2025
[60]

Unified World Models: Coupling Video and Action Diffusion for Pretraining on Large Robotic Datasets

Chuning Zhu, Raymond Yu, Siyuan Feng, Benjamin Burchfiel, Paarth Shah, and Abhishek Gupta. Unified world models: Coupling video and action diffusion for pretraining on large robotic datasets. arXiv preprint arXiv:2504.02792, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[61]

Reinbot: Amplifying robot visual-language manipulation with reinforcement learning

Hongyin Zhang, Zifeng Zhuang, Han Zhao, Pengxiang Ding, Hongchao Lu, and Donglin Wang. Reinbot: Amplifying robot visual-language manipulation with reinforcement learning. arXiv preprint arXiv:2505.07395, 2025. 1, 3

work page arXiv 2025
[62]

Chain-of-thought prompting elicits reasoning in large language models

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems , 2022. 2, 3

work page 2022
[63]

Depth anything: Unleashing the power of large-scale unlabeled data

Lihe Yang, Bingyi Kang, Zilong Huang, Xiaogang Xu, Jiashi Feng, and Hengshuang Zhao. Depth anything: Unleashing the power of large-scale unlabeled data. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages 10371–10381,

work page
[64]

Depth anything v2

Lihe Yang, Bingyi Kang, Zilong Huang, Zhen Zhao, Xiaogang Xu, Jiashi Feng, and Heng- shuang Zhao. Depth anything v2. Advances in Neural Information Processing Systems , 37: 21875–21911, 2024. 3, 6, 24

work page 2024
[65]

Shapellm: Universal 3d object understanding for embodied interaction

Zekun Qi, Runpei Dong, Shaochen Zhang, Haoran Geng, Chunrui Han, Zheng Ge, Li Yi, and Kaisheng Ma. Shapellm: Universal 3d object understanding for embodied interaction. In Computer Vision - ECCV 2024 - 18th European Conference, Milan, Italy, September 29-October 4, 2024, Proceedings, Part XLIII , volume 15101 of Lecture Notes in Computer Science, pages 21...

work page 2024
[66]

Contrast with reconstruct: Contrastive 3d representation learning guided by generative pre- training

Zekun Qi, Runpei Dong, Guofan Fan, Zheng Ge, Xiangyu Zhang, Kaisheng Ma, and Li Yi. Contrast with reconstruct: Contrastive 3d representation learning guided by generative pre- training. In Int. Conf. Mach. Learn. (ICML) , 2023. 2, 3, 5, 6, 11 15

work page 2023
[67]

Cotracker: It is better to track together

Nikita Karaev, Ignacio Rocco, Benjamin Graham, Natalia Neverova, Andrea Vedaldi, and Christian Rupprecht. Cotracker: It is better to track together. In European Conference on Computer Vision, pages 18–35. Springer, 2024. 2, 5, 23

work page 2024
[68]

Cotracker3: Simpler and better point tracking by pseudo-labelling real videos

Nikita Karaev, Iurii Makarov, Jianyuan Wang, Natalia Neverova, Andrea Vedaldi, and Christian Rupprecht. Cotracker3: Simpler and better point tracking by pseudo-labelling real videos. arXiv preprint arXiv:2410.11831, 2024. 2, 5

work page arXiv 2024
[69]

Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy V . V o, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, Mido Assran, Nicolas Ballas, Wojciech Galuba, Russell Howes, Po-Yao Huang, Shang-Wen Li, Ishan Misra, Michael Rabbat, Vasu Sharma, Gabriel Synnaeve, Hu Xu, Hervé Jégou, Julien Mairal, Patrick L...

work page 2024
[70]

Berg, Wan-Yen Lo, Piotr Dollár, and Ross B

Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloé Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C. Berg, Wan-Yen Lo, Piotr Dollár, and Ross B. Girshick. Segment anything. In IEEE/CVF International Conference on Computer Vision, ICCV 2023, Paris, France, October 1-6, 2023 , pages 3992–4003. IEEE, 2023. 2, 4, 6, 22, 24

work page 2023
[71]

A Generalist Agent

Scott Reed, Konrad Zolna, Emilio Parisotto, Sergio Gomez Colmenarejo, Alexander Novikov, Gabriel Barth-Maron, Mai Gimenez, Yury Sulsky, Jackie Kay, Jost Tobias Springenberg, et al. A generalist agent. arXiv preprint arXiv:2205.06175, 2022. 3

work page internal anchor Pith review Pith/arXiv arXiv 2022
[72]

Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware

Tony Z Zhao, Vikash Kumar, Sergey Levine, and Chelsea Finn. Learning fine-grained bimanual manipulation with low-cost hardware. arXiv preprint arXiv:2304.13705, 2023. 3, 22, 28

work page internal anchor Pith review Pith/arXiv arXiv 2023
[73]

Navid: Video-based vlm plans the next step for vision-and-language navigation

Jiazhao Zhang, Kunyu Wang, Rongtao Xu, Gengze Zhou, Yicong Hong, Xiaomeng Fang, Qi Wu, Zhizheng Zhang, and He Wang. Navid: Video-based vlm plans the next step for vision-and-language navigation. Robotics: Science and Systems , 2024. 3

work page 2024
[74]

Uni-navid: A video-based vision-language-action model for unifying embodied navigation tasks

Jiazhao Zhang, Kunyu Wang, Shaoan Wang, Minghan Li, Haoran Liu, Songlin Wei, Zhongyuan Wang, Zhizheng Zhang, and He Wang. Uni-navid: A video-based vision-language-action model for unifying embodied navigation tasks. arXiv preprint arXiv:2412.06224, 2024. 3

work page arXiv 2024
[75]

LLaMA: Open and Efficient Foundation Language Models

Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timo- thée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023. 3

work page internal anchor Pith review Pith/arXiv arXiv 2023
[76]

Language models are few-shot learners

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhari- wal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems , 33:1877– 1901, 2020

work page 1901
[77]

Roumeliotis and Nikolaos D

Konstantinos I. Roumeliotis and Nikolaos D. Tselikas. Chatgpt and open-ai models: A preliminary review. Future Internet, 15(6):192, 2023

work page 2023
[78]

Openai o3 and o4-mini system card, 2025

OpenAI. Openai o3 and o4-mini system card, 2025. URLhttps://openai.com/research/ o3-o4-mini-system-card . 3

work page 2025
[79]

DreamLLM: Synergistic multimodal comprehension and creation

Runpei Dong, Chunrui Han, Yuang Peng, Zekun Qi, Zheng Ge, Jinrong Yang, Liang Zhao, Jianjian Sun, Hongyu Zhou, Haoran Wei, Xiangwen Kong, Xiangyu Zhang, Kaisheng Ma, and Li Yi. DreamLLM: Synergistic multimodal comprehension and creation. In Int. Conf. Learn. Represent. (ICLR), 2024. 3, 4

work page 2024
[80]

Dreambench++: A human-aligned benchmark for personalized image generation

Yuang Peng, Yuxin Cui, Haomiao Tang, Zekun Qi, Runpei Dong, Jing Bai, Chunrui Han, Zheng Ge, Xiangyu Zhang, and Shu-Tao Xia. Dreambench++: A human-aligned benchmark for personalized image generation. CoRR, abs/2406.16855, 2024. 3

work page arXiv 2024

Showing first 80 references.