Recognition: 2 theorem links
· Lean TheoremRoboDreamer: Learning Compositional World Models for Robot Imagination
Pith reviewed 2026-05-15 20:43 UTC · model grok-4.3
The pith
RoboDreamer factorizes video generation using language primitives to create plans for unseen robot tasks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
RoboDreamer learns a compositional world model by factorizing the video generation process. It leverages the natural compositionality of language to parse instructions into a set of lower-level primitives, which condition a set of models to generate videos. This enables compositional generalization to new natural language instructions formed as combinations of previously seen components, as well as the incorporation of additional multimodal goals such as goal images.
What carries the argument
Factorization of video generation into models each conditioned on a lower-level language primitive parsed from the full instruction.
If this is right
- Successfully synthesizes video plans on unseen goals in the RT-X dataset.
- Enables successful robot execution of the plans in simulation.
- Substantially outperforms monolithic baseline approaches to video generation.
- Allows specification of desired videos using both natural language and goal images.
Where Pith is reading between the lines
- The method could scale to real physical robots if the simulated executions transfer well.
- Similar factorization might apply to other generative tasks like action sequence prediction.
- It suggests that explicit decomposition of language can mitigate the combinatorial explosion in training data needs for world models.
Load-bearing premise
Natural language instructions can be reliably parsed into lower-level primitives whose separate models compose into coherent, realistic videos without introducing artifacts or losing task-relevant details.
What would settle it
Running the generated videos for an unseen combination of primitives through a robot simulator and checking whether the robot completes the intended task without errors or incoherent motions.
read the original abstract
Text-to-video models have demonstrated substantial potential in robotic decision-making, enabling the imagination of realistic plans of future actions as well as accurate environment simulation. However, one major issue in such models is generalization -- models are limited to synthesizing videos subject to language instructions similar to those seen at training time. This is heavily limiting in decision-making, where we seek a powerful world model to synthesize plans of unseen combinations of objects and actions in order to solve previously unseen tasks in new environments. To resolve this issue, we introduce RoboDreamer, an innovative approach for learning a compositional world model by factorizing the video generation. We leverage the natural compositionality of language to parse instructions into a set of lower-level primitives, which we condition a set of models on to generate videos. We illustrate how this factorization naturally enables compositional generalization, by allowing us to formulate a new natural language instruction as a combination of previously seen components. We further show how such a factorization enables us to add additional multimodal goals, allowing us to specify a video we wish to generate given both natural language instructions and a goal image. Our approach can successfully synthesize video plans on unseen goals in the RT-X, enables successful robot execution in simulation, and substantially outperforms monolithic baseline approaches to video generation.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces RoboDreamer, a compositional world model for robotic imagination that factorizes text-to-video generation by parsing natural language instructions into lower-level primitives and conditioning separate models on them. This factorization is claimed to enable generalization to unseen object-action combinations in the RT-X dataset, support successful robot execution in simulation, and allow multimodal conditioning via language plus goal images, while substantially outperforming monolithic video generation baselines.
Significance. If the compositional factorization demonstrably yields artifact-free videos and reliable execution on novel compositions, the work could advance world models for robotics by providing a scalable path to generalization without exhaustive retraining on all task variants. The multimodal extension further supports flexible goal specification in planning pipelines.
major comments (2)
- [Abstract] Abstract: The central claims of successful synthesis on unseen RT-X goals, successful simulation execution, and substantial outperformance over monolithic baselines are stated without any quantitative metrics (e.g., FVD, success rates, artifact counts), baseline names, error bars, or ablation results. This absence directly limits assessment of whether the primitive factorization satisfies the conditions for coherent composition.
- [Abstract] The description of language parsing into primitives and subsequent model composition does not specify mechanisms for handling parser ambiguity, cross-primitive alignment on shared scene elements, or quantitative checks for lost task details on held-out combinations. These omissions are load-bearing for the generalization argument, as any misalignment would falsify the outperformance and execution claims.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address each major comment below and indicate the revisions planned for the manuscript.
read point-by-point responses
-
Referee: [Abstract] Abstract: The central claims of successful synthesis on unseen RT-X goals, successful simulation execution, and substantial outperformance over monolithic baselines are stated without any quantitative metrics (e.g., FVD, success rates, artifact counts), baseline names, error bars, or ablation results. This absence directly limits assessment of whether the primitive factorization satisfies the conditions for coherent composition.
Authors: We agree that the abstract would benefit from key quantitative indicators. The full manuscript reports FVD scores, execution success rates with error bars, specific baseline names, and ablation results in Sections 4 and 5. In revision we will incorporate the most salient metrics (FVD improvement and success rate on unseen compositions) directly into the abstract while preserving its length. revision: yes
-
Referee: [Abstract] The description of language parsing into primitives and subsequent model composition does not specify mechanisms for handling parser ambiguity, cross-primitive alignment on shared scene elements, or quantitative checks for lost task details on held-out combinations. These omissions are load-bearing for the generalization argument, as any misalignment would falsify the outperformance and execution claims.
Authors: The abstract is intentionally concise. The manuscript details the LLM-based parser with ambiguity handling via confidence thresholding and fallback sampling (Section 3.1), cross-primitive alignment through shared visual latents and consistency losses (Section 3.2), and quantitative verification via held-out ablations and task-fidelity metrics (Section 4.3). We will add a single sentence to the abstract summarizing these mechanisms and referencing the supporting results. revision: yes
Circularity Check
No significant circularity in RoboDreamer's empirical compositional factorization
full rationale
The paper presents RoboDreamer as an empirical method that parses language instructions into lower-level primitives and trains separate conditional models whose outputs are composed for video generation. No equations, fitted parameters, or self-citations are shown to reduce the claimed compositional generalization or video synthesis results to inputs by construction. Claims of success on unseen RT-X goals and outperformance over monolithic baselines rest on experimental outcomes rather than tautological redefinitions or load-bearing self-references. The approach is self-contained against external benchmarks with no detected self-definitional, prediction-renaming, or uniqueness-imported circular steps.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Language instructions can be parsed into a set of lower-level primitives that capture compositionality.
Lean theorems connected to this paper
-
IndisputableMonolith.Foundation.LogicAsFunctionalEquationRCL_is_unique_functional_form_of_logic echoes?
echoesECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
we introduce RoboDreamer, an innovative approach for learning a compositional world model by factorizing the video generation. We leverage the natural compositionality of language to parse instructions into a set of lower-level primitives, which we condition a set of models on to generate videos.
-
IndisputableMonolith.Foundation.HierarchyEmergencehierarchy_emergence_forces_phi echoes?
echoesECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
This factorization naturally enables compositional generalization, by allowing us to formulate a new natural language instruction as a combination of previously seen components.
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 22 Pith papers
-
NoiseGate: Learning Per-Latent Timestep Schedules as Information Gating in World Action Models
NoiseGate learns per-latent timestep schedules as an information-gating policy in diffusion-based world action models, yielding consistent gains on RoboTwin manipulation tasks.
-
OA-WAM: Object-Addressable World Action Model for Robust Robot Manipulation
OA-WAM uses persistent address vectors and dynamic content vectors in object slots to enable addressable world-action prediction, improving robustness on manipulation benchmarks under scene changes.
-
Agent-Centric Observation Adaptation for Robust Visual Control under Dynamic Perturbations
ACO-MoE employs agent-centric mixture-of-experts to decouple task-relevant features from dynamic visual perturbations in RL, recovering 95.3% of clean performance on the new VDCS benchmark.
-
Agent-Centric Observation Adaptation for Robust Visual Control under Dynamic Perturbations
ACO-MoE recovers 95.3% of clean-input performance in visual control tasks under Markov-switching corruptions by routing restoration experts and anchoring representations to clean foreground masks.
-
RoboWM-Bench: A Benchmark for Evaluating World Models in Robotic Manipulation
RoboWM-Bench evaluates video world models by converting their manipulation video predictions into executable actions validated in simulation, showing that visual plausibility does not guarantee physical executability.
-
RoboWM-Bench: A Benchmark for Evaluating World Models in Robotic Manipulation
RoboWM-Bench evaluates video world models by converting their outputs into executable robot actions and running them on manipulation tasks, showing that physical inconsistencies remain common.
-
ViVa: A Video-Generative Value Model for Robot Reinforcement Learning
ViVa turns a video generator into a value model for robot RL that jointly forecasts future states and task value, yielding better performance on real-world box assembly when integrated with RECAP.
-
Action Images: End-to-End Policy Learning via Multiview Video Generation
Action Images turn robot arm motions into interpretable multiview pixel videos, letting video backbones serve as zero-shot policies for end-to-end robot learning.
-
QuantVLA: Scale-Calibrated Post-Training Quantization for Vision-Language-Action Models
QuantVLA is the first post-training quantization framework for VLA models that quantizes the diffusion transformer action head and reports higher task success rates than full-precision baselines with roughly 70% memor...
-
DreamGen: Unlocking Generalization in Robot Learning through Video World Models
DreamGen trains robot policies on synthetic trajectories from adapted video world models, enabling a humanoid robot to perform 22 new behaviors in seen and unseen environments from a single pick-and-place teleoperatio...
-
HarmoWAM: Harmonizing Generalizable and Precise Manipulation via Adaptive World Action Models
HarmoWAM unifies predictive and reactive control in world action models via an adaptive gating mechanism to deliver improved zero-shot generalization and precision in robotic manipulation.
-
CorridorVLA: Explicit Spatial Constraints for Generative Action Heads via Sparse Anchors
CorridorVLA improves VLA models by using predicted sparse anchors to impose explicit spatial corridors on action trajectories, yielding 3.4-12.4% success rate gains on LIBERO-Plus with GR00T-Corr reaching 83.21%.
-
Human Cognition in Machines: A Unified Perspective of World Models
The paper introduces a unified framework for world models that fully incorporates all cognitive functions from Cognitive Architecture Theory, highlights under-researched areas in motivation and meta-cognition, and pro...
-
VADF: Vision-Adaptive Diffusion Policy Framework for Efficient Robotic Manipulation
VADF adds an Adaptive Loss Network for hard-negative training sampling and a Hierarchical Vision Task Segmenter for adaptive noise scheduling during inference to speed convergence and reduce timeouts in diffusion robo...
-
Grounded World Model for Semantically Generalizable Planning
A vision-language-aligned world model turns visuomotor MPC into a language-following planner that reaches 87% success on 288 unseen semantic tasks where standard VLAs drop to 22%.
-
Fast-WAM: Do World Action Models Need Test-time Future Imagination?
Fast-WAM shows that explicit future imagination at test time is not required for strong WAM performance; video modeling during training provides the main benefit.
-
World Action Models are Zero-shot Policies
DreamZero uses a 14B video diffusion model as a World Action Model to achieve over 2x better zero-shot generalization on real robots than state-of-the-art VLAs, real-time 7Hz closed-loop control, and cross-embodiment ...
-
Visual Generation in the New Era: An Evolution from Atomic Mapping to Agentic World Modeling
Visual generation models are evolving from passive renderers to interactive agentic world modelers, but current systems lack spatial reasoning, temporal consistency, and causal understanding, with evaluations overemph...
-
StableIDM: Stabilizing Inverse Dynamics Model against Manipulator Truncation via Spatio-Temporal Refinement
StableIDM stabilizes inverse dynamics models under manipulator truncation by combining robot-centric masking, directional spatial feature aggregation, and temporal dynamics refinement, yielding 12.1% higher strict act...
-
ComSim: Building Scalable Real-World Robot Data Generation via Compositional Simulation
Compositional Simulation generates scalable real-world robot training data by combining classical simulation with neural simulation in a closed-loop real-sim-real augmentation pipeline.
-
Causal World Modeling for Robot Control
LingBot-VA combines video world modeling with policy learning via Mixture-of-Transformers, closed-loop rollouts, and asynchronous inference to improve robot manipulation in simulation and real settings.
-
World Action Models: The Next Frontier in Embodied AI
The paper introduces World Action Models as a new paradigm unifying predictive world modeling with action generation in embodied foundation models and provides a taxonomy of existing approaches.
Reference graph
Works this paper leans on
-
[7]
Unsupervised learning of compositional energy concepts
Du, Y., Li, S., Sharma, Y., Tenenbaum, J., and Mordatch, I. Unsupervised learning of compositional energy concepts. Advances in Neural Information Processing Systems, 34: 0 15608--15620, 2021
work page 2021
-
[8]
B., Dieleman, S., Fergus, R., Sohl-Dickstein, J., Doucet, A., and Grathwohl, W
Du, Y., Durkan, C., Strudel, R., Tenenbaum, J. B., Dieleman, S., Fergus, R., Sohl-Dickstein, J., Doucet, A., and Grathwohl, W. S. Reduce, reuse, recycle: Compositional generation with energy-based diffusion models and mcmc. In International Conference on Machine Learning, pp.\ 8489--8510. PMLR, 2023 a
work page 2023
-
[12]
G., Tapaswi, M., Laptev, I., and Schmid, C
Guhur, P.-L., Chen, S., Pinel, R. G., Tapaswi, M., Laptev, I., and Schmid, C. Instruction-driven history-aware policies for robotic manipulations. In Conference on Robot Learning, pp.\ 175--187. PMLR, 2023
work page 2023
-
[15]
Diffusion-based generation, optimization, and planning in 3d scenes
Huang, S., Wang, Z., Li, P., Jia, B., Liu, T., Zhu, Y., Liang, W., and Zhu, S.-C. Diffusion-based generation, optimization, and planning in 3d scenes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.\ 16750--16761, 2023
work page 2023
-
[17]
James, S., Ma, Z., Arrojo, D. R., and Davison, A. J. Rlbench: The robot learning benchmark & learning environment. IEEE Robotics and Automation Letters, 5 0 (2): 0 3019--3026, 2020
work page 2020
-
[20]
Ko, P.-C., Mao, J., Du, Y., Sun, S.-H., and Tenenbaum, J. B. Learning to act from actionless videos through dense correspondences, 2023
work page 2023
-
[21]
Adaptdiffuser: Diffusion models as adaptive self-evolving planners
Liang, Z., Mu, Y., Ding, M., Ni, F., Tomizuka, M., and Luo, P. Adaptdiffuser: Diffusion models as adaptive self-evolving planners. In International Conference on Machine Learning, 2023
work page 2023
-
[23]
Learning to compose visual relations
Liu, N., Li, S., Du, Y., Tenenbaum, J., and Torralba, A. Learning to compose visual relations. Advances in Neural Information Processing Systems, 34: 0 23166--23178, 2021
work page 2021
-
[24]
Liu, N., Li, S., Du, Y., Torralba, A., and Tenenbaum, J. B. Compositional visual generation with composable diffusion models. In European Conference on Computer Vision, pp.\ 423--439. Springer, 2022
work page 2022
-
[25]
Structdiffusion: Language-guided creation of physically-valid structures using unseen objects
Liu, W., Du, Y., Hermans, T., Chernova, S., and Paxton, C. Structdiffusion: Language-guided creation of physically-valid structures using unseen objects. In RSS 2023, 2023 b
work page 2023
-
[26]
Controllable and compositional generation with latent-space energy-based models
Nie, W., Vahdat, A., and Anandkumar, A. Controllable and compositional generation with latent-space energy-based models. Advances in Neural Information Processing Systems, 34: 0 13497--13510, 2021
work page 2021
-
[28]
Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W., and Liu, P. J. Exploring the limits of transfer learning with a unified text-to-text transformer. The Journal of Machine Learning Research, 21 0 (1): 0 5485--5551, 2020
work page 2020
-
[29]
High-resolution image synthesis with latent diffusion models
Rombach, R., Blattmann, A., Lorenz, D., Esser, P., and Ommer, B. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp.\ 10684--10695, 2022
work page 2022
-
[30]
Shi, C., Ni, H., Li, K., Han, S., Liang, M., and Min, M. R. Exploring compositional visual generation with latent classifier guidance. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.\ 853--862, 2023
work page 2023
-
[34]
Modular action concept grounding in semantic video prediction
Yu, W., Chen, W., Yin, S., Easterbrook, S., and Garg, A. Modular action concept grounding in semantic video prediction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.\ 3605--3614, 2022
work page 2022
-
[35]
Zhang, E., Lu, Y., Wang, W. Y., and Zhang, A. Lad: Language augmented diffusion for reinforcement learning. In Second Workshop on Language and Reinforcement Learning, 2022
work page 2022
-
[36]
Adding conditional control to text-to-image diffusion models
Zhang, L., Rao, A., and Agrawala, M. Adding conditional control to text-to-image diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp.\ 3836--3847, 2023
work page 2023
-
[38]
arXiv preprint arXiv:2302.00111 , year=
Learning universal policies via text-guided video generation , author=. arXiv preprint arXiv:2302.00111 , year=
-
[39]
Proceedings of the IEEE/CVF International Conference on Computer Vision , pages=
Adding conditional control to text-to-image diffusion models , author=. Proceedings of the IEEE/CVF International Conference on Computer Vision , pages=
-
[40]
StructDiffusion: Language-Guided Creation of Physically-Valid Structures using Unseen Objects , author =. 2023 , booktitle =
work page 2023
-
[41]
International Conference on Machine Learning , year=
AdaptDiffuser: Diffusion Models as Adaptive Self-evolving Planners , author=. International Conference on Machine Learning , year=
-
[42]
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=
Diffusion-based generation, optimization, and planning in 3d scenes , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=
-
[43]
Second Workshop on Language and Reinforcement Learning , year=
Lad: Language augmented diffusion for reinforcement learning , author=. Second Workshop on Language and Reinforcement Learning , year=
-
[44]
Imitating human behaviour with diffusion models
Imitating human behaviour with diffusion models , author=. arXiv preprint arXiv:2301.10677 , year=
-
[45]
Diffusion Policy: Visuomotor Policy Learning via Action Diffusion
Diffusion policy: Visuomotor policy learning via action diffusion , author=. arXiv preprint arXiv:2303.04137 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[46]
Advances in Neural Information Processing Systems , volume=
Controllable and compositional generation with latent-space energy-based models , author=. Advances in Neural Information Processing Systems , volume=
-
[47]
Advances in Neural Information Processing Systems , volume=
Unsupervised learning of compositional energy concepts , author=. Advances in Neural Information Processing Systems , volume=
-
[48]
Advances in Neural Information Processing Systems , volume=
Learning to compose visual relations , author=. Advances in Neural Information Processing Systems , volume=
-
[49]
arXiv preprint arXiv:2304.14391 , year=
Energy-based models as zero-shot planners for compositional scene rearrangement , author=. arXiv preprint arXiv:2304.14391 , year=
-
[50]
International Conference on Machine Learning , pages=
Reduce, reuse, recycle: Compositional generation with energy-based diffusion models and mcmc , author=. International Conference on Machine Learning , pages=. 2023 , organization=
work page 2023
-
[51]
Composable Text Controls in Latent Space with ODE s
Liu, Guangyi and Feng, Zeyu and Gao, Yuan and Yang, Zichao and Liang, Xiaodan and Bao, Junwei and He, Xiaodong and Cui, Shuguang and Li, Zhen and Hu, Zhiting. Composable Text Controls in Latent Space with ODE s. Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. 2023. doi:10.18653/v1/2023.emnlp-main.1030
-
[52]
Planning with Diffusion for Flexible Behavior Synthesis
Planning with diffusion for flexible behavior synthesis , author=. arXiv preprint arXiv:2205.09991 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[53]
Is Conditional Generative Modeling all you need for Decision-Making?
Is conditional generative modeling all you need for decision-making? , author=. arXiv preprint arXiv:2211.15657 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[54]
European Conference on Computer Vision , pages=
Compositional visual generation with composable diffusion models , author=. European Conference on Computer Vision , pages=. 2022 , organization=
work page 2022
-
[55]
The Journal of Machine Learning Research , volume=
Exploring the limits of transfer learning with a unified text-to-text transformer , author=. The Journal of Machine Learning Research , volume=. 2020 , publisher=
work page 2020
-
[56]
Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=
High-resolution image synthesis with latent diffusion models , author=. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=
-
[57]
RT-1: Robotics Transformer for Real-World Control at Scale
Rt-1: Robotics transformer for real-world control at scale , author=. arXiv preprint arXiv:2212.06817 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[58]
Bridge Data: Boosting Generalization of Robotic Skills with Cross-Domain Datasets
Bridge data: Boosting generalization of robotic skills with cross-domain datasets , author=. arXiv preprint arXiv:2109.13396 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[59]
IEEE Robotics and Automation Letters , year=
Interactive language: Talking to robots in real time , author=. IEEE Robotics and Automation Letters , year=
- [60]
-
[61]
Open X-Embodiment: Robotic Learning Datasets and RT-X Models
Open x-embodiment: Robotic learning datasets and rt-x models , author=. arXiv preprint arXiv:2310.08864 , year=
work page internal anchor Pith review Pith/arXiv arXiv
- [62]
-
[63]
Conference on Robot Learning , pages=
Bc-z: Zero-shot task generalization with robotic imitation learning , author=. Conference on Robot Learning , pages=. 2022 , organization=
work page 2022
-
[64]
International conference on machine learning , pages=
Learning transferable visual models from natural language supervision , author=. International conference on machine learning , pages=. 2021 , organization=
work page 2021
-
[65]
arXiv preprint arXiv:2309.08587 , year=
Compositional Foundation Models for Hierarchical Planning , author=. arXiv preprint arXiv:2309.08587 , year=
-
[66]
arXiv preprint arXiv:2310.06114 , year=
Learning Interactive Real-World Simulators , author=. arXiv preprint arXiv:2310.06114 , year=
-
[67]
European Conference on Computer Vision (ECCV) , year=
Scaling Egocentric Vision: The EPIC-KITCHENS Dataset , author=. European Conference on Computer Vision (ECCV) , year=
- [68]
-
[69]
Multilingual Constituency Parsing with Self-Attention and Pre-Training
Multilingual constituency parsing with self-attention and pre-training , author=. arXiv preprint arXiv:1812.11760 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[70]
IEEE Robotics and Automation Letters , volume=
Rlbench: The robot learning benchmark & learning environment , author=. IEEE Robotics and Automation Letters , volume=. 2020 , publisher=
work page 2020
-
[71]
Imagen Video: High Definition Video Generation with Diffusion Models
Imagen video: High definition video generation with diffusion models , author=. arXiv preprint arXiv:2210.02303 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[72]
Make-A-Video: Text-to-Video Generation without Text-Video Data
Make-a-video: Text-to-video generation without text-video data , author=. arXiv preprint arXiv:2209.14792 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[73]
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=
Ego4d: Around the world in 3,000 hours of egocentric video , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=
-
[74]
Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection
Grounding dino: Marrying dino with grounded pre-training for open-set object detection , author=. arXiv preprint arXiv:2303.05499 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[75]
Perceiver IO: A General Architecture for Structured Inputs & Outputs
Perceiver io: A general architecture for structured inputs & outputs , author=. arXiv preprint arXiv:2107.14795 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[76]
Conference on Robot Learning , pages=
Instruction-driven history-aware policies for robotic manipulations , author=. Conference on Robot Learning , pages=. 2023 , organization=
work page 2023
-
[77]
Learning to Act from Actionless Videos through Dense Correspondences , author=. 2023 , eprint=
work page 2023
-
[78]
arXiv preprint arXiv:2310.09629 , year=
Adaptive Online Replanning with Diffusion Models , author=. arXiv preprint arXiv:2310.09629 , year=
-
[79]
arXiv preprint arXiv:2306.01872 , year=
Probabilistic Adaptation of Text-to-Video Models , author=. arXiv preprint arXiv:2306.01872 , year=
-
[80]
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=
Exploring Compositional Visual Generation with Latent Classifier Guidance , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=
-
[81]
arXiv preprint arXiv:2004.11714 , year=
Residual energy-based models for text generation , author=. arXiv preprint arXiv:2004.11714 , year=
-
[82]
arXiv preprint arXiv:2401.01952 , year=
Instruct-Imagen: Image Generation with Multi-modal Instruction , author=. arXiv preprint arXiv:2401.01952 , year=
-
[83]
arXiv preprint arXiv:2311.10709 , year=
Emu video: Factorizing text-to-video generation by explicit image conditioning , author=. arXiv preprint arXiv:2311.10709 , year=
-
[84]
arXiv preprint arXiv:2006.15327 , year=
Compositional video synthesis with action graphs , author=. arXiv preprint arXiv:2006.15327 , year=
-
[85]
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=
Modular action concept grounding in semantic video prediction , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.