DataClaw0: Agentic Tailoring Multimodal Data from Raw Streams
Pith reviewed 2026-06-26 14:22 UTC · model grok-4.3
The pith
Agentic tailoring converts high-entropy raw multimodal streams into structured data that boosts downstream adaptation under limited training data.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that the two-stage pipeline grounding generative semantic synthesis in deterministic factual anchors yields a dataset large enough to train DataClaw0-9B, which then synergizes SFT and GRPO to align with complex refinement intents and produces high-information-density tailored data; downstream evaluations confirm this tailored data enables efficient model adaptation to new tasks under limited training data regimes.
What carries the argument
The two-stage pipeline that grounds generative semantic synthesis in deterministic factual anchors, which creates the training data needed to turn data tailoring into a trainable agentic skill.
Load-bearing premise
The two-stage pipeline that grounds generative semantic synthesis in deterministic factual anchors is sufficient to overcome the data scarcity bottleneck for training high-order data-tailoring capabilities.
What would settle it
A controlled experiment in which models post-trained on DataClaw0-tailored data show no measurable gains over models trained on the same volume of raw streams or heuristically processed data in the video generation, real-world VQA, or GUI navigation tasks.
Figures
read the original abstract
Massive unstructured multimodal streams suffer from high "data entropy," impeding both efficient human knowledge acquisition and high-quality AI post-training. Existing passive annotation paradigms, heavily reliant on heuristic rules or general VLMs, are costly, monotonous, and fail to unlock the deep procedural logic embedded in raw data. We elevate data processing to a learnable capability, proposing a paradigm shift towards Agentic Data Tailoring, which actively refining and structuring data to align with diverse user and downstream intents. To overcome the data scarcity bottleneck in training such high-order capabilities, we design a two-stage pipeline grounding generative semantic synthesis in deterministic Factual Anchors, yielding a large-scale dataset spanning five core physical and digital domains. Building upon this, $\text{DataClaw}_0$-9B model synergizes Supervised Fine-Tuning (SFT) with Group Relative Policy Optimization (GRPO), achieving robust alignment with complex refinement and tailoring intents. To systematically quantify this capability, we construct $\text{DataClaw}_0$-val, the first benchmark dedicated to data refinement. Crucially, we adopt downstream post-training as the ultimate validation touchstone. Evaluations on video generation, real-world VQA, and GUI navigation confirm that $\text{DataClaw}_0$ delivers high-information-density tailored data, facilitating efficient model adaptation to new tasks under limited training data regimes. Project page: https://czjdsg.github.io/MakeAnyData
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes Agentic Data Tailoring as a paradigm shift from passive annotation of multimodal streams. It introduces a two-stage pipeline that grounds generative semantic synthesis in deterministic Factual Anchors to produce a large-scale dataset across five domains, trains the DataClaw0-9B model via SFT combined with GRPO, releases the DataClaw0-val benchmark for data refinement, and validates the approach by using the tailored data for downstream post-training on video generation, real-world VQA, and GUI navigation tasks, claiming that the resulting data has high information density and enables efficient adaptation under limited-data regimes.
Significance. If the central claim holds under controlled conditions, the work could meaningfully advance data-centric AI by treating tailoring as a learnable capability rather than a heuristic process. The use of downstream post-training as the primary validation metric is a reasonable choice for assessing practical utility. However, the absence of any quantitative results, ablation studies, or experimental controls in the manuscript prevents assessment of whether the claimed gains are attributable to the agentic method itself.
major comments (2)
- [Abstract] Abstract: the claim that 'evaluations on video generation, real-world VQA, and GUI navigation confirm that DataClaw0 delivers high-information-density tailored data' is load-bearing for the central thesis, yet the manuscript provides no metrics, baselines, or description of controls. Without evidence that data volume, source distribution, and annotation cost were held fixed while varying only the tailoring procedure, observed downstream gains cannot be isolated from volume or domain effects.
- [Abstract] Abstract (two-stage pipeline description): the assertion that grounding generative semantic synthesis in deterministic Factual Anchors is 'sufficient to overcome the data scarcity bottleneck for training high-order data-tailoring capabilities' is presented without any derivation, construction details for the anchors, or ablation showing that this step enables the subsequent SFT+GRPO training of the 9B model. This assumption is central to the data-generation claim but remains unexamined.
minor comments (2)
- [Abstract] Notation for the model name is inconsistent (DataClaw0 vs. $ ext{DataClaw}_0$); standardize throughout.
- [Abstract] The project page URL is given but no reference to it appears in the main text; add a citation or footnote if the page contains additional implementation details.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback highlighting the need for stronger empirical grounding. We agree that the central claims require explicit quantitative support and controlled experiments, and we will revise the manuscript to address these points directly.
read point-by-point responses
-
Referee: [Abstract] Abstract: the claim that 'evaluations on video generation, real-world VQA, and GUI navigation confirm that DataClaw0 delivers high-information-density tailored data' is load-bearing for the central thesis, yet the manuscript provides no metrics, baselines, or description of controls. Without evidence that data volume, source distribution, and annotation cost were held fixed while varying only the tailoring procedure, observed downstream gains cannot be isolated from volume or domain effects.
Authors: We acknowledge that the current manuscript does not present the specific metrics, baselines, or explicit control descriptions in sufficient detail within the abstract or main body to fully isolate the tailoring procedure. In revision we will add a dedicated experimental controls subsection that reports data volume, source distribution, and annotation cost for all compared conditions, along with the key downstream metrics (e.g., FID, accuracy, success rate) and baseline comparisons. The abstract will be updated to reference these quantitative results. revision: yes
-
Referee: [Abstract] Abstract (two-stage pipeline description): the assertion that grounding generative semantic synthesis in deterministic Factual Anchors is 'sufficient to overcome the data scarcity bottleneck for training high-order data-tailoring capabilities' is presented without any derivation, construction details for the anchors, or ablation showing that this step enables the subsequent SFT+GRPO training of the 9B model. This assumption is central to the data-generation claim but remains unexamined.
Authors: We agree that the manuscript currently lacks a derivation, explicit construction details for the Factual Anchors, and an ablation isolating their contribution. In the revised version we will add a new subsection under the data-generation pipeline that provides the formal construction of the anchors, the grounding mechanism, and an ablation study comparing SFT+GRPO performance with and without the anchor stage to demonstrate its necessity for overcoming data scarcity. revision: yes
Circularity Check
No significant circularity; empirical pipeline with no derivation chain
full rationale
The manuscript presents a two-stage data tailoring pipeline and downstream task evaluations without any equations, derivations, or parameter-fitting steps that reduce to self-definition or self-citation. Claims rest on constructed datasets and post-training results rather than mathematical reductions; no load-bearing uniqueness theorems or ansatzes are invoked. The work is therefore self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
invented entities (1)
-
Factual Anchors
no independent evidence
Reference graph
Works this paper leans on
-
[1]
Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2. 5-vl technical report.arXiv:2502.13923, 2025. 1, 3
Pith/arXiv arXiv 2025
-
[2]
Qwen3-vl: Sharper vision, deeper thought, broader action
QwenTeam. Qwen3-vl: Sharper vision, deeper thought, broader action. https: //qwen.ai/blog?id=99f0335c4ad9ff6153e517418d48535ab6d8afef&from= research.latest-advancements-list, 2025. 1
2025
-
[3]
Visual instruction tuning
Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. In NeurIPS, 2023. 1, 3
2023
-
[4]
Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card. arXiv:2410.21276, 2024. 1, 3
Pith/arXiv arXiv 2024
-
[5]
Gpt-4 technical report.arXiv:2303.08774, 2023
Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report.arXiv:2303.08774, 2023. 1, 3
Pith/arXiv arXiv 2023
-
[7]
Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities.arXiv preprint arXiv:2507.06261, 2025. 1, 3
Pith/arXiv arXiv 2025
-
[8]
OpenAI Team. Gpt 5. https://openai.com/zh-Hans-CN/index/introducing-gpt-5/, 2025. 1
2025
-
[9]
https://www.anthropic.com/news/ claude-3-family, March 2024
Introducing the next generation of claude. https://www.anthropic.com/news/ claude-3-family, March 2024. 1
2024
-
[10]
Minimax-m2.7
MiniMax-AI. Minimax-m2.7. https://github.com/MiniMax-AI/MiniMax-M2.7, 2025. Accessed: 2026-05-05. 1
2025
-
[11]
Wan: Open and advanced large-scale video generative models.arXiv preprint arXiv:2503.20314, 2025
Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, Jianyuan Zeng, Jiayu Wang, Jingfeng Zhang, Jingren Zhou, Jinkai Wang, Jixuan Chen, Kai Zhu, Kang Zhao, Keyu Yan, Lianghua Huang, Mengyang Feng, Ningyi Zhang, Pandeng Li, Pingyu Wu, Ruihang Chu, Ruili Feng, Shiwei Zhang, Siyang Sun, Tao Fang, T...
Pith/arXiv arXiv 2025
-
[12]
Seedance 2.0: Advancing video generation for world complexity.arXiv preprint arXiv:2604.14148, 2026
Team Seedance, De Chen, Liyang Chen, Xin Chen, Ying Chen, Zhuo Chen, Zhuowei Chen, Feng Cheng, Tianheng Cheng, Yufeng Cheng, et al. Seedance 2.0: Advancing video generation for world complexity.arXiv preprint arXiv:2604.14148, 2026. 1
Pith/arXiv arXiv 2026
-
[13]
Ego4d: Around the world in 3,000 hours of egocentric video
Kristen Grauman, Andrew Westbury, Eugene Byrne, Zachary Chavis, Antonino Furnari, Rohit Girdhar, Jackson Hamburger, Hao Jiang, Miao Liu, Xingyu Liu, et al. Ego4d: Around the world in 3,000 hours of egocentric video. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18995–19012, 2022. 1, 3, 7, 31
2022
-
[14]
Habitat: A platform for embodied ai research
Manolis Savva, Abhishek Kadian, Oleksandr Maksymets, Yili Zhao, Erik Wijmans, Bhavana Jain, Julian Straub, Jia Liu, Vladlen Koltun, Jitendra Malik, et al. Habitat: A platform for embodied ai research. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 9339–9347, 2019. 1, 3 10
2019
-
[15]
Chang, Manolis Savva, Maciej Halber, Thomas Funkhouser, and Matthias Nießner
Angela Dai, Angel X. Chang, Manolis Savva, Maciej Halber, Thomas Funkhouser, and Matthias Nießner. Scannet: Richly-annotated 3d reconstructions of indoor scenes. InProc. Computer Vision and Pattern Recognition (CVPR), IEEE, 2017. 1, 3
2017
-
[16]
An- droidinthewild: A large-scale dataset for android device control.Advances in Neural Information Processing Systems, 36:59708–59728, 2023
Christopher Rawles, Alice Li, Daniel Rodriguez, Oriana Riva, and Timothy Lillicrap. An- droidinthewild: A large-scale dataset for android device control.Advances in Neural Information Processing Systems, 36:59708–59728, 2023. 1
2023
-
[17]
Androidworld: A dynamic benchmarking environment for autonomous agents, 2024
Christopher Rawles, Sarah Clinckemaillie, Yifan Chang, Jonathan Waltz, Gabrielle Lau, Mary- beth Fair, Alice Li, William Bishop, Wei Li, Folawiyo Campbell-Ajala, Daniel Toyama, Robert Berry, Divya Tyamagundlu, Timothy Lillicrap, and Oriana Riva. Androidworld: A dynamic benchmarking environment for autonomous agents, 2024. 1
2024
-
[18]
Mind2web: Towards a generalist agent for the web, 2023
Xiang Deng, Yu Gu, Boyuan Zheng, Shijie Chen, Samuel Stevens, Boshi Wang, Huan Sun, and Yu Su. Mind2web: Towards a generalist agent for the web, 2023. 1, 3
2023
-
[19]
Ego-exo4d: Understanding skilled human activity from first-and third-person perspectives
Kristen Grauman, Andrew Westbury, Lorenzo Torresani, Kris Kitani, Jitendra Malik, Triantafyl- los Afouras, Kumar Ashutosh, Vijay Baiyya, Siddhant Bansal, Bikram Boote, et al. Ego-exo4d: Understanding skilled human activity from first-and third-person perspectives. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1...
-
[20]
Howto100m: Learning a text-video embedding by watching hundred million narrated video clips
Antoine Miech, Dimitri Zhukov, Jean-Baptiste Alayrac, Makarand Tapaswi, Ivan Laptev, and Josef Sivic. Howto100m: Learning a text-video embedding by watching hundred million narrated video clips. InProceedings of the IEEE/CVF international conference on computer vision, pages 2630–2640, 2019. 1
2019
-
[21]
Osworld: Benchmarking multimodal agents for open-ended tasks in real computer environments, 2024
Tianbao Xie, Danyang Zhang, Jixuan Chen, Xiaochuan Li, Siheng Zhao, Ruisheng Cao, Toh Jing Hua, Zhoujun Cheng, Dongchan Shin, Fangyu Lei, Yitao Liu, Yiheng Xu, Shuyan Zhou, Silvio Savarese, Caiming Xiong, Victor Zhong, and Tao Yu. Osworld: Benchmarking multimodal agents for open-ended tasks in real computer environments, 2024. 1, 3
2024
-
[22]
Textbooks are all you need.arXiv preprint arXiv:2306.11644, 2023
Suriya Gunasekar, Yi Zhang, Jyoti Aneja, Caio César Teodoro Mendes, Allie Del Giorno, Sivakanth Gopi, Mojan Javaheripi, Piero Kauffmann, Gustavo de Rosa, Olli Saarikivi, et al. Textbooks are all you need.arXiv preprint arXiv:2306.11644, 2023. 1
Pith/arXiv arXiv 2023
-
[23]
Lima: Less is more for alignment.Advances in Neural Information Processing Systems, 36:55006–55021, 2023
Chunting Zhou, Pengfei Liu, Puxin Xu, Srinivasan Iyer, Jiao Sun, Yuning Mao, Xuezhe Ma, Avia Efrat, Ping Yu, Lili Yu, et al. Lima: Less is more for alignment.Advances in Neural Information Processing Systems, 36:55006–55021, 2023. 1, 2
2023
-
[24]
Lin Chen, Jisong Li, Xiaoyi Dong, Pan Zhang, Conghui He, Jiaqi Wang, Feng Zhao, and Dahua Lin. Sharegpt4v: Improving large multi-modal models with better captions.arXiv preprint arXiv:2311.12793, 2023. 1, 3
Pith/arXiv arXiv 2023
-
[25]
Llava-onevision: Easy visual task transfer.TMLR, 2025
Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Yanwei Li, Ziwei Liu, et al. Llava-onevision: Easy visual task transfer.TMLR, 2025. 1
2025
-
[26]
Peng Gao, Renrui Zhang, Chris Liu, Longtian Qiu, Siyuan Huang, Weifeng Lin, Shitian Zhao, Shijie Geng, Ziyi Lin, Peng Jin, et al. Sphinx-x: Scaling data and parameters for a family of multi-modal large language models.arXiv preprint arXiv:2402.05935, 2024. 1, 3
arXiv 2024
-
[27]
Minigpt-v2: large language model as a unified interface for vision-language multi-task learning, 2023
Jun Chen, Deyao Zhu, Xiaoqian Shen, Xiang Li, Zechun Liu, Pengchuan Zhang, Raghuraman Krishnamoorthi, Vikas Chandra, Yunyang Xiong, and Mohamed Elhoseiny. Minigpt-v2: large language model as a unified interface for vision-language multi-task learning, 2023. 1, 3
2023
-
[28]
Jianshu Zhang, Dongyu Yao, Renjie Pi, Paul Pu Liang, and Yi R Fung. Vlm2-bench: A closer look at how well vlms implicitly link explicit matching visual cues.arXiv:2502.12084, 2025. 2, 3
arXiv 2025
-
[29]
Jihan Yang, Shusheng Yang, Anjali Gupta, Rilyn Han, Li Fei-Fei, and Saining Xie. Thinking in Space: How Multimodal Large Language Models See, Remember and Recall Spaces.arXiv preprint arXiv:2412.14171, 2024. 2, 3 11
Pith/arXiv arXiv 2024
-
[30]
Mmsi-bench: A benchmark for multi-image spatial intelligence.arXiv preprint arXiv:2505.23764, 2025
Sihan Yang, Runsen Xu, Yiman Xie, Sizhe Yang, Mo Li, Jingli Lin, Chenming Zhu, Xiaochen Chen, Haodong Duan, Xiangyu Yue, Dahua Lin, Tai Wang, and Jiangmiao Pang. Mmsi-bench: A benchmark for multi-image spatial intelligence.arXiv preprint arXiv:2505.23764, 2025. 2, 3
Pith/arXiv arXiv 2025
-
[31]
Thinking in space: How multimodal large language models see, remember and recall spaces
Disheng Liu, Jingyu Wang, Zihan Chen, and Yuhao Zhang. Thinking in space: How multimodal large language models see, remember and recall spaces. InConference on Computer Vision and Pattern Recognition (CVPR), 2024. 2, 4
2024
-
[32]
Mico: Multi-image contrast for reinforcement visual reasoning.arXiv preprint arXiv:2506.22434, 2025
Xi Chen, Mingkang Zhu, Shaoteng Liu, Xiaoyang Wu, Xiaogang Xu, Yu Liu, Xiang Bai, and Hengshuang Zhao. Mico: Multi-image contrast for reinforcement visual reasoning.arXiv preprint arXiv:2506.22434, 2025. 2, 4
arXiv 2025
-
[33]
Jiawei Li, Yang Zhang, Ming Chen, and Hao Wang. Viewspatial-bench: Evaluating multi- perspective spatial localization in vision-language models.arXiv preprint arXiv:2505.21500,
-
[34]
Hallusionbench: an advanced diagnostic suite for entangled language hallucination and visual illusion in large vision-language models
Tianrui Guan, Fuxiao Liu, Xiyang Wu, Ruiqi Xian, Zongxia Li, Xiaoyu Liu, Xijun Wang, Lichang Chen, Furong Huang, Yaser Yacoob, et al. Hallusionbench: an advanced diagnostic suite for entangled language hallucination and visual illusion in large vision-language models. InCVPR, 2024. 2, 4
2024
-
[35]
Evaluating dependencies in fact editing for language models: Specificity and implication awareness
Zichao Li, Ines Arous, Siva Reddy, and Jackie Chi Kit Cheung. Evaluating dependencies in fact editing for language models: Specificity and implication awareness. InFindings of the Association for Computational Linguistics: EMNLP 2023, pages 7623–7636, 2023. 2, 4
2023
-
[36]
Data-centric artificial intelligence: A survey.arXiv preprint arXiv:2303.10158,
Daochen Zha, Zaid Pervaiz Bhat, Kwei-Herng Lai, Fan Yang, Zhimeng Jiang, Shaochen Zhong, and Xia Hu. Data-centric artificial intelligence: A survey.arXiv preprint arXiv:2303.10158,
-
[37]
Dataperf: Benchmarks for data-centric ai development.Advances in Neural Information Processing Systems, 36:5320– 5347, 2023
Mark Mazumder, Colby Banbury, Xiaozhe Yao, Bojan Karlaš, William Gaviria Rojas, Sudnya Diamos, Greg Diamos, Lynn He, Alicia Parrish, Hannah Rose Kirk, et al. Dataperf: Benchmarks for data-centric ai development.Advances in Neural Information Processing Systems, 36:5320– 5347, 2023. 2
2023
-
[38]
Smith, Daniel Khashabi, and Hannaneh Hajishirzi
Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Alisa Liu, Noah A. Smith, Daniel Khashabi, and Hannaneh Hajishirzi. Self-instruct: Aligning language model with self generated instruc- tions, 2022. 2, 4
2022
-
[39]
WizardLM: Empowering large pre-trained language models to follow complex instructions
Can Xu, Qingfeng Sun, Kai Zheng, Xiubo Geng, Pu Zhao, Jiazhan Feng, Chongyang Tao, Qing- wei Lin, and Daxin Jiang. WizardLM: Empowering large pre-trained language models to follow complex instructions. InThe Twelfth International Conference on Learning Representations,
-
[40]
Subhabrata Mukherjee, Arindam Mitra, Ganesh Jawahar, Sahaj Agarwal, Hamid Palangi, and Ahmed Awadallah. Orca: Progressive learning from complex explanation traces of gpt-4.arXiv preprint arXiv:2306.02707, 2023. 2, 4
Pith/arXiv arXiv 2023
-
[41]
Alpagasus: Training a better alpaca with fewer data
Lichang Chen, Shiyang Li, Jun Yan, Hai Wang, Kalpa Gunaratna, Vikas Yadav, Zheng Tang, Vijay Srinivasan, Tianyi Zhou, Heng Huang, and Hongxia Jin. Alpagasus: Training a better alpaca with fewer data. InThe Twelfth International Conference on Learning Representations,
-
[42]
Yi Xu, Yuxin Hu, Zaiwei Zhang, Gregory P Meyer, Siva Karthik Mustikovela, Siddhartha Srinivasa, Eric M Wolff, and Xin Huang. Vlm-ad: End-to-end autonomous driving through vision-language model supervision.arXiv preprint arXiv:2412.14446, 2024. 3
arXiv 2024
-
[43]
Spatialvlm: Endowing vision-language models with spatial reasoning capabilities
Boyuan Chen, Zhuo Xu, Sean Kirmani, Brain Ichter, Dorsa Sadigh, Leonidas Guibas, and Fei Xia. Spatialvlm: Endowing vision-language models with spatial reasoning capabilities. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 14455–14465, June 2024. 3 12
2024
-
[44]
RoboSpatial: Teaching spatial understanding to 2D and 3D vision-language models for robotics
Chan Hee Song, Valts Blukis, Jonathan Tremblay, Stephen Tyree, Yu Su, and Stan Birchfield. RoboSpatial: Teaching spatial understanding to 2D and 3D vision-language models for robotics. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025. Oral Presentation. 3
2025
-
[45]
Multi- modal chain-of-thought reasoning in language models.arXiv preprint arXiv:2302.00923, 2023
Zhuosheng Zhang, Aston Zhang, Mu Li, Hai Zhao, George Karypis, and Alex Smola. Multi- modal chain-of-thought reasoning in language models.arXiv preprint arXiv:2302.00923, 2023. 3
Pith/arXiv arXiv 2023
-
[46]
Yangyi Chen, Karan Sikka, Michael Cogswell, Heng Ji, and Ajay Divakaran. Measuring and improving chain-of-thought reasoning in vision-language models.Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics, pages 192–210, 2024. 3
2024
-
[47]
Shengju Qian, Hao Shao, Han Xiao, Guanglu Song, Zhuofan Zong, Letian Wang, Yu Liu, and Hongsheng Li. Visual CoT: Advancing multi-modal language models with a comprehen- sive dataset and benchmark for chain-of-thought reasoning.Advances in Neural Information Processing Systems, 37, 2024. 3
2024
-
[48]
Coin: A large-scale dataset for comprehensive instructional video analysis
Yansong Tang, Dajun Ding, Yongming Rao, Yu Zheng, Danyang Zhang, Lili Zhao, Jiwen Lu, and Jie Zhou. Coin: A large-scale dataset for comprehensive instructional video analysis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1207–1216, 2019. 3
2019
-
[49]
Omniworld: A multi-domain and multi-modal dataset for 4d world modeling, 2025
Yang Zhou, Yifan Wang, Jianjun Zhou, Wenzheng Chang, Haoyu Guo, Zizun Li, Kaijing Ma, Xinyue Li, Yating Wang, Haoyi Zhu, Mingyu Liu, Dingning Liu, Jiange Yang, Zhoujie Fu, Junyi Chen, Chunhua Shen, Jiangmiao Pang, Kaipeng Zhang, and Tong He. Omniworld: A multi-domain and multi-modal dataset for 4d world modeling, 2025. 3
2025
-
[50]
Worldmem: Long-term consistent world simulation with memory
Zeqi Xiao, LAN Yushi, Yifan Zhou, Wenqi Ouyang, Shuai Yang, Yanhong Zeng, and Xingang Pan. Worldmem: Long-term consistent world simulation with memory. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems. 3
-
[51]
The epic-kitchens dataset: Collection, challenges and baselines.IEEE Transactions on Pattern Analysis and Machine Intelligence, 43(11):4125–4141, 2020
Dima Damen, Hazel Doughty, Giovanni Maria Farinella, Sanja Fidler, Antonino Furnari, Evangelos Kazakos, Davide Moltisanti, Jonathan Munro, Toby Perrett, Will Price, et al. The epic-kitchens dataset: Collection, challenges and baselines.IEEE Transactions on Pattern Analysis and Machine Intelligence, 43(11):4125–4141, 2020. 3
2020
-
[52]
Grid: Visual layout generation.arXiv preprint arXiv:2412.10718, 2024
Cong Wan, Xiangyang Luo, Zijian Cai, Yiren Song, Yunlong Zhao, Yifan Bai, Yuhang He, and Yihong Gong. Grid: Visual layout generation.arXiv preprint arXiv:2412.10718, 2024. 3
arXiv 2024
-
[53]
Xi Chen, Zhifei Zhang, He Zhang, Yuqian Zhou, Soo Ye Kim, Qing Liu, Yijun Li, Jianming Zhang, Nanxuan Zhao, Yilin Wang, Hui Ding, Zhe Lin, and Hengshuang. Unireal: Universal im- age generation and editing via learning real-world dynamics.arXiv preprint arXiv:2412.07774,
-
[54]
Step1x-edit: A practical framework for general image editing.arXiv preprint arXiv:2504.17761, 2025
Shiyu Liu, Yucheng Han, Peng Xing, Fukun Yin, Rui Wang, Wei Cheng, Jiaqi Liao, Yingming Wang, Honghao Fu, Chunrui Han, Guopeng Li, Yuang Peng, Quan Sun, Jingwei Wu, Yan Cai, Zheng Ge, Ranchen Ming, Lei Xia, Xianfang Zeng, Yibo Zhu, Binxing Jiao, Xiangyu Zhang, Gang Yu, and Daxin Jiang. Step1x-edit: A practical framework for general image editing.arXiv pre...
Pith/arXiv arXiv 2025
-
[55]
Qwen-image technical report, 2025
Chenfei Wu, Jiahao Li, Jingren Zhou, Junyang Lin, Kaiyuan Gao, Kun Yan, Sheng ming Yin, Shuai Bai, Xiao Xu, Yilei Chen, Yuxiang Chen, Zecheng Tang, Zekai Zhang, Zhengyi Wang, An Yang, Bowen Yu, Chen Cheng, Dayiheng Liu, Deqing Li, Hang Zhang, Hao Meng, Hu Wei, Jingyuan Ni, Kai Chen, Kuan Cao, Liang Peng, Lin Qu, Minggang Wu, Peng Wang, Shuting Yu, Tingkun...
2025
-
[56]
Openvla: An open-source vision-language-action model.arXiv preprint arXiv:2406.09246,
Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan Foster, Grace Lam, Pannag Sanketi, Quan Vuong, Thomas Kollar, Benjamin Burchfiel, Russ Tedrake, Dorsa Sadigh, Sergey Levine, Percy Liang, and Chelsea Finn. Openvla: An open-source vision-language-action model.arXiv preprint arXiv:2406.09246,
-
[57]
Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, et al. pi0: A vision-language-action flow model for general robot control.arXiv preprint arXiv:2410.24164, 2024. 3
Pith/arXiv arXiv 2024
-
[58]
Gr-3 technical report.arXiv preprint arXiv:2507.15493,
Chilam Cheang, Sijin Chen, Zhongren Cui, Yingdong Hu, Liqun Huang, Tao Kong, Hang Li, Yifeng Li, Yuxiao Liu, Xiao Ma, et al. Gr-3 technical report.arXiv preprint arXiv:2507.15493,
-
[59]
Worldvla: Towards autoregressive action world model.arXiv preprint arXiv:2506.21539, 2025
Jun Cen, Chaohui Yu, Hangjie Yuan, Yuming Jiang, Siteng Huang, Jiayan Guo, Xin Li, Yibing Song, Hao Luo, Fan Wang, et al. Worldvla: Towards autoregressive action world model.arXiv preprint arXiv:2506.21539, 2025. 3
Pith/arXiv arXiv 2025
-
[60]
Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning.arXiv preprint arXiv:2501.12948, 2025. 3
Pith/arXiv arXiv 2025
-
[61]
Direct preference optimization: Your language model is secretly a reward model
Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D Manning, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. Advances in Neural Information Processing Systems, 36:53728–53741, 2024. 3
2024
-
[62]
LLaV A-RLHF: Aligning large multimodal models with factually augmented RLHF
Zhiqing Sun, Sheng Shen, Shengcao Cao, Haotian Liu, Chunyuan Li, Yujia Shen, Chuang Gan, Liang-Yan Gui, Yu-Xiong Wang, Yiming Yang, et al. LLaV A-RLHF: Aligning large multimodal models with factually augmented RLHF. https://llava-rlhf.github.io/,
-
[63]
Accessed: 2024-10-30. 3
2024
-
[64]
Towards comprehensive reasoning in vision-language models.Proceedings of the IEEE/CVF International Conference on Computer Vision, 2024
Yiwei Wang, Kai-Wei Chang, and Ming-Hsuan Yang. Towards comprehensive reasoning in vision-language models.Proceedings of the IEEE/CVF International Conference on Computer Vision, 2024. 3
2024
-
[65]
En- hancing advanced visual reasoning ability of large language models.Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 1915–1929, 2024
Zhiyuan Li, Dongnan Liu, Chaoyi Zhang, Heng Wang, Tengfei Xue, and Weidong Cai. En- hancing advanced visual reasoning ability of large language models.Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 1915–1929, 2024. 3
2024
-
[66]
Remot: Reinforcement learning with motion contrast triplets.arXiv preprint arXiv:2603.00461, 2026
Cong Wan, Zeyu Guo, Jiangyang Li, SongLin Dong, Yifan Bai, Lin Peng, Zhiheng Ma, and Yihong Gong. Remot: Reinforcement learning with motion contrast triplets.arXiv preprint arXiv:2603.00461, 2026. 3, 7, 31
Pith/arXiv arXiv 2026
-
[67]
Imagenet: A large- scale hierarchical image database
Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large- scale hierarchical image database. In2009 IEEE conference on computer vision and pattern recognition, pages 248–255. Ieee, 2009. 3
2009
-
[68]
Hang Zhang, Xin Li, and Lidong Bing. Video-llama: An instruction-tuned audio-visual language model for video understanding.arXiv preprint arXiv:2306.02858, 2023. 3
Pith/arXiv arXiv 2023
-
[69]
Qwen3 technical report.arXiv preprint arXiv:2505.09388, 2025
An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report.arXiv preprint arXiv:2505.09388, 2025. 3
Pith/arXiv arXiv 2025
-
[70]
Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks
Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, et al. Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 24185–24198, 2024. 3
2024
-
[71]
Vlm4d: Towards spatiotem- poral awareness in vision language models
Shijie Zhou, Alexander Vilesov, Xuehai He, Ziyu Wan, Shuwang Zhang, Aditya Nagachandra, Di Chang, Dongdong Chen, Eric Xin Wang, and Achuta Kadambi. Vlm4d: Towards spatiotem- poral awareness in vision language models. InProceedings of the IEEE/CVF international conference on computer vision, 2025. 3
2025
-
[72]
Jiangyang Li, Cong Wan, SongLin Dong, Chenhao Ding, Qiang Wang, Zhiheng Ma, and Yihong Gong. Trajectory-diversity-driven robust vision-and-language navigation.arXiv preprint arXiv:2603.15370, 2026. 3 14
arXiv 2026
-
[73]
Canonswap: High-fidelity and consistent video face swapping via canonical space modulation
Xiangyang Luo, Ye Zhu, Yunfei Liu, Lijian Lin, Cong Wan, Zijian Cai, Yu Li, and Shao-Lun Huang. Canonswap: High-fidelity and consistent video face swapping via canonical space modulation. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 10064–10074, 2025. 3
2025
-
[74]
Jianchao Zhao, Huoren Yang, Hu Yusong, Yuyang Gao, Qiguan Ou, Cong Wan, SongLin Dong, Zhiheng Ma, and Yihong Gong. Retrieve-then-steer: Online success memory for test-time adaptation of generative vlas.arXiv preprint arXiv:2605.10094, 2026. 3
Pith/arXiv arXiv 2026
-
[75]
Jiangyang Li, Cong Wan, Changjie Wu, Songlin Dong, Lingjun Zhang, Linzhe Shi, Xu Wang, Zhiheng Ma, Hang Zhang, Mu Xu, et al. Prosr: Process-shaped spatial reasoning for reliable chain-of-thought in vlms.arXiv preprint arXiv:2605.25524, 2026. 3
Pith/arXiv arXiv 2026
-
[76]
React: Synergizing reasoning and acting in language models
Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models. InInternational Conference on Learning Representations, 2023. 4
2023
-
[77]
Toolformer: Language models can teach themselves to use tools
Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Eric Hambro, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. Toolformer: Language models can teach themselves to use tools. InAdvances in Neural Information Processing Systems, 2023. 4
2023
-
[78]
Hugginggpt: Solving ai tasks with chatgpt and its friends in hugging face
Yongliang Shen, Kaitao Song, Xu Tan, Dongsheng Li, Weiming Lu, and Yueting Zhuang. Hugginggpt: Solving ai tasks with chatgpt and its friends in hugging face. InAdvances in Neural Information Processing Systems, 2023. 4
2023
-
[79]
Griffiths, Yuan Cao, and Karthik Narasimhan
Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Thomas L. Griffiths, Yuan Cao, and Karthik Narasimhan. Tree of thoughts: Deliberate problem solving with large language models. In Advances in Neural Information Processing Systems, 2023. 4
2023
-
[80]
Re- flexion: Language agents with verbal reinforcement learning
Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Re- flexion: Language agents with verbal reinforcement learning. InAdvances in Neural Information Processing Systems, 2023. 4
2023
-
[81]
V oyager: An open-ended embodied agent with large language models
Guanzhi Wang, Yuqi Xie, Yunfan Jiang, Ajay Mandlekar, Chaowei Xiao, Yuke Zhu, Linxi Fan, and Anima Anandkumar. V oyager: An open-ended embodied agent with large language models. arXiv preprint arXiv:2305.16291, 2023. 4
Pith/arXiv arXiv 2023
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.