pith. sign in

arxiv: 2606.18180 · v1 · pith:IDNOAMSInew · submitted 2026-06-16 · 💻 cs.CV

EgoCS-400K: An Egocentric Gameplay Dataset for World Models

Pith reviewed 2026-06-27 01:36 UTC · model grok-4.3

classification 💻 cs.CV
keywords egocentric video datasetworld modelsgameplay trajectoriesaction-conditioned predictionCounter-Strikefirst-person videosinteractive modelinghuman gameplay data
0
0 comments X

The pith

EgoCS-400K supplies over 400,000 first-person videos with aligned human actions, states, and events from Counter-Strike matches to support world model training.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents EgoCS-400K as a large-scale dataset of egocentric gameplay videos from professional Counter-Strike matches. It is constructed by parsing public demos to extract trajectories, render first-person views, and align them with player actions, camera movements, game states, and events. This provides the temporally synchronized video-action-language data that world models need for tasks involving future prediction and scene understanding. Existing data sources fall short because web videos lack actions while robotic datasets are too small and simulators lack human interaction scale. The dataset thus offers a practical way to scale up training for interactive visual models.

Core claim

By extracting player states, view directions, movements, inputs, weapon usage, game events, and round context from public CS and CS2 match demos and rendering clean first-person videos from the same trajectories, EgoCS-400K creates a dataset of over 400,000 videos and 10,000 hours that connects visual observations directly to human actions, camera motion, states, and events for interactive world modeling.

What carries the argument

The replay-grounded construction pipeline that parses professional match demos, extracts rich annotations, and renders temporally aligned first-person videos.

If this is right

  • Action-conditioned future prediction becomes feasible at large scale.
  • State- and event-aware scene rollout can be trained on real human trajectories.
  • Replay-grounded captioning links language to game events and actions.
  • Agent egocentric action understanding improves with aligned viewpoint and input data.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Models trained on this data could transfer to real-world navigation or robotics if the game dynamics capture similar physics.
  • Future work might extend the dataset to other games or add multi-agent interactions for more complex world models.
  • The scale allows testing whether world models can handle long-horizon predictions in dynamic environments.

Load-bearing premise

Public professional match demos can be parsed, replayed, rendered, and aligned to produce first-person videos that faithfully represent human gameplay without major artifacts or loss of information.

What would settle it

A side-by-side comparison showing that models trained on the rendered EgoCS-400K videos achieve lower accuracy on future scene prediction than the same models trained on directly captured first-person footage from the same matches.

Figures

Figures reproduced from arXiv: 2606.18180 by Dong Liang, Fang Liu, Gerhard P. Hancke, Rongjin Guo, Rynson W. H. Lau, Tianyu Huang, Yuhao Liu.

Figure 1
Figure 1. Figure 1: Overview of EgoCS-400K. EgoCS-400K is a large-scale multi-grained egocentric Counter-Strike dataset built from first-person gameplay videos across CS:GO and CS2 maps, covering 10,000+ hours, 40,000+ rounds, 1,000+ matches, 13 maps, and 10 player viewpoints. The dataset provides a hierarchical organization from player-view sequences to segments, protected action chains, protected atomic actions, and per-tic… view at source ↗
Figure 2
Figure 2. Figure 2: Construction pipeline of EgoCS-400K. Starting from public Counter-Strike demos, we first collect raw replay files, render synchronized first-person videos, and filter invalid rendered videos. We then parse each demo into player-level per-tick data. From the parsed action and spatial signals, we construct keyboard/mouse annotations, atomic action sequences, protected action chains, and video segments. Final… view at source ↗
Figure 3
Figure 3. Figure 3: shows a four-second example with synchronized visual and annotation layers. The visualization includes sampled first-person frames, keyboard and mouse traces, action timeline, and the generated prompt for the same temporal window. This example illustrates how EgoCS-400K connects short visual changes with action structure: weapon switching, inspection, airborne movement, grenade preparation, projectile flig… view at source ↗
read the original abstract

The shift from video generation to interactive world modeling places new demands on data: beyond captioned videos, world models require temporally aligned video-action-language trajectories grounded in the actions, camera motion, states, and events that drive future scene changes. However, such data is difficult to obtain at scale. Web video datasets offer broad visual coverage but lack executable actions and reliable states; robotic datasets provide action and state supervision but are costly and limited in scene diversity; and existing simulators often lack large-scale human-driven interaction trajectories. In this paper, we introduce EgoCS-400K, a large-scale replay-grounded egocentric Counter-Strike dataset for world models, built from public professional CS and CS2 match demos that preserve human gameplay trajectories and enable parsing, replaying, rendering, and temporal alignment. We extract player states, view directions, movements, keyboard/button inputs, view-angle changes, weapon usage, game events, and round-level context, and render clean first-person videos from the same trajectories. EgoCS-400K contains over 400,000 first-person videos and 10,000 hours of gameplay from more than 1,000 matches and 40,000 rounds, covering 13 maps and 10 player viewpoints per round. It supports a range of interactive visual modeling tasks, including action-conditioned future prediction, state- and event-aware scene rollout, replay-grounded captioning, and agent egocentric action understanding. By connecting visual observations with human actions, camera motion, game states, and events at scale, EgoCS-400K serves as a practical bridge between passive web videos, controllable game simulation, and costly real-world embodied data.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The manuscript introduces EgoCS-400K, a large-scale egocentric dataset derived from public professional Counter-Strike and CS2 match demos. It claims to contain over 400,000 first-person videos and 10,000 hours of gameplay from more than 1,000 matches and 40,000 rounds across 13 maps, with extracted player states, view directions, movements, inputs, events, and rendered videos, enabling tasks such as action-conditioned future prediction, state- and event-aware scene rollout, replay-grounded captioning, and agent egocentric action understanding.

Significance. If the data fidelity holds, EgoCS-400K would provide a significant resource for world model research by supplying temporally aligned video-action-state trajectories at a scale far exceeding existing robotic or simulator datasets, while incorporating human gameplay diversity. The curation from public demos is a practical approach to achieving this scale.

major comments (1)
  1. abstract, data-construction paragraph: The central claim that the dataset supplies reliable trajectories for interactive modeling rests on the unverified assumption that public demos enable parsing, replaying, rendering, and temporal alignment without significant fidelity loss or artifacts. No quantitative validation, error rates, or ground-truth comparisons (e.g., action parsing accuracy or rendered vs. original frame fidelity) are provided, which is load-bearing for claims of utility in action-conditioned prediction and state-aware rollout.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the constructive feedback on the manuscript. We address the single major comment below.

read point-by-point responses
  1. Referee: abstract, data-construction paragraph: The central claim that the dataset supplies reliable trajectories for interactive modeling rests on the unverified assumption that public demos enable parsing, replaying, rendering, and temporal alignment without significant fidelity loss or artifacts. No quantitative validation, error rates, or ground-truth comparisons (e.g., action parsing accuracy or rendered vs. original frame fidelity) are provided, which is load-bearing for claims of utility in action-conditioned prediction and state-aware rollout.

    Authors: We agree that the manuscript does not currently include quantitative validation of parsing accuracy or rendered-frame fidelity, and that this is a substantive point for claims about reliable trajectories. The construction uses standard, publicly documented demo parsers for CS:GO/CS2 that are designed to produce bit-exact replays of the recorded inputs and states; however, we did not report error rates or direct comparisons in the submitted version. In the revision we will add a dedicated validation subsection that reports (1) action-parsing accuracy measured against manually annotated ground-truth on a held-out sample of rounds, (2) frame-level visual fidelity metrics between rendered videos and the original demo playback, and (3) a summary of any observed artifacts or alignment offsets. These additions will directly support the utility claims for action-conditioned prediction and state-aware rollout. revision: yes

Circularity Check

0 steps flagged

No circularity: dataset curation paper with no derivation or fitted quantities

full rationale

The paper introduces a dataset (EgoCS-400K) assembled from public CS/CS2 match demos. No equations, predictions, parameters, or derivation chain exist in the provided text. The central contribution is descriptive curation and task support statements; the data-construction pipeline is presented as an empirical process without self-referential reduction or load-bearing self-citation. This matches the default case of a self-contained dataset paper (score 0).

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The paper's contribution rests on the domain assumption that CS match demos contain sufficient parseable information for accurate extraction of states and actions; no free parameters, axioms beyond standard parsing, or invented entities are introduced.

pith-pipeline@v0.9.1-grok · 5854 in / 1276 out tokens · 50701 ms · 2026-06-27T01:36:42.617082+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

17 extracted references · 17 canonical work pages · 6 internal anchors

  1. [1]

    Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, Szymon Jakubczak, Tim Jones, Liyiming Ke, Sergey Levine, Adrian Li-Bell, Mohith Mothukuri, Suraj Nair, Karl Pertsch, Lucy Xiaoyang Shi, James Tanner, Quan Vuong, Anna Walling, Haohuan Wang, and Ury Zhilinsky.π0: A visio...

  2. [2]

    MineWorld: A real-time and open-source interactive world model on Minecraft.arXiv preprint arXiv:2504.08388, 2025

    Junliang Guo, Yang Ye, Tianyu He, Haoyu Wu, Yushu Jiang, Tim Pearce, and Jiang Bian. Mineworld: A real-time and open-source interactive world model on minecraft.arXiv preprint arXiv:2504.08388,

  3. [3]

    Weilbach, Martyna E

    Yingchen He, Christian D. Weilbach, Martyna E. Wojciechowska, Yiyuan Sun, Yuxuan Zhang, and Frank Wood. Plaicraft: Large-scale time-aligned vision-speech-action dataset for embodied ai.arXiv preprint arXiv:2505.12707,

  4. [4]

    Relic: Interactive video world model with long-horizon memory.arXiv preprint arXiv:2512.04040,

    Yicong Hong, Yiqun Mei, Chongjian Ge, Yiran Xu, Yang Zhou, Sai Bi, Yannick Hold-Geoffroy, Mike Roberts, Matthew Fisher, Eli Shechtman, Kalyan Sunkavalli, Feng Liu, Zhengqi Li, and Hao Tan. Relic: Interactive video world model with long-horizon memory.arXiv preprint arXiv:2512.04040,

  5. [5]

    HunyuanWorld 1.0: Generating immersive, explorable, and interactive 3D worlds from words or pixels.arXiv preprint arXiv:2507.21809,

    HunyuanWorld Team, Zhenwei Wang, Yuhao Liu, Junta Wu, Zixiao Gu, Haoyuan Wang, Xuhui Zuo, Tianyu Huang, Wenhuan Li, Sheng Zhang, et al. HunyuanWorld 1.0: Generating immersive, explorable, and interactive 3D worlds from words or pixels.arXiv preprint arXiv:2507.21809,

  6. [6]

    The Kinetics Human Action Video Dataset

    Will Kay, Joao Carreira, Karen Simonyan, Brian Zhang, Chloe Hillier, Sudheendra Vijayanarasimhan, Fabio Viola, Tim Green, Trevor Back, Paul Natsev, Mustafa Suleyman, and Andrew Zisserman. The kinetics human action video dataset. InarXiv preprint arXiv:1705.06950,

  7. [7]

    DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset

    Alexander Khazatsky, Karl Pertsch, Suraj Nair, Ashwin Balakrishna, Sudeep Dasari, Siddharth Karamcheti, Soroush Nasiriany, Mohan Kumar Srirama, Lawrence Yunliang Chen, Kirsty Ellis, et al. Droid: A large-scale in-the-wild robot manipulation dataset.arXiv preprint arXiv:2403.12945,

  8. [8]

    Hunyuan-gamecraft: High-dynamic interactive game video generation with hybrid history condition.arXiv preprint arXiv:2506.17201,

    Jiaqi Li, Junshu Tang, Zhiyong Xu, Longhuang Wu, Yuan Zhou, Shuai Shao, Tianbao Yu, Zhiguo Cao, and Qinglin Lu. Hunyuan-gamecraft: High-dynamic interactive game video generation with hybrid history condition.arXiv preprint arXiv:2506.17201,

  9. [9]

    Yume: An interactive world generation model.arXiv preprint arXiv:2507.17744,

    Xiaofeng Mao, Shaoheng Lin, Zhen Li, Chuanhao Li, Wenshuo Peng, Tong He, Jiangmiao Pang, Ming- min Chi, Yu Qiao, and Kaipeng Zhang. Yume: An interactive world generation model.arXiv preprint arXiv:2507.17744,

  10. [10]

    Worldcam: Interactive autoregressive 3d gaming worlds with camera pose as a unifying geometric representation.arXiv preprint arXiv:2603.16871,

    JisuNam,YicongHong,Chun-HaoPaulHuang,FengLiu,JoungBinLee,JiyoungKim,SiyoonJin,YunsungLee, Jaeyoon Jung, Suhwan Choi, Seungryong Kim, and Yang Zhou. Worldcam: Interactive autoregressive 3d gaming worlds with camera pose as a unifying geometric representation.arXiv preprint arXiv:2603.16871,

  11. [11]

    Open X-Embodiment: Robotic Learning Datasets and RT-X Models

    Abhishek Padalkar, Alex Pooley, Ajay Jain, Alex Bewley, Alexander Herzog, Alex Irpan, Alexander Khazatsky, Anant Rai, Anikait Singh, Anthony Brohan, et al. Open x-embodiment: Robotic learning datasets and rt-x models.arXiv preprint arXiv:2310.08864,

  12. [12]

    Movie Gen: A Cast of Media Foundation Models

    Adam Polyak, Amit Zohar, Andrew Brown, Andros Tjandra, Animesh Sinha, Ann Lee, Apoorv Vyas, Bowen Shi, Chih-Yao Ma, Ching-Yao Chuang, et al. Movie gen: A cast of media foundation models.arXiv preprint arXiv:2410.13720,

  13. [13]

    arXiv preprint arXiv:2509.21790 (2025)

    YuShang, LeiJin, YidingMa, XinZhang, ChenGao, WeiWu, andYongLi. Longscape: Advancinglong-horizon embodied world models with context-aware moe.arXiv preprint arXiv:2509.21790,

  14. [14]

    Hunyuan-gamecraft-2: Instruction-following interactive game world model

    Junshu Tang, Jiacheng Liu, Jiaqi Li, Longhuang Wu, Haoyu Yang, Penghao Zhao, Siruis Gong, Xiang Yuan, Shuai Shao, and Qinglin Lu. Hunyuan-gamecraft-2: Instruction-following interactive game world model. arXiv preprint arXiv:2511.23429,

  15. [15]

    Dani Valevski, Yaniv Leviathan, Moab Arar, and Shlomi Fruchter

    URLhttps://arxiv.org/ abs/2512.04797. Dani Valevski, Yaniv Leviathan, Moab Arar, and Shlomi Fruchter. Diffusion models are real-time game engines.arXiv preprint arXiv:2408.14837,

  16. [16]

    Matrix-Game 3.0: Real-Time and Streaming Interactive World Model with Long-Horizon Memory

    Zile Wang, Zexiang Liu, Jaixing Li, Kaichen Huang, Baixin Xu, Fei Kang, Mengyin An, Peiyu Wang, Biao Jiang, Yichen Wei, Yidan Xietian, Jiangbo Pei, Liang Hu, Boyi Jiang, Hua Xue, Zidong Wang, Haofeng Sun, Wei Li, Wanli Ouyang, Xianglong He, Yang Liu, Yangguang Li, and Yahui Zhou. Matrix-game 3.0: Real-time and streaming interactive world model with long-h...

  17. [17]

    Matrix-game: Interactive world foundation model.arXiv preprint arXiv:2506.18701, 2025

    Yifan Zhang, Chunli Peng, Boyang Wang, Puyi Wang, Qingcheng Zhu, Fei Kang, Biao Jiang, Zedong Gao, Eric Li, Yang Liu, and Yahui Zhou. Matrix-game: Interactive world foundation model.arXiv preprint arXiv:2506.18701,