Recognition: unknown
EAST: Early Action Prediction Sampling Strategy with Token Masking
Pith reviewed 2026-05-10 04:51 UTC · model grok-4.3
The pith
A single randomized sampling strategy during training lets one model generalize to any observation ratio for early action prediction.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors establish that sampling a random time step to separate observed from unobserved frames during training, combined with joint optimization on observed and future oracle representations, enables a single model to perform early action prediction at every observation ratio. Token masking reduces memory and compute with negligible accuracy cost. When paired with a forecasting decoder, the resulting EAST framework exceeds previous best accuracies by 10.1, 7.7, and 3.9 percentage points on NTU60, SSv2, and UCF101 respectively.
What carries the argument
The randomized training strategy that samples a variable time step separating observed and unobserved video frames, enabling one model to train for all ratios.
If this is right
- A single model replaces the need for separate models trained for each observation ratio.
- Accuracy improves by double-digit margins on NTU60 and by several points on SSv2 and UCF101.
- Training runs twice as fast and uses half the memory thanks to token masking.
- An encoder-only architecture becomes competitive when joint observed and oracle learning is used.
Where Pith is reading between the lines
- Real-time video systems could use one model across varying input lengths without retraining per scenario.
- The sampling idea could transfer to other partial-sequence tasks such as early event detection in sensor data.
- Combining the approach with stronger forecasting decoders might yield further gains on additional datasets.
Load-bearing premise
That randomly choosing different observation points during training will produce a model whose accuracy stays high at every possible test-time ratio without needing separate models or suffering sharp drops at very low ratios.
What would settle it
Test the trained EAST model at extreme low observation ratios (for example 10 percent of the video) and compare its accuracy to models trained specifically for those same low ratios; a large gap in favor of the per-ratio models would falsify the seamless generalization claim.
Figures
read the original abstract
Early action prediction seeks to anticipate an action before it fully unfolds, but limited visual evidence makes this task especially challenging. We introduce EAST, a simple and efficient framework that enables a model to reason about incomplete observations. In our empirical study, we identify key components when training early action prediction models. Our key contribution is a randomized training strategy that samples a time step separating observed and unobserved video frames, enabling a single model to generalize seamlessly across all test-time observation ratios. We further show that joint learning on both observed and future (oracle) representations significantly boosts performance, even allowing an encoder-only model to excel. To improve scalability, we propose a token masking procedure that cuts memory usage in half and accelerates training by 2x with negligible accuracy loss. Combined with a forecasting decoder, EAST sets a new state of the art on NTU60, SSv2, and UCF101, surpassing previous best work by 10.1, 7.7, and 3.9 percentage points, respectively.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces EAST, a framework for early action prediction that uses a randomized training strategy to sample a separation time step between observed and unobserved video frames, allowing a single model to generalize across all test-time observation ratios. It incorporates joint learning on observed and oracle (future) representations, a token masking procedure to reduce memory usage by half and accelerate training by 2x, and combines with a forecasting decoder to achieve new state-of-the-art results on NTU60, SSv2, and UCF101, surpassing prior best work by 10.1, 7.7, and 3.9 percentage points respectively.
Significance. If the empirical claims hold under detailed validation, the work would offer a practical advance in early action prediction by showing that one randomized sampling regimen can replace multiple ratio-specific models while adding efficiency gains from token masking. The reported margins are large enough to warrant attention if they prove robust to ablations and statistical checks.
major comments (2)
- [Abstract] Abstract: The central claim that a single randomized sampling strategy enables 'seamless' generalization to every test-time observation ratio rests on an unverified assumption. No details are given on the exact sampling distribution over time steps, per-ratio accuracy curves, or direct comparisons to ratio-specific baselines. If accuracy drops sharply at extreme ratios (as is common in this task), the single-model efficiency advantage does not hold and the headline gains may instead derive primarily from the forecasting decoder or oracle supervision.
- [Abstract] Abstract: Large absolute gains (10.1/7.7/3.9 pp) are reported without information on statistical significance, exact baseline implementations, data splits, number of runs, or whether the improvements survive ablation of the oracle component. This information is required to attribute improvements specifically to the proposed sampling strategy rather than other architectural choices.
minor comments (1)
- [Abstract] The abstract would be clearer if it briefly defined the token masking ratio and its relation to the randomized time-step sampling.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our submission. The comments highlight areas where additional clarity on the sampling strategy and empirical reporting will strengthen the presentation. We respond to each major comment below and indicate the revisions we will incorporate.
read point-by-point responses
-
Referee: [Abstract] Abstract: The central claim that a single randomized sampling strategy enables 'seamless' generalization to every test-time observation ratio rests on an unverified assumption. No details are given on the exact sampling distribution over time steps, per-ratio accuracy curves, or direct comparisons to ratio-specific baselines. If accuracy drops sharply at extreme ratios (as is common in this task), the single-model efficiency advantage does not hold and the headline gains may instead derive primarily from the forecasting decoder or oracle supervision.
Authors: We appreciate the referee's emphasis on verifying the generalization claim. The full manuscript (Section 3.2) specifies that the separation time step is sampled uniformly at random during training. Experiments include per-ratio accuracy curves (Figure 3) that demonstrate stable performance without sharp degradation at low or high observation ratios. Direct comparisons to ratio-specific baselines appear in Table 2 and the supplementary material, where the single EAST model matches or exceeds them. Ablations in Section 4.3 isolate the sampling strategy's contribution from the forecasting decoder and oracle supervision. We will revise the abstract to reference the uniform sampling and these supporting analyses. revision: partial
-
Referee: [Abstract] Abstract: Large absolute gains (10.1/7.7/3.9 pp) are reported without information on statistical significance, exact baseline implementations, data splits, number of runs, or whether the improvements survive ablation of the oracle component. This information is required to attribute improvements specifically to the proposed sampling strategy rather than other architectural choices.
Authors: We agree that these details are essential for proper attribution. In the revised manuscript we will augment the abstract and experimental section with: results averaged over three independent runs including standard deviations; explicit use of the standard NTU60/SSv2/UCF101 splits from prior work; descriptions of baseline re-implementations; and ablations confirming that gains persist after removing the oracle component (while still crediting the sampling strategy). We will also add notes on statistical significance testing. These additions directly address the request to isolate the sampling strategy's role. revision: yes
Circularity Check
No circularity: purely empirical method with benchmark validation
full rationale
The paper presents EAST as an empirical training strategy (randomized time-step sampling plus token masking) whose effectiveness is demonstrated solely through accuracy numbers on public datasets (NTU60, SSv2, UCF101). No equations, derivations, fitted parameters renamed as predictions, or self-citation chains appear in the provided text. The central claim—that one model generalizes across observation ratios—is an empirical assertion validated by reported SOTA margins rather than a tautological reduction to its own inputs. The work is therefore self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
free parameters (2)
- time-step sampling distribution
- token masking ratio
axioms (1)
- domain assumption Joint supervised learning on both observed and oracle future representations improves early-prediction accuracy
Reference graph
Works this paper leans on
- [1]
-
[2]
Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale.9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021,
2021
-
[3]
Harris and Mike Stephens
Christopher G. Harris and Mike Stephens. A combined corner and edge detector. In Christopher J. Taylor (ed.),Proceedings of the Alvey Vision Conference, AVC 1988, Manchester, UK, September, 1988, pp. 1–6. Alvey Vision Club,
1988
-
[4]
Sunil Hwang, Jaehong Yoon, Youngwan Lee, and Sung Ju Hwang
doi: 10.1109/TPAMI.2018.2863279. Sunil Hwang, Jaehong Yoon, Youngwan Lee, and Sung Ju Hwang. Everest: Efficient masked video autoencoder by removing redundant spatiotemporal tokens. InInternational Conference on Ma- chine Learning,
-
[5]
The Kinetics Human Action Video Dataset
Will Kay, Joao Carreira, Karen Simonyan, Brian Zhang, Chloe Hillier, Sudheendra Vijaya- narasimhan, Fabio Viola, Tim Green, Trevor Back, Paul Natsev, et al. The kinetics human action video dataset.arXiv preprint arXiv:1705.06950,
work page internal anchor Pith review arXiv
-
[6]
A discriminative model with multiple temporal scales for action prediction
Yu Kong, Dmitry Kit, and Yun Fu. A discriminative model with multiple temporal scales for action prediction. InComputer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13, pp. 596–611. Springer,
2014
-
[7]
Uniformer: Unified transformer for efficient spatiotemporal representation learning
Kunchang Li, Yali Wang, Peng Gao, Guanglu Song, Yu Liu, Hongsheng Li, and Yu Qiao. Uni- former: Unified transformer for efficient spatiotemporal representation learning.arXiv preprint arXiv:2201.04676, 2022a. Kunchang Li, Yali Wang, Yinan He, Yizhuo Li, Yi Wang, Limin Wang, and Yu Qiao. Uni- formerv2: Spatiotemporal learning by arming image vits with vide...
-
[8]
SGDR: Stochastic Gradient Descent with Warm Restarts
Ilya Loshchilov and Frank Hutter. Sgdr: Stochastic gradient descent with warm restarts.arXiv preprint arXiv:1608.03983,
-
[9]
Learning activity progression in lstms for activity detection and early detection
Shugao Ma, Leonid Sigal, and Stan Sclaroff. Learning activity progression in lstms for activity detection and early detection. InProceedings of the IEEE conference on computer vision and pattern recognition, pp. 1942–1950,
1942
-
[10]
Two-stream convolutional networks for action recogni- tion in videos
Karen Simonyan and Andrew Zisserman. Two-stream convolutional networks for action recogni- tion in videos. In Zoubin Ghahramani, Max Welling, Corinna Cortes, Neil D. Lawrence, and Kilian Q. Weinberger (eds.),Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems 2014, December 8-13 2014, Montreal, ...
2014
-
[11]
UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild
Khurram Soomro, Amir Roshan Zamir, and Mubarak Shah. Ucf101: A dataset of 101 human actions classes from videos in the wild.arXiv preprint arXiv:1212.0402,
work page internal anchor Pith review arXiv
-
[12]
Dauphin, and David Lopez-Paz
Hongyi Zhang, Moustapha Ciss´e, Yann N. Dauphin, and David Lopez-Paz. mixup: Beyond empiri- cal risk minimization. In6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings,
2018
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.