UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild

Amir Roshan Zamir; Khurram Soomro; Mubarak Shah

arxiv: 1212.0402 · v1 · submitted 2012-12-03 · 💻 cs.CV


Pith reviewed 2026-05-11 01:20 UTC · model grok-4.3

classification 💻 cs.CV

keywords UCF101 · human action recognition · video dataset · action classification · benchmark dataset · computer vision · bag of words · unconstrained videos

The pith

UCF101 supplies a dataset of 101 human action classes drawn from over 13,000 unconstrained video clips.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents UCF101 as the largest collection of video clips for human action recognition. It contains 101 classes, more than 13,000 clips, and 27 hours of footage taken from realistic user-uploaded videos that include camera motion and cluttered backgrounds. The authors report baseline results of 44.5 percent accuracy using a standard bag-of-words approach. They position the dataset as more challenging than prior collections because of its scale and the natural variability in the clips. The work supplies a new resource that allows algorithms to be tested under conditions closer to everyday video.

Core claim

UCF101 is currently the largest dataset of human actions. It consists of 101 action classes, over 13k clips and 27 hours of video data. The database consists of realistic user-uploaded videos containing camera motion and cluttered backgrounds. The authors provide baseline action recognition results on this new dataset using a standard bag-of-words approach, with an overall performance of 44.5 percent. To the best of the authors' knowledge, UCF101 is currently the most challenging dataset of actions because of its large number of classes, its large number of clips, and the unconstrained nature of those clips.

What carries the argument

The UCF101 dataset itself, organized into 101 action categories from web videos, together with the bag-of-words baseline that measures initial recognition performance.
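
The bag-of-words baseline is named here but not spelled out. As a rough illustration only, the sketch below shows the generic idea behind such a baseline: cluster local descriptors into a codebook of "visual words", then describe each clip by a normalized histogram of its nearest codewords. The descriptors are random stand-ins, and `build_codebook`, `encode_clip`, the codebook size, and the descriptor dimension are all illustrative assumptions, not the authors' pipeline. The feature detector and classifier used in the paper are not described on this page.

```python
import numpy as np

def build_codebook(descriptors, k, iters=20, seed=0):
    """Plain k-means: return k cluster centers (the "visual words")."""
    rng = np.random.default_rng(seed)
    # Fancy indexing copies the rows, so the input array is never modified.
    centers = descriptors[rng.choice(len(descriptors), size=k, replace=False)]
    for _ in range(iters):
        # Assign every descriptor to its nearest center.
        dists = np.linalg.norm(descriptors[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each center to the mean of its members (leave it if empty).
        for j in range(k):
            members = descriptors[labels == j]
            if len(members) > 0:
                centers[j] = members.mean(axis=0)
    return centers

def encode_clip(descriptors, centers):
    """Return an L1-normalized histogram of visual-word counts for one clip."""
    dists = np.linalg.norm(descriptors[:, None, :] - centers[None, :, :], axis=2)
    words = dists.argmin(axis=1)
    hist = np.bincount(words, minlength=len(centers)).astype(float)
    return hist / hist.sum()

# Synthetic stand-in data: 500 training descriptors of dimension 8 (assumed sizes).
rng = np.random.default_rng(0)
train_descriptors = rng.normal(size=(500, 8))
codebook = build_codebook(train_descriptors, k=16)

# One hypothetical clip with 40 local descriptors.
clip_hist = encode_clip(rng.normal(size=(40, 8)), codebook)
print(clip_hist.shape, clip_hist.sum())
```

In a full pipeline, a classifier trained on such per-clip histograms would produce the action label; that stage is omitted here because the page does not say which classifier was used.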

If this is right

Action recognition algorithms can now be evaluated on a larger number of classes and clips than in earlier datasets.
Methods must handle camera motion and background clutter to exceed the reported baseline.
Future comparisons of recognition systems can use the 44.5 percent figure as a reference point for this scale of data.
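
To put the 44.5 percent reference point in context, the short calculation below compares it with uniform random guessing over 101 classes and derives a rough mean clip length from the headline figures. It assumes balanced classes at test time, which this page does not confirm, and it uses 13,000 as the clip count even though the paper says "over 13k", so the clip length is an upper bound.

```python
NUM_CLASSES = 101          # from the paper
NUM_CLIPS = 13_000         # lower bound: the paper says "over 13k"
TOTAL_HOURS = 27           # from the paper
BASELINE_ACCURACY = 0.445  # reported bag-of-words baseline

# Uniform random guessing, ASSUMING every class is equally likely at test time.
chance_accuracy = 1.0 / NUM_CLASSES

# How many times better than chance the baseline is under that assumption.
ratio_over_chance = BASELINE_ACCURACY / chance_accuracy

# Upper bound on the mean clip length, since the true clip count exceeds 13,000.
max_mean_clip_seconds = TOTAL_HOURS * 3600 / NUM_CLIPS

print(f"chance accuracy:       {chance_accuracy:.4f}")
print(f"baseline / chance:     {ratio_over_chance:.1f}x")
print(f"mean clip length (<=): {max_mean_clip_seconds:.2f} s")
```

Under these assumptions chance is roughly 1 percent and the mean clip is at most about 7.5 seconds long, so the baseline is far above guessing while still leaving most clips misclassified.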

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Subsequent datasets would need to surpass 101 classes or 13k clips to claim greater difficulty on the same criteria.
The resource could support development of systems for video search or surveillance that operate on uncontrolled footage.

Load-bearing premise

The collected videos sufficiently represent the variability and challenges of unconstrained real-world human actions.

What would settle it

A demonstration that a much larger or more varied collection of action videos exists, or that the 44.5 percent baseline misstates the dataset's difficulty because of evaluation choices.

Original abstract

We introduce UCF101 which is currently the largest dataset of human actions. It consists of 101 action classes, over 13k clips and 27 hours of video data. The database consists of realistic user uploaded videos containing camera motion and cluttered background. Additionally, we provide baseline action recognition results on this new dataset using standard bag of words approach with overall performance of 44.5%. To the best of our knowledge, UCF101 is currently the most challenging dataset of actions due to its large number of classes, large number of clips and also unconstrained nature of such clips.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

UCF101 is a practical dataset release that gives the field a bigger, messier benchmark than prior UCF sets, but the 'most challenging' claim sits on scale and one baseline number without the comparisons that would make it stick.

The core of this paper is the release of UCF101: 101 action classes, roughly 13,000 clips, and 27 hours of YouTube video. That is a clear step up in size and realism from the earlier UCF collections, and the authors supply a standard bag-of-words baseline that reaches 44.5 percent. For anyone building or testing action recognition pipelines, having a larger, unconstrained collection with camera motion and background clutter is immediately useful as a new yardstick.

Referee Report

2 major / 1 minor

Summary. The paper introduces UCF101 as the largest dataset of human actions, containing 101 classes, over 13,000 video clips, and 27 hours of data from realistic, unconstrained YouTube videos that include camera motion and cluttered backgrounds. It reports a baseline action recognition accuracy of 44.5% using a standard bag-of-words approach and claims that UCF101 is the most challenging action dataset due to its scale and unconstrained nature.

Significance. The release of a large-scale action recognition dataset with realistic video conditions would provide a valuable benchmark for the computer vision community if the data collection and baseline are fully documented. The 101-class scale extends prior work, but the significance of the 'most challenging' positioning depends on whether the low baseline accuracy is shown to stem from the added variability rather than pipeline specifics.

major comments (2)

[Abstract] Abstract: The assertion that UCF101 is 'currently the most challenging dataset of actions' rests on its descriptive attributes (101 classes, >13k clips, unconstrained YouTube videos) together with the 44.5% bag-of-words baseline, yet no equivalent bag-of-words numbers are provided on prior datasets such as UCF50 or HMDB51. Without these anchors the difficulty ranking remains an untested assertion.
[Baseline results] Baseline evaluation: The manuscript states an overall performance of 44.5% but supplies no details on the train/test splits, evaluation protocol (e.g., cross-validation folds or leave-one-out), or any measure of variance. This omission prevents assessment of whether the reported accuracy fairly demonstrates the dataset's difficulty.

minor comments (1)

[Abstract] The abstract and introduction should explicitly list the exact number of videos per class and any class-balance statistics to allow readers to judge diversity.
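
The class-balance statistics requested in the minor comment could be computed along the following lines. This is a minimal sketch: `class_balance` is a hypothetical helper, and the toy labels are made up for illustration and are not real UCF101 counts.

```python
from collections import Counter

def class_balance(labels):
    """Return per-class clip counts plus simple balance statistics."""
    counts = Counter(labels)
    sizes = sorted(counts.values())
    return {
        "num_classes": len(counts),
        "min_clips": sizes[0],
        "max_clips": sizes[-1],
        "mean_clips": sum(sizes) / len(sizes),
        # 1.0 means perfectly balanced; larger values mean more skew.
        "imbalance_ratio": sizes[-1] / sizes[0],
        "counts": dict(counts),
    }

# Hypothetical labels for illustration only; NOT real UCF101 counts.
toy_labels = ["Archery"] * 5 + ["Biking"] * 3 + ["Diving"] * 4
stats = class_balance(toy_labels)
print(stats)
```

Reporting the minimum, maximum, and mean clips per class alongside the full count table would let readers judge diversity at a glance.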

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript introducing the UCF101 dataset. We address each major comment below and will revise the paper to improve clarity and support for our claims.

Referee: [Abstract] Abstract: The assertion that UCF101 is 'currently the most challenging dataset of actions' rests on its descriptive attributes (101 classes, >13k clips, unconstrained YouTube videos) together with the 44.5% bag-of-words baseline, yet no equivalent bag-of-words numbers are provided on prior datasets such as UCF50 or HMDB51. Without these anchors the difficulty ranking remains an untested assertion.

Authors: We agree that including bag-of-words baseline results on UCF50 and HMDB51 would provide stronger quantitative support for the relative difficulty claim. Our positioning of UCF101 as the most challenging is grounded in its objectively larger scale and the realistic, unconstrained video conditions (camera motion, cluttered backgrounds) that exceed those in prior datasets. In the revised manuscript, we will add a comparison table with baseline accuracies obtained using the identical bag-of-words pipeline on UCF50 and HMDB51 to enable direct assessment.

Revision: yes

Referee: [Baseline results] Baseline evaluation: The manuscript states an overall performance of 44.5% but supplies no details on the train/test splits, evaluation protocol (e.g., cross-validation folds or leave-one-out), or any measure of variance. This omission prevents assessment of whether the reported accuracy fairly demonstrates the dataset's difficulty.

Authors: We apologize for the insufficient detail in the baseline description. The reported 44.5% accuracy is the mean over the three standard train/test splits released with UCF101. We will expand the experimental section in the revised manuscript to explicitly describe the evaluation protocol, including the use of the three splits, the averaging procedure, and the standard deviation across splits to quantify variance.

Revision: yes
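
The averaging procedure described in this response, a mean over three splits reported with the standard deviation across splits, can be sketched as follows. The per-split accuracies are invented for illustration and are not results from the paper; `summarize_splits` is a hypothetical helper name.

```python
import statistics

def summarize_splits(split_accuracies):
    """Return (mean, sample standard deviation) of per-split accuracies."""
    mean = statistics.mean(split_accuracies)
    # Sample standard deviation (n - 1 in the denominator), a common
    # choice when only a handful of splits is available.
    std = statistics.stdev(split_accuracies)
    return mean, std

# Hypothetical accuracies for three train/test splits; NOT reported numbers.
toy_split_accuracies = [0.44, 0.45, 0.46]
mean_acc, std_acc = summarize_splits(toy_split_accuracies)
print(f"accuracy: {mean_acc:.3f} +/- {std_acc:.3f} over {len(toy_split_accuracies)} splits")
```

With only three splits the standard deviation is a coarse estimate, so listing the per-split numbers next to the mean gives readers more to work with.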

Circularity Check

0 steps flagged

No circularity: dataset release with direct empirical baseline

The manuscript introduces UCF101 by reporting collection statistics (101 classes, >13k clips, 27 hours) and a single standard bag-of-words baseline result of 44.5%. No equations, fitted parameters, predictions, or derivations exist that could reduce to the inputs by construction. Claims of scale and challenge rest on descriptive counts and qualitative video-source description rather than any self-referential loop or self-citation chain. The baseline is an external standard method applied once; it is not a fitted quantity renamed as a prediction. The work is therefore self-contained as a data release and baseline evaluation.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical dataset introduction paper with no mathematical derivations, free parameters, axioms, or invented entities.

Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

VEBench:Benchmarking Large Multimodal Models for Real-World Video Editing
cs.CV 2026-05 unverdicted novelty 8.0

VEBENCH is the first benchmark evaluating LMMs on video editing technique recognition and operation simulation using 3.9K videos and 3,080 QA pairs, revealing a large performance gap to humans.
MASS: Motion-Aligned Selective Scan for Refinement in Flow-Based Video Frame Interpolation
cs.CV 2026-06 unverdicted novelty 7.0

MASS reformulates SSM-based feature scanning in flow-based VFI to follow dynamic motion trajectories via learnable path integration and velocity-aware sampling, claiming SOTA on challenging large-displacement cases.
Forget, Anticipate and Adapt: Test Time Training for Long Videos
cs.CV 2026-06 unverdicted novelty 7.0

FFN enables efficient TTT for long videos by operating on three frames and using a surprise-based adaptive window, shown on a new dataset of up to 3-hour videos for segmentation and classification tasks.
T-VSS: Test-Time Visual Subspace Steering for Adversarial Robustness of Vision-Language Models
cs.CV 2026-06 unverdicted novelty 7.0

T-VSS is a lightweight test-time defense that steers attacked visual features in VLMs using sample-specific low-rank subspaces and reliability-weighted entropy minimization to improve robustness.
Semantic Robustness Certification for Vision-Language Models
cs.LG 2026-06 unverdicted novelty 7.0

Framework certifies VLM robustness under semantic transformations via text prompt proxies, enabling quantitative certification of safe extent intervals without per-variation data.
A New Multi-Domain Benchmark for Micro-Action Recognition and Detection
cs.CV 2026-06 unverdicted novelty 7.0

MMA-82 is a multi-domain benchmark with 82 micro-action categories, 77,856 instances from 454 subjects, and protocols for recognition and multi-label detection tasks including cross-domain and few-shot settings.
FS-DVS: A Frequency-Selective Dynamic Visual Sensing Paradigm for Enhancing Information Completeness
cs.CV 2026-06 unverdicted novelty 7.0

FS-DVS inserts a learnable spatial filter before DVS event triggering; the filter converges to center-surround kernels that emphasize mid-spatial frequencies and improve downstream detection and recognition.
VidMsg: A Benchmark for Implicit Message Inference in Short Videos
cs.CV 2026-06 unverdicted novelty 7.0

VidMsg is a new benchmark dataset and QA/retrieval tasks for implicit message inference in short videos, where current models perform poorly.
Diffusing in the Right Space: A Systematic Study of Latent Diffusability
cs.CV 2026-06 unverdicted novelty 7.0

A large-scale empirical study across tokenizers and diffusion backbones identifies Velocity Irreducible Variance (VIV) as one of the most stable predictors of latent diffusion generation quality.
Moment-Video: Diagnosing Temporal Fidelity of Video MLLMs on Momentary Visual Events
cs.CV 2026-06 conditional novelty 7.0

Moment-Video benchmark shows top video MLLM achieves only 39.6% accuracy on momentary visual event tasks, with most open-source models below 25%.
An Attribute-Based Measure of Video Complexity
cs.CV 2026-05 unverdicted novelty 7.0

VideoABC estimates video-LLM failure probability via low-dimensional attribute projection, dual quantization (k-means plus lattice), and psychophysics-inspired synthetic data.
USV: Towards Understanding the User-generated Short-form Videos
cs.CV 2026-05 unverdicted novelty 7.0

Introduces the USV dataset of 224K short user-generated videos and benchmarks topic recognition plus video-text retrieval with MMF-Net and VTCL baselines.
PERL: Parameter Efficient Reasoning in CLIP Latent Space
cs.CV 2026-05 unverdicted novelty 7.0

PERL augments frozen CLIP with a shared recurrent reasoning module of roughly 6K parameters that iteratively refines representations via latent token injection, delivering strong base-to-novel and transfer performance...
Neutral-Reference Prompting for Vision-Language Models
cs.CV 2026-05 unverdicted novelty 7.0

NeRP corrects asymmetric class confusion in VLMs for unseen classes by combining neutral-prompt priors with sample likelihood to flip predictions on confusable pairs, improving new-class accuracy while preserving base...
TeDiO: Temporal Diagonal Optimization for Training-Free Coherent Video Diffusion
cs.CV 2026-05 unverdicted novelty 7.0

TeDiO regularizes temporal diagonals in diffusion transformer attention maps to produce smoother video motion while keeping per-frame quality intact.
Unlocking Patch-Level Features for CLIP-Based Class-Incremental Learning
cs.CV 2026-05 unverdicted novelty 7.0

SPA unlocks patch-level features in CLIP for class-incremental learning via semantic-guided selection and optimal transport alignment with class descriptions, plus projectors and pseudo-feature replay to reduce forgetting.
STAR: Semantic-Temporal Adaptive Representation Learning for Few-Shot Action Recognition
cs.CV 2026-05 conditional novelty 7.0

STAR improves 1-shot action recognition by up to 8.1% on SSv2-Full through semantic-temporal alignment and Mamba-based prototype refinement.
Cross-Modal-Domain Generalization Through Semantically Aligned Discrete Representations
cs.CV 2026-05 unverdicted novelty 7.0

CoDAAR creates a unified discrete representation space for multimodal sequences by aligning modality-specific codebooks through index-level semantic consensus, enabling both specificity and cross-modal generalization.
Cross-Modal-Domain Generalization Through Semantically Aligned Discrete Representations
cs.CV 2026-05 unverdicted novelty 7.0

CoDAAR aligns modality-specific codebooks at the index level using Discrete Temporal Alignment and Cascading Semantic Alignment to achieve cross-modal generalization while preserving unique structures, reporting state...
Overcoming Catastrophic Forgetting in Visual Continual Learning with Reinforcement Fine-Tuning
cs.CV 2026-05 unverdicted novelty 7.0

RaPO reduces catastrophic forgetting in visual continual learning by shaping rewards around policy drift and stabilizing advantages with cross-task exponential moving averages during reinforcement fine-tuning of multi...
GRPO-TTA: Test-Time Visual Tuning for Vision-Language Models via GRPO-Driven Reinforcement Learning
cs.CV 2026-05 unverdicted novelty 7.0

GRPO-TTA applies GRPO to test-time visual tuning of vision-language models via group-wise policy optimization on unlabeled class candidates, outperforming prior TTA methods especially under natural distribution shifts.
VEBench:Benchmarking Large Multimodal Models for Real-World Video Editing
cs.CV 2026-05 unverdicted novelty 7.0

VEBENCH is the first benchmark with 3.9K videos and 3,080 human-verified QA pairs that measures LMMs on video editing technique recognition and operation simulation, revealing a large gap to human performance.
VAnim: Rendering-Aware Sparse State Modeling for Structure-Preserving Vector Animation
cs.CV 2026-05 unverdicted novelty 7.0

VAnim creates open-domain text-to-SVG animations via sparse state updates on a persistent DOM tree, identification-first planning, and rendering-aware RL with a new 134k-example benchmark.
E2E-WAVE: End-to-End Learned Waveform Generation for Underwater Video Multicasting
eess.SP 2026-04 unverdicted novelty 7.0

E2E-WAVE achieves +5 dB PSNR and real-time 16 FPS 128x128 video over 2.3 kbps underwater channels by learning waveforms that favor semantic similarity on decoding errors.
Inductive Convolution Nuclear Norm Minimization for Tensor Completion with Arbitrary Sampling
cs.CV 2026-04 unverdicted novelty 7.0

ICNNM reformulates CNNM using pre-learned shared convolution eigenvectors to bypass SVD computations, significantly reducing time while improving recovery performance for tensor completion with arbitrary sampling.
Why Training-Free Token Reduction Collapses: The Inherent Instability of Pairwise Scoring Signals
cs.AI 2026-04 unverdicted novelty 7.0

Pairwise scoring signals in Vision Transformer token reduction are inherently unstable due to high perturbation counts and degrade in deep layers, causing collapse, while unary signals with triage enable CATIS to reta...
Efficient Video Diffusion Models: Advancements and Challenges
cs.CV 2026-04 unverdicted novelty 7.0

A survey that groups efficient video diffusion methods into four paradigms—step distillation, efficient attention, model compression, and cache/trajectory optimization—and outlines open challenges for practical use.
Improving Sparse Autoencoder with Dynamic Attention
cs.LG 2026-04 unverdicted novelty 7.0

A cross-attention SAE with sparsemax attention achieves lower reconstruction loss and higher-quality concepts than fixed-sparsity baselines by making activation counts data-dependent.
Learnable Motion-Focused Tokenization for Effective and Efficient Video Unsupervised Domain Adaptation
cs.CV 2026-04 unverdicted novelty 7.0

LMFT enables state-of-the-art performance in video unsupervised domain adaptation by focusing on motion-rich tokens and reducing computational overhead.
CLIP-Inspector: Model-Level Backdoor Detection for Prompt-Tuned CLIP via OOD Trigger Inversion
cs.CR 2026-04 unverdicted novelty 7.0

CLIP-Inspector reconstructs OOD triggers to detect backdoors in prompt-tuned CLIP models with 94% accuracy and higher AUROC than baselines, plus a repair step via fine-tuning.
InstrAct: Towards Action-Centric Understanding in Instructional Videos
cs.CV 2026-04 unverdicted novelty 7.0

InstrAction pretrains video foundation models using action-centric data filtering, hard negatives, an Action Perceiver module, DTW-Align, and Masked Action Modeling to reduce static bias and outperform prior models on...
Learning from Synthetic Data via Provenance-Based Input Gradient Guidance
cs.CV 2026-04 unverdicted novelty 7.0

A framework that applies provenance-based guidance to input gradients during synthetic data training to promote learning from target regions only.
FrameDiT: Diffusion Transformer with Matrix Attention for Efficient Video Generation
cs.CV 2026-03 unverdicted novelty 7.0

FrameDiT proposes Matrix Attention for DiTs to achieve SOTA video generation with improved temporal coherence and efficiency comparable to local factorized attention.
Adapting MLLMs for Nuanced Video Retrieval
cs.CV 2025-12 unverdicted novelty 7.0

Text-only contrastive fine-tuning of an MLLM with hard negatives produces embeddings that handle temporal, negation, and multimodal nuances in video retrieval and achieves SOTA performance.
SIV-Bench: A Video Benchmark for Social Interaction Understanding and Reasoning
cs.CV 2025-06 conditional novelty 7.0

SIV-Bench is a new video benchmark with 2,792 clips and 5,455 QA pairs that evaluates MLLMs on social scene understanding, state reasoning, and dynamics prediction using social relation theory.
Zero-shot Concept Bottleneck Models
cs.LG 2025-02 unverdicted novelty 7.0

Z-CBMs achieve zero-shot interpretable predictions by retrieving concepts from a million-vocabulary web bank via cross-modal search and regressing labels with sparse linear regression.
OpenVid-1M: A Large-Scale High-Quality Dataset for Text-to-video Generation
cs.CV 2024-07 unverdicted novelty 7.0

OpenVid-1M supplies 1 million high-quality text-video pairs and introduces MVDiT to improve text-to-video generation by better using both visual structure and text semantics.
NetTailor: Tuning the Architecture, Not Just the Weights
cs.CV 2019-06 unverdicted novelty 7.0

NetTailor adapts CNN architecture for new tasks by assembling pre-trained universal blocks with task-specific layers, trained via activation mimicry and complexity penalties to match accuracy while reducing size for s...
The Kinetics Human Action Video Dataset
cs.CV 2017-05 accept novelty 7.0

Kinetics is a new video dataset of 400 human actions with over 160000 ten-second clips collected from YouTube, accompanied by baseline action-classification results from neural networks.
AdaBoosting Text Prompts for Vision-Language Models
cs.LG 2026-07 unverdicted novelty 6.0

TPB is an AdaBoost-style ensemble method for text prompts in VLMs that improves few-shot accuracy by targeting hard examples and maintains gains across model transfers.
Decompose, Compare, and Decide: Multimodal LLMs are Implicit Few-Shot Learners
cs.CV 2026-06 unverdicted novelty 6.0

DeCoDe decomposes few-shot classification into binary pairwise image comparisons whose affirmative logits serve as similarity scores, enabling strong performance from unmodified MLLMs on twelve datasets.
Forget, Anticipate and Adapt: Test Time Training for Long Videos
cs.CV 2026-06 unverdicted novelty 6.0

FFN performs TTT on multi-hour videos by restricting updates to three frames and using a surprise metric for adaptive window sizing, plus a new EpicTours dataset.
TACO: Towards Task-Consistent Open-Vocabulary Adaptation in Video Recognition
cs.CV 2026-06 unverdicted novelty 6.0

TACO achieves SOTA on video recognition benchmarks by regularizing relative representation geometry and decoupling optimization from test-time representations to address training-evaluation inconsistency in CLIP adaptation.
Modality-Aware Out-of-Distribution Detection for Multi-Modal Action Recognition
cs.CV 2026-06 unverdicted novelty 6.0

A modality-aware post-hoc detector for multi-modal OOD detection in action recognition combines uni-modal prediction relationships with feature-space scores and outperforms prior methods on the MultiOOD benchmark.
Black-Box Continual Learning for Vision-Language Models
cs.CV 2026-06 unverdicted novelty 6.0

Introduces Black-CL black-box benchmark and BETA textual-prototype method that matches or exceeds white-box continual learning performance on ten datasets using 0.05M parameters.
Gold Points Sniper: Self-guided Visual Reasoning in VLM for Fine-grained Action Understanding
cs.CV 2026-06 unverdicted novelty 6.0

GPS framework adds self-guided reasoning modules to lightweight VLMs for fine-grained action understanding, claiming performance near GPT-4o with better factual accuracy on a custom CAP-based dataset.
TimeProVe: Propose, then Verify for Efficient Long Video Temporal Reasoning in Activities of Daily Living
cs.CV 2026-06 unverdicted novelty 6.0

TimeProVe proposes a propose-then-verify framework using lightweight action-based candidate evidence generation followed by targeted VLM verification for efficient long video temporal reasoning, achieving 7.3% improve...
TivTok: Broadcasting Time-Invariant Tokens for Scalable Video Tokenization
cs.CV 2026-06 unverdicted novelty 6.0

TivTok factorizes video clips into reusable time-invariant tokens and frame-specific time-variant tokens via Scope-Induced Factorization and Invariant Broadcasting, achieving 2.91x better compression for 128-frame vid...
RepWAM: World Action Modeling with Representation Visual-Action Tokenizers
cs.CV 2026-06 unverdicted novelty 6.0

RepWAM introduces representation visual-action tokenizers to pretrain world action models that jointly model future visual states and latent actions under instructions for improved robot manipulation.
Semantically-Aware Diver Activity Recognition Framework for Effective Underwater Multi-Human-Robot Collaboration
cs.RO 2026-06 unverdicted novelty 6.0

DAR-Net applies transformer temporal reasoning with pixel-level semantic supervision to classify six diver activities on a new 2,600-image UDA dataset, reporting better accuracy than prior models in controlled tests.
Information-Theoretic Decomposition for Multimodal Interaction Learning
cs.LG 2026-06 unverdicted novelty 6.0

DMIL is a multimodal learning framework that decomposes sample-specific interactions into redundant, unique, and synergistic components via variational architecture and uses them for adaptive fine-tuning.
Hybrid Robustness Verification for Spatio-Temporal Neural Networks
cs.CV 2026-06 unverdicted novelty 6.0

STBP computes exact closed-form bounds for the first convolutional layer of spatio-temporal networks and propagates scalable approximations through the rest to certify robustness under subset-frame or patch perturbations.
Adaptive Tokenisation Via Temporal Redundancy Masking And Latent Inpainting
cs.CV 2026-06 unverdicted novelty 6.0

A parameter-free approach drops redundant video tokens via temporal L1 differences in frozen latent space and reconstructs them with LIT, yielding 31x speedup over ElasticTok-CV on TokenBench and DAVIS.
Geometry-Preserving Unsupervised Alignment for Heterogeneous Foundation Models
cs.CV 2026-06 unverdicted novelty 6.0

GPUA learns an orthogonal mapping from VFM to VLM feature space to preserve geometry and improve cross-model compatibility for zero-shot recognition and segmentation.
Before Fusion, Ask What to Keep: Contextual Calibration of Multimodal Signals
cs.LG 2026-06 unverdicted novelty 6.0

A pre-fusion calibration module modulates multimodal features using cross-modality support and conflict cues to improve performance on five benchmarks including sentiment analysis and audio-visual tasks.
AREA: Attribute Extraction and Aggregation for CLIP-Based Class-Incremental Learning
cs.CV 2026-05 unverdicted novelty 6.0

AREA stabilizes attribute extraction with principal geodesic analysis on hyperspherical space and aggregation with lightweight task experts plus variational bottleneck and optimal transport routing, outperforming SOTA...
Mining Multi-Modality Spatio-Temporal Cues for Video Important Person Identification
cs.CV 2026-05 unverdicted novelty 6.0

Introduces VIP identification task, releases Temporal-VIP dataset, and presents VIP-Net framework that achieves 67.3% accuracy on identifying important persons in videos while providing rationale similarity of 0.63.
Tensor Memory: Fixed-Size Recurrent State for Long-Horizon Transformers
cs.CV 2026-05 unverdicted novelty 6.0

Tensor Memory augments Transformers with a constant-size 3D voxel grid using differentiable soft writes at predicted locations, local interaction, and gated recurrent dynamics to decouple memory capacity from sequence length.
Closed-Loop Bidirectional Prompting for Adversarial Robustness of Vision Language Models
cs.CV 2026-05 unverdicted novelty 6.0

Introduces Closed-Loop Bidirectional Prompting with Semantic Anchor for cross-modal agreement recovery, claiming SOTA adversarial robustness and generalization on 11 datasets.
UAV-OVO: Out-of-Viewpoint Generalization in UAV Action Recognition
cs.CV 2026-05 unverdicted novelty 6.0

UAV-OVO benchmark exposes large ID/OOD performance gaps in video action recognition due to low-to-high depression viewpoint shifts, and LATER uses LoRA subspace anchoring for test-time feature re-centering to reduce drift.

Reference graph

Works this paper leans on

13 extracted references · 13 canonical work pages · cited by 147 Pith papers

[1] K-lite codec package. http://codecguide.com/

[2] YouTube. http://www.youtube.com/

[3] M. Blank, L. Gorelick, E. Shechtman, M. Irani, and R. Basri. Actions as space-time shapes, 2005. International Conference on Computer Vision (ICCV).

[4] G. Johansson, S. Bergstrom, and W. Epstein. Perceiving events and objects, 1994. Lawrence Erlbaum Associates.

[5] H. Kuehne, H. Jhuang, E. Garrote, T. Poggio, and T. Serre. HMDB: A large video database for human motion recognition, 2011. International Conference on Computer Vision (ICCV).

[6] J. Liu, J. Luo, and M. Shah. Recognizing realistic actions from videos in the wild, 2009. IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[7-8] M. Marszałek, I. Laptev, and C. Schmid. Actions in context, 2009. IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[9] J. Niebles, C. Chen, and L. Fei-Fei. Modeling temporal structure of decomposable motion segments for activity classification, 2010. European Conference on Computer Vision (ECCV).

[10] K. Reddy and M. Shah. Recognizing 50 human action categories of web videos, 2012. Machine Vision and Applications Journal (MVAP).

[11] M. Rodriguez, J. Ahmed, and M. Shah. Action MACH: A spatiotemporal maximum average correlation height filter for action recognition, 2008. IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[12] C. Schuldt, I. Laptev, and B. Caputo. Recognizing human actions: A local SVM approach, 2004. International Conference on Pattern Recognition (ICPR).

[13] D. Weinland, E. Boyer, and R. Ronfard. Action recognition from arbitrary views using 3D exemplars, 2007. International Conference on Computer Vision (ICCV).
