{"work":{"id":"c8a3de61-cfd3-4aeb-bcf7-a0372c015748","openalex_id":null,"doi":null,"arxiv_id":"1705.06950","raw_key":null,"title":"The Kinetics Human Action Video Dataset","authors":null,"authors_text":"Will Kay, Joao Carreira, Karen Simonyan, Brian Zhang, Chloe Hillier, Sudheendra Vijayanarasimhan","year":2017,"venue":"cs.CV","abstract":"We describe the DeepMind Kinetics human action video dataset. The dataset contains 400 human action classes, with at least 400 video clips for each action. Each clip lasts around 10s and is taken from a different YouTube video. The actions are human focussed and cover a broad range of classes including human-object interactions such as playing instruments, as well as human-human interactions such as shaking hands. We describe the statistics of the dataset, how it was collected, and give some baseline performance figures for neural network architectures trained and tested for human action classification on this dataset. We also carry out a preliminary analysis of whether imbalance in the dataset leads to bias in the classifiers.","external_url":"https://arxiv.org/abs/1705.06950","cited_by_count":null,"metadata_source":"pith","metadata_fetched_at":"2026-05-25T15:17:01.836953+00:00","pith_arxiv_id":"1705.06950","created_at":"2026-05-09T04:17:20.426003+00:00","updated_at":"2026-06-05T21:23:00.469572+00:00","title_quality_ok":true,"display_title":"The Kinetics Human Action Video Dataset","render_title":"The Kinetics Human Action Video Dataset"},"hub":{"state":{"work_id":"c8a3de61-cfd3-4aeb-bcf7-a0372c015748","tier":"super_hub","tier_reason":"100+ Pith inbound or 10,000+ external citations","pith_inbound_count":110,"external_cited_by_count":null,"distinct_field_count":7,"first_pith_cited_at":"2019-06-27T03:04:16+00:00","last_pith_cited_at":"2026-05-22T07:01:23+00:00","author_build_status":"needed","summary_status":"needed","contexts_status":"needed","graph_status":"needed","ask_index_status":"needed","reader_status":"not_needed","recognition_status":"not_needed","updated_at":"2026-06-06T08:40:32.916231+00:00","tier_text":"super_hub"},"tier":"super_hub","role_counts":[{"context_role":"dataset","n":16},{"context_role":"background","n":9}],"polarity_counts":[{"context_polarity":"use_dataset","n":15},{"context_polarity":"background","n":10}],"runs":{"ask_index":{"job_type":"ask_index","status":"succeeded","result":{"title":"The Kinetics Human Action Video Dataset","claims":[{"claim_text":"We describe the DeepMind Kinetics human action video dataset. The dataset contains 400 human action classes, with at least 400 video clips for each action. Each clip lasts around 10s and is taken from a different YouTube video. The actions are human focussed and cover a broad range of classes including human-object interactions such as playing instruments, as well as human-human interactions such as shaking hands. We describe the statistics of the dataset, how it was collected, and give some baseline performance figures for neural network architectures trained and tested for human action class","claim_type":"abstract","evidence_strength":"source_metadata"},{"claim_text":"as FID [187] and CLIP Score [188], with evaluation conducted on datasets including COCO-30K [89] and ImageNet [90]. Video generation usually uses feature methods like I3D [189] , with evaluation conducted on Kinetics-400 [190]. (2) Consistency evaluationexamines the logical coherence of generated content under different con- ditions. This includes identity preservation tests on CelebA-HQ [191], action consistency tests on UCF- 101 [192], and object state change evaluation on the Something-Someth","claim_type":"dataset","confidence":0.95,"evidence_strength":"citation_context"},{"claim_text":"Evaluation Metrics.We evaluate the reconstructed videos at semantic, spatiotemporal, and pixel levels [20, 25]. For semantic-level evaluation, we computeN-way top- Kaccuracy to assess whether the generated videos seman- tically match the ground-truth (GT) clips, using a Video- MAE [92]-based classifier on 400 video classes from the Kinetics-400 dataset [42], following prior work [20, 25, 98]. For spatiotemporal-level evaluation, we use CLIP temporal consistency (CLIP-pcc) [77] and DINO [72] temp","claim_type":"dataset","confidence":0.95,"evidence_strength":"citation_context"},{"claim_text":"V AE [27], a 3D causal convolutional architecture employed in Open-Sora Plan 1.5 [ 28]; (2) Open-Sora V AE [55]; (3) CV-V AE [54]; (4) OD-V AE [8] used in Open-Sora Plan 1.2 [28]; (5) SVD-V AE [3], which operates without temporal compression; and (6) SD-V AE [34], a widely-adopted image V AE baseline. Datasets and Evaluation Metrics.We utilize the Kinetics- 400 dataset [21] for model training and validation. For eval- uation, we perform zero-shot testing on the Panda-70M [9] and WebVid-10M [2] d","claim_type":"dataset","confidence":0.95,"evidence_strength":"citation_context"},{"claim_text":"OrderSelf-Similarity), that learns distinct representations of STSSs at diverse orders and integrates them into holistic motion features. The proposed module is lightweight and can be easily integrated into existing video architectures, enhancing temporal modeling capabilities across various domains (Fig. 1b). We first evaluate our method on diverse action recognition benchmarks,i.e., Kinetics-400 [31], Something-Something V1 & V2 [21,54], Diving48 [43], and FineGym [64], demonstrating significa","claim_type":"dataset","confidence":0.95,"evidence_strength":"citation_context"},{"claim_text":"Thus, inference explicitly couples the two sides of our bridge: zb carries pose-anchored semantics recovered from the skeleton extraction process, while ˜t provides skeleton-compatible semantic targets, jointly reducing the skeleton-text semantic gap. 4 Experiments 4.1 Experimental Setup Datasets. We evaluate PoseBridge on NTU-RGB+D 60 [ 24], NTU-RGB+D 120 [20], PKU-MMD [18], and Kinetics-200/400 [12, 35]. NTU-RGB+D 60/120 and PKU-MMD are controlled RGB-D skeleton action benchmarks, while Kineti","claim_type":"dataset","confidence":0.95,"evidence_strength":"citation_context"},{"claim_text":"Methods for this benchmark are conditioned on 1 frame and generate the next 15. Results are listed in Table 2. Following the evaluation protocol of [4] and others, we calculate FVD [54] using the I3D network [8] by comparing 100 × 256 model samples against the 256 examples in the evaluation set. Kinetics-600 We additionally evaluate video prediction performance on the Kinetics-600 bench- mark [27, 9]. Kinetics-600 contains approximately 400 thousand training videos depicting 600 different activi","claim_type":"dataset","confidence":0.95,"evidence_strength":"citation_context"}],"why_cited":"Pith tracks The Kinetics Human Action Video Dataset because it crossed a citation-hub threshold. Current citing contexts most often use it as dataset evidence (14 contexts).","role_counts":[{"n":14,"context_role":"dataset"},{"n":8,"context_role":"background"}]},"error":null,"updated_at":"2026-05-24T04:55:02.248056+00:00"},"author_expand":{"job_type":"author_expand","status":"succeeded","result":{"authors_linked":[{"id":"ad4affaf-f7fb-4886-b0b3-12220f4ac79c","orcid":null,"display_name":"Will Kay"},{"id":"2599c0ce-f4f9-4761-881d-46343db0309f","orcid":null,"display_name":"Joao Carreira"},{"id":"5edadd6e-83fa-4aca-aa47-0fb277e2577e","orcid":null,"display_name":"Karen Simonyan"},{"id":"af7bf6ee-abef-4186-b3f5-ffe6646abd2c","orcid":null,"display_name":"Brian Zhang"},{"id":"9bd3eb65-88aa-41b5-b973-2a595517f8c3","orcid":null,"display_name":"Chloe Hillier"},{"id":"71be09c6-c7a7-4404-8e6e-af33e3e2f82f","orcid":null,"display_name":"Sudheendra Vijayanarasimhan"}]},"error":null,"updated_at":"2026-05-24T04:55:02.242190+00:00"},"context_extract":{"job_type":"context_extract","status":"succeeded","result":{"enqueued_papers":25},"error":null,"updated_at":"2026-05-14T12:00:04.055653+00:00"},"graph_features":{"job_type":"graph_features","status":"succeeded","result":{"co_cited":[{"title":"An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale","work_id":"e96730e3-129b-4db6-b981-15ab7932e297","shared_citers":14},{"title":"UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild","work_id":"5dfb46e7-e952-409d-a3c7-ba7f20aebad6","shared_citers":14},{"title":"Decoupled Weight Decay Regularization","work_id":"07ef7360-d385-4033-83f7-8384a6325204","shared_citers":10},{"title":"DINOv2: Learning Robust Visual Features without Supervision","work_id":"26b304e5-b54a-4f26-be7e-83299eca52e4","shared_citers":8},{"title":"A short note about kinetics-600","work_id":"851b1623-6feb-441e-8849-b07f1753f22e","shared_citers":6},{"title":"V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning","work_id":"a9c28401-f16a-4933-89f0-788e2f94e52b","shared_citers":6},{"title":"Adam: A Method for Stochastic Optimization","work_id":"1910796d-9b52-4683-bf5c-de9632c1028b","shared_citers":5},{"title":"SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features","work_id":"50eec732-2d41-432f-9dcf-ac7fff235ea5","shared_citers":5},{"title":"Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis","work_id":"77fd5ac9-ae98-4846-9d83-e9c73c8f2a52","shared_citers":5},{"title":"Coca: Contrastive captioners are image-text foundation models","work_id":"5dd5bf10-d548-40ff-9b6c-6735129b27ee","shared_citers":4},{"title":"Flamingo: a Visual Language Model for Few-Shot Learning","work_id":"a110f764-38dc-41b2-a802-53744ecea1fc","shared_citers":4},{"title":"GPT-4 Technical Report","work_id":"b928e041-6991-4c08-8c81-0359e4097c7b","shared_citers":4},{"title":"Internvid: A large-scale video-text dataset for multimodal understanding and generation","work_id":"e0787102-2f18-46ad-9660-6bfe466e3bbf","shared_citers":4},{"title":"LLaVA-OneVision: Easy Visual Task Transfer","work_id":"f5f2452b-f2a9-49ac-b38d-c76e18cdfe49","shared_citers":4},{"title":"Qwen2.5-VL Technical Report","work_id":"69dffacb-bfe8-442d-be86-48624c60426f","shared_citers":4},{"title":"Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution","work_id":"8abcfe4f-e0fb-44b7-9123-448fac95f90a","shared_citers":4},{"title":"Qwen3-VL Technical Report","work_id":"1fe243aa-e3c0-4da6-b391-4cbcfc88d5c0","shared_citers":4},{"title":"Representation Learning with Contrastive Predictive Coding","work_id":"7b08a1d4-d565-424e-9c86-6ef244b7b90a","shared_citers":4},{"title":"Revisiting Feature Prediction for Learning Visual Representations from Video","work_id":"f7251dcf-5341-4915-bfe7-27812387b61a","shared_citers":4},{"title":"SGDR: Stochastic Gradient Descent with Warm Restarts","work_id":"ad476478-c5ea-495b-a454-168c504bbfcc","shared_citers":4},{"title":"Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets","work_id":"4f68eada-27e3-437a-a2fe-6e4ca524d0d3","shared_citers":4},{"title":"VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding","work_id":"38f52461-37fd-4266-bc46-9dea31be2824","shared_citers":4},{"title":"Wan: Open and Advanced Large-Scale Video Generative Models","work_id":"ad3ebc3b-4224-46c9-b61d-bcf135da0a7c","shared_citers":4},{"title":"World Models","work_id":"07227eee-8445-4c98-bce4-c6a6fd5ed907","shared_citers":4}],"time_series":[{"n":2,"year":2022},{"n":3,"year":2023},{"n":4,"year":2024},{"n":2,"year":2025},{"n":45,"year":2026}],"dependency_candidates":[]},"error":null,"updated_at":"2026-05-14T11:59:53.331760+00:00"},"identity_refresh":{"job_type":"identity_refresh","status":"succeeded","result":{"items":[{"title":"Qwen3 Technical Report","outcome":"unchanged","work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e","resolver":"local_arxiv","confidence":0.98,"old_work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e"}],"counts":{"fixed":0,"merged":0,"unchanged":1,"quarantined":0,"needs_external_resolution":0},"errors":[],"attempted":1},"error":null,"updated_at":"2026-05-14T11:59:55.780857+00:00"},"role_polarity":{"job_type":"role_polarity","status":"succeeded","result":{"title":"The Kinetics Human Action Video Dataset","claims":[{"claim_text":"We describe the DeepMind Kinetics human action video dataset. The dataset contains 400 human action classes, with at least 400 video clips for each action. Each clip lasts around 10s and is taken from a different YouTube video. The actions are human focussed and cover a broad range of classes including human-object interactions such as playing instruments, as well as human-human interactions such as shaking hands. We describe the statistics of the dataset, how it was collected, and give some baseline performance figures for neural network architectures trained and tested for human action class","claim_type":"abstract","evidence_strength":"source_metadata"},{"claim_text":"as FID [187] and CLIP Score [188], with evaluation conducted on datasets including COCO-30K [89] and ImageNet [90]. Video generation usually uses feature methods like I3D [189] , with evaluation conducted on Kinetics-400 [190]. (2) Consistency evaluationexamines the logical coherence of generated content under different con- ditions. This includes identity preservation tests on CelebA-HQ [191], action consistency tests on UCF- 101 [192], and object state change evaluation on the Something-Someth","claim_type":"dataset","confidence":0.95,"evidence_strength":"citation_context"},{"claim_text":"Evaluation Metrics.We evaluate the reconstructed videos at semantic, spatiotemporal, and pixel levels [20, 25]. For semantic-level evaluation, we computeN-way top- Kaccuracy to assess whether the generated videos seman- tically match the ground-truth (GT) clips, using a Video- MAE [92]-based classifier on 400 video classes from the Kinetics-400 dataset [42], following prior work [20, 25, 98]. For spatiotemporal-level evaluation, we use CLIP temporal consistency (CLIP-pcc) [77] and DINO [72] temp","claim_type":"dataset","confidence":0.95,"evidence_strength":"citation_context"},{"claim_text":"V AE [27], a 3D causal convolutional architecture employed in Open-Sora Plan 1.5 [ 28]; (2) Open-Sora V AE [55]; (3) CV-V AE [54]; (4) OD-V AE [8] used in Open-Sora Plan 1.2 [28]; (5) SVD-V AE [3], which operates without temporal compression; and (6) SD-V AE [34], a widely-adopted image V AE baseline. Datasets and Evaluation Metrics.We utilize the Kinetics- 400 dataset [21] for model training and validation. For eval- uation, we perform zero-shot testing on the Panda-70M [9] and WebVid-10M [2] d","claim_type":"dataset","confidence":0.95,"evidence_strength":"citation_context"},{"claim_text":"OrderSelf-Similarity), that learns distinct representations of STSSs at diverse orders and integrates them into holistic motion features. The proposed module is lightweight and can be easily integrated into existing video architectures, enhancing temporal modeling capabilities across various domains (Fig. 1b). We first evaluate our method on diverse action recognition benchmarks,i.e., Kinetics-400 [31], Something-Something V1 & V2 [21,54], Diving48 [43], and FineGym [64], demonstrating significa","claim_type":"dataset","confidence":0.95,"evidence_strength":"citation_context"},{"claim_text":"Thus, inference explicitly couples the two sides of our bridge: zb carries pose-anchored semantics recovered from the skeleton extraction process, while ˜t provides skeleton-compatible semantic targets, jointly reducing the skeleton-text semantic gap. 4 Experiments 4.1 Experimental Setup Datasets. We evaluate PoseBridge on NTU-RGB+D 60 [ 24], NTU-RGB+D 120 [20], PKU-MMD [18], and Kinetics-200/400 [12, 35]. NTU-RGB+D 60/120 and PKU-MMD are controlled RGB-D skeleton action benchmarks, while Kineti","claim_type":"dataset","confidence":0.95,"evidence_strength":"citation_context"},{"claim_text":"Methods for this benchmark are conditioned on 1 frame and generate the next 15. Results are listed in Table 2. Following the evaluation protocol of [4] and others, we calculate FVD [54] using the I3D network [8] by comparing 100 × 256 model samples against the 256 examples in the evaluation set. Kinetics-600 We additionally evaluate video prediction performance on the Kinetics-600 bench- mark [27, 9]. Kinetics-600 contains approximately 400 thousand training videos depicting 600 different activi","claim_type":"dataset","confidence":0.95,"evidence_strength":"citation_context"}],"why_cited":"Pith tracks The Kinetics Human Action Video Dataset because it crossed a citation-hub threshold. Current citing contexts most often use it as dataset evidence (14 contexts).","role_counts":[{"n":14,"context_role":"dataset"},{"n":8,"context_role":"background"}]},"error":null,"updated_at":"2026-05-24T04:55:02.251989+00:00"},"summary_claims":{"job_type":"summary_claims","status":"succeeded","result":{"title":"The Kinetics Human Action Video Dataset","claims":[{"claim_text":"We describe the DeepMind Kinetics human action video dataset. The dataset contains 400 human action classes, with at least 400 video clips for each action. Each clip lasts around 10s and is taken from a different YouTube video. The actions are human focussed and cover a broad range of classes including human-object interactions such as playing instruments, as well as human-human interactions such as shaking hands. We describe the statistics of the dataset, how it was collected, and give some baseline performance figures for neural network architectures trained and tested for human action class","claim_type":"abstract","evidence_strength":"source_metadata"}],"why_cited":"Pith tracks The Kinetics Human Action Video Dataset because it crossed a citation-hub threshold.","role_counts":[]},"error":null,"updated_at":"2026-05-14T11:59:50.777871+00:00"}},"summary":{"title":"The Kinetics Human Action Video Dataset","claims":[{"claim_text":"We describe the DeepMind Kinetics human action video dataset. The dataset contains 400 human action classes, with at least 400 video clips for each action. Each clip lasts around 10s and is taken from a different YouTube video. The actions are human focussed and cover a broad range of classes including human-object interactions such as playing instruments, as well as human-human interactions such as shaking hands. We describe the statistics of the dataset, how it was collected, and give some baseline performance figures for neural network architectures trained and tested for human action class","claim_type":"abstract","evidence_strength":"source_metadata"}],"why_cited":"Pith tracks The Kinetics Human Action Video Dataset because it crossed a citation-hub threshold.","role_counts":[]},"graph":{"co_cited":[{"title":"An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale","work_id":"e96730e3-129b-4db6-b981-15ab7932e297","shared_citers":14},{"title":"UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild","work_id":"5dfb46e7-e952-409d-a3c7-ba7f20aebad6","shared_citers":14},{"title":"Decoupled Weight Decay Regularization","work_id":"07ef7360-d385-4033-83f7-8384a6325204","shared_citers":10},{"title":"DINOv2: Learning Robust Visual Features without Supervision","work_id":"26b304e5-b54a-4f26-be7e-83299eca52e4","shared_citers":8},{"title":"A short note about kinetics-600","work_id":"851b1623-6feb-441e-8849-b07f1753f22e","shared_citers":6},{"title":"V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning","work_id":"a9c28401-f16a-4933-89f0-788e2f94e52b","shared_citers":6},{"title":"Adam: A Method for Stochastic Optimization","work_id":"1910796d-9b52-4683-bf5c-de9632c1028b","shared_citers":5},{"title":"SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features","work_id":"50eec732-2d41-432f-9dcf-ac7fff235ea5","shared_citers":5},{"title":"Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis","work_id":"77fd5ac9-ae98-4846-9d83-e9c73c8f2a52","shared_citers":5},{"title":"Coca: Contrastive captioners are image-text foundation models","work_id":"5dd5bf10-d548-40ff-9b6c-6735129b27ee","shared_citers":4},{"title":"Flamingo: a Visual Language Model for Few-Shot Learning","work_id":"a110f764-38dc-41b2-a802-53744ecea1fc","shared_citers":4},{"title":"GPT-4 Technical Report","work_id":"b928e041-6991-4c08-8c81-0359e4097c7b","shared_citers":4},{"title":"Internvid: A large-scale video-text dataset for multimodal understanding and generation","work_id":"e0787102-2f18-46ad-9660-6bfe466e3bbf","shared_citers":4},{"title":"LLaVA-OneVision: Easy Visual Task Transfer","work_id":"f5f2452b-f2a9-49ac-b38d-c76e18cdfe49","shared_citers":4},{"title":"Qwen2.5-VL Technical Report","work_id":"69dffacb-bfe8-442d-be86-48624c60426f","shared_citers":4},{"title":"Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution","work_id":"8abcfe4f-e0fb-44b7-9123-448fac95f90a","shared_citers":4},{"title":"Qwen3-VL Technical Report","work_id":"1fe243aa-e3c0-4da6-b391-4cbcfc88d5c0","shared_citers":4},{"title":"Representation Learning with Contrastive Predictive Coding","work_id":"7b08a1d4-d565-424e-9c86-6ef244b7b90a","shared_citers":4},{"title":"Revisiting Feature Prediction for Learning Visual Representations from Video","work_id":"f7251dcf-5341-4915-bfe7-27812387b61a","shared_citers":4},{"title":"SGDR: Stochastic Gradient Descent with Warm Restarts","work_id":"ad476478-c5ea-495b-a454-168c504bbfcc","shared_citers":4},{"title":"Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets","work_id":"4f68eada-27e3-437a-a2fe-6e4ca524d0d3","shared_citers":4},{"title":"VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding","work_id":"38f52461-37fd-4266-bc46-9dea31be2824","shared_citers":4},{"title":"Wan: Open and Advanced Large-Scale Video Generative Models","work_id":"ad3ebc3b-4224-46c9-b61d-bcf135da0a7c","shared_citers":4},{"title":"World Models","work_id":"07227eee-8445-4c98-bce4-c6a6fd5ed907","shared_citers":4}],"time_series":[{"n":2,"year":2022},{"n":3,"year":2023},{"n":4,"year":2024},{"n":2,"year":2025},{"n":45,"year":2026}],"dependency_candidates":[]},"authors":[{"id":"af7bf6ee-abef-4186-b3f5-ffe6646abd2c","orcid":null,"display_name":"Brian Zhang","source":"manual","import_confidence":0.72},{"id":"9bd3eb65-88aa-41b5-b973-2a595517f8c3","orcid":null,"display_name":"Chloe Hillier","source":"manual","import_confidence":0.72},{"id":"2599c0ce-f4f9-4761-881d-46343db0309f","orcid":null,"display_name":"Joao Carreira","source":"manual","import_confidence":0.72},{"id":"5edadd6e-83fa-4aca-aa47-0fb277e2577e","orcid":null,"display_name":"Karen Simonyan","source":"manual","import_confidence":0.72},{"id":"71be09c6-c7a7-4404-8e6e-af33e3e2f82f","orcid":null,"display_name":"Sudheendra Vijayanarasimhan","source":"manual","import_confidence":0.72},{"id":"ad4affaf-f7fb-4886-b0b3-12220f4ac79c","orcid":null,"display_name":"Will Kay","source":"manual","import_confidence":0.72}]}}