How to build a dataset

  1. Define an action list by combining labels from previous datasets and adding new categories depending on the use case.
  2. Obtain videos from various sources, such as movies or streaming media.
  3. Provide temporal annotations manually.
  4. Clean up the dataset by de-duplication and filtering.
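Step 4 can be sketched as a small cleaning pipeline. The record fields, the exact-hash de-duplication, and the duration threshold below are illustrative assumptions, not part of any specific dataset's tooling:

```python
from dataclasses import dataclass
import hashlib

@dataclass
class Clip:
    path: str        # location of the video file
    label: str       # action category from the defined action list
    duration: float  # clip length in seconds
    content: bytes   # raw bytes (stand-in for decoded frames)

def clean(clips, min_duration=1.0):
    """De-duplicate clips by content hash, then filter out clips that are too short."""
    seen = set()
    cleaned = []
    for clip in clips:
        digest = hashlib.sha256(clip.content).hexdigest()
        if digest in seen:
            continue  # exact duplicate: drop it
        seen.add(digest)
        if clip.duration >= min_duration:
            cleaned.append(clip)
    return cleaned
```

Real pipelines typically go further, using perceptual (near-duplicate) hashing of sampled frames rather than exact byte hashes, so that re-encoded copies of the same clip are also caught.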

Datasets list

  • HMDB51: It was collected mainly from movies, with additional clips from public databases such as YouTube and Google Videos. The dataset contains 6849 clips divided into 51 action categories.
  • UCF101: It contains 13320 videos spanning 101 categories.
  • Sports1M: The first large-scale video action dataset; it contains more than 1 million videos across 487 sports classes.
  • ActivityNet: It covers 200 daily human activities, with 10024 training, 4926 validation, and 5044 testing videos.
  • YouTube8M: It contains 8 million YouTube videos (500K hours of video in total) annotated with 3862 action classes. Some clips are annotated with multiple labels by a YouTube video annotation system.
  • Charades: It contains 9848 videos with an average length of 30 seconds. This dataset includes 157 multi-label daily indoor activities, performed by 267 different people.
  • Kinetics Family: The most widely adopted benchmark family. Kinetics400 contains 240k training and 20k validation videos covering 400 human action categories, with each video trimmed to 10 seconds.
  • 20BN-Something-Something: A popular benchmark that consists of 174 action classes that describe humans performing basic actions with everyday objects.
  • AVA: It was the first large-scale spatiotemporal action detection dataset. It contains 430 15-minute video clips annotated with 80 atomic action labels. It has recently been expanded to AVA-Kinetics.
  • Moments in Time: It is a large-scale dataset designed for event understanding, containing one million 3-second video clips, annotated with 339 classes. This dataset includes people, animals, objects, and natural phenomena.
  • HACS: A large-scale dataset for recognition and localization of human actions, containing 1.55M 2-second clip annotations on 504K videos.
  • HVU: This dataset contains 572K videos with 3142 labels for multi-label, multi-task video understanding. The labels are divided into six task categories: scene, object, action, event, attribute, and concept.
  • AViD: A dataset for anonymized action recognition, in which face identities are removed.
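The figures above can be collected into a small lookup table for quick comparison. The counts below are copied from the list as stated (ActivityNet and Kinetics totals are sums of their splits; a missing count is left as None), and the helper function is a hypothetical convenience, not an official API:

```python
# Dataset name -> (approximate number of videos/clips, number of classes),
# taken from the dataset list above.
DATASETS = {
    "HMDB51": (6_849, 51),
    "UCF101": (13_320, 101),
    "Sports1M": (1_000_000, 487),
    "ActivityNet": (19_994, 200),    # 10024 + 4926 + 5044 train/val/test videos
    "YouTube8M": (8_000_000, 3_862),
    "Charades": (9_848, 157),
    "Kinetics-400": (260_000, 400),  # 240k training + 20k validation
    "Something-Something": (None, 174),
    "AVA": (430, 80),
    "Moments in Time": (1_000_000, 339),
    "HVU": (572_000, 3_142),
}

def largest_by_classes(datasets):
    """Return the name of the dataset with the most action classes."""
    return max(datasets, key=lambda name: datasets[name][1])
```

A table like this makes it easy to pick a benchmark by scale or label-space size when planning experiments.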


[1] Y. Zhu et al., ‘A Comprehensive Study of Deep Video Action Recognition’, arXiv [cs.CV]. 2020.