Abstract: Animal action recognition matters for fields such as animal behavior science and wildlife protection and management. Deep learning methods for action recognition, however, require large datasets for training. Animal Kingdom, a large dataset covering many species across varied scenes and weather conditions, was proposed to address this need. The dataset defines three tasks: action recognition, video grounding, and pose estimation.

Weaknesses of Previous Datasets

  • The datasets are small, and they offer too few types of annotation
  • Animal categories are not fine-grained enough, and most datasets target a single task
  • The different growth stages of animals are not sufficiently covered
  • Environment types are unevenly distributed, with most footage concentrated in a few particular environments
  • Because each dataset serves only one task, models cannot exploit complementary supervision across tasks to improve performance
Previous Datasets

Animal Kingdom

The dataset contains video clips of 850 species from 6 major animal classes in a wide range of scenarios. Environment, weather, and camera viewpoint are all diversified, and the clips carry fine-grained multi-label annotations.

Animal class

Action Recognition

Task Introduction: The action recognition task takes a video as input and outputs the categories of the behaviors it shows.

Dataset content: 50 hours of animal video clips collected from YouTube, with an average length of 6 seconds and a range of 1 to 117 seconds. The dataset covers 850 species and annotates 140 fine-grained behaviors, from brief ones (such as a long jump) to prolonged ones (such as sexual behavior), spanning life events, daily events, and social events.
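The total duration and average clip length quoted above imply a rough clip count. The sketch below only does that arithmetic; the result is an estimate derived from the two stated figures, not a number taken from this summary, and the function name is a hypothetical helper.

```python
# Rough clip-count estimate from the figures quoted above.
TOTAL_HOURS = 50          # total video duration stated in the text
MEAN_CLIP_SECONDS = 6     # average clip length stated in the text


def estimate_clip_count(total_hours: float, mean_clip_seconds: float) -> int:
    """Return total duration divided by mean clip length, rounded to an int."""
    return round(total_hours * 3600 / mean_clip_seconds)


approx_clips = estimate_clip_count(TOTAL_HOURS, MEAN_CLIP_SECONDS)
print(approx_clips)  # 30000
```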



  • Intra-class and inter-class variation is large: the same behavior looks different across species, and the same animal behaves differently in different scenes and environments
  • Animal behavior in nature is diverse, and a single clip usually shows more than one behavior
  • Some behaviors, such as eating, occur far more often than others, which gives the data a long-tailed distribution. The action classes are therefore divided into three groups:
    • Head behaviors: 17 action classes, each with over 500 samples, such as perception and eating
    • Middle behaviors: 29 action classes, each with 100-500 samples, such as climbing and digging
    • Tail behaviors: 94 action classes, each with fewer than 100 samples, such as conditioning and attacking
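The head/middle/tail grouping above can be sketched as a simple bucketing by per-class sample count. This is a minimal illustration using the thresholds stated in the list; the function names and the toy data are hypothetical and are not the dataset's actual tooling or counts.

```python
from collections import Counter


def frequency_bucket(sample_count: int) -> str:
    """Map a per-class sample count to head / middle / tail."""
    if sample_count > 500:       # "over 500 samples"
        return "head"
    if sample_count >= 100:      # "100-500 samples"
        return "middle"
    return "tail"                # "fewer than 100 samples"


def split_by_frequency(clip_labels):
    """clip_labels: one list of action names per clip (multi-label)."""
    counts = Counter(label for labels in clip_labels for label in labels)
    buckets = {"head": [], "middle": [], "tail": []}
    for action, n in counts.items():
        buckets[frequency_bucket(n)].append(action)
    return {name: sorted(actions) for name, actions in buckets.items()}


# Toy data (invented for illustration): eating appears in 750 clips,
# climbing in 150, attacking in 20.
toy_clips = (
    [["eating"]] * 600
    + [["climbing", "eating"]] * 150
    + [["attacking"]] * 20
)
print(split_by_frequency(toy_clips))
# {'head': ['eating'], 'middle': ['climbing'], 'tail': ['attacking']}
```

Reporting accuracy separately for each bucket makes it visible when a model does well on frequent behaviors but poorly on rare ones.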
Results of action recognition

Video Grounding

Task Introduction: Given a sentence describing a scene and a behavior, the model predicts the start and end times of the matching segment in a long video, in effect acting as an automatic video-trimming engine.
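A predicted (start, end) segment is commonly compared with the annotated one by temporal intersection-over-union. The sketch below shows that computation; whether this exact measure is the one used for the results reported on this dataset is an assumption, and `temporal_iou` is a hypothetical helper name.

```python
def temporal_iou(pred, gt):
    """Temporal IoU of two (start, end) segments given in seconds."""
    pred_start, pred_end = pred
    gt_start, gt_end = gt
    # Overlap is zero when the segments do not intersect.
    intersection = max(0.0, min(pred_end, gt_end) - max(pred_start, gt_start))
    union = (pred_end - pred_start) + (gt_end - gt_start) - intersection
    return intersection / union if union > 0 else 0.0


# Prediction 2-8 s against annotation 4-10 s: 4 s overlap over an 8 s union.
print(temporal_iou((2.0, 8.0), (4.0, 10.0)))  # 0.5
```

A prediction is then typically counted as correct when its IoU with the annotation exceeds a chosen threshold.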

Dataset content: 50 hours of video, comprising 4301 long videos and 18,744 annotated sentences. Each video usually has 3-5 sentences. Difficulty: the model must identify animals and behaviors against complex backgrounds and associate them with the text sentences.

Results of video grounding

Pose Estimation

Task Introduction: The model takes an image of an animal as input and predicts the positions of its joints.


Dataset content: 33,099 images with annotated animal joints, covering 5 main animal classes. A total of 23 keypoints are defined: 1 head, 2 eyes, 4 mouth, 2 shoulders, 2 elbows, 2 wrists, 1 mid-torso, 2 hips, 2 knees, 2 ankles, and 3 tail points. Difficulty: anatomy differs between animals, so the joint definitions cannot be fully unified across them.
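The keypoint breakdown above can be checked and used as follows. This is a sketch under stated assumptions: the group names, the use of `None` for a joint an animal lacks or that is not annotated, and the masked error function are all illustrative choices, not the dataset's actual annotation format or evaluation code.

```python
import math

# Keypoint groups and counts as listed in the text (names are hypothetical).
KEYPOINT_GROUPS = {
    "head": 1, "eye": 2, "mouth": 4, "shoulder": 2, "elbow": 2, "wrist": 2,
    "torso_mid": 1, "hip": 2, "knee": 2, "ankle": 2, "tail": 3,
}
NUM_KEYPOINTS = sum(KEYPOINT_GROUPS.values())  # 23


def masked_mean_error(pred, target):
    """Mean Euclidean error over annotated joints only.

    pred:   list of NUM_KEYPOINTS (x, y) predictions.
    target: list of NUM_KEYPOINTS entries, each (x, y) or None when the
            joint is absent for this animal or not annotated.
    """
    errors = [math.dist(p, t) for p, t in zip(pred, target) if t is not None]
    return sum(errors) / len(errors) if errors else 0.0


# Toy example: only two of the 23 joints are annotated.
pred = [(0.0, 0.0)] * NUM_KEYPOINTS
target = [None] * NUM_KEYPOINTS
target[0] = (3.0, 4.0)   # error 5.0
target[1] = (0.0, 0.0)   # error 0.0
print(NUM_KEYPOINTS, masked_mean_error(pred, target))  # 23 2.5
```

Skipping missing joints in this way is one simple option for handling animals whose anatomy does not include every keypoint.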

Results of pose estimation