Abstract for this paper: Previously, most action recognition methods were based on hand-crafted features, and even the models that used deep learning relied only on 2D ConvNets. This paper explores 3D convolution kernels of various temporal depths and then proposes the C3D model, which performs well after being trained on a large-scale dataset. The authors use C3D as a feature extractor across 4 tasks and 6 benchmarks, achieving competitive results on all of them.

2D/3D convolution operations
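
As a supplement to the figure, here is a minimal PyTorch sketch (not code from the paper) contrasting the two operations: a 2D convolution applied to a frame produces an image-like output and discards temporal ordering, while a 3D convolution keeps a temporal axis in its output.

```python
import torch
import torch.nn as nn

clip = torch.randn(1, 3, 16, 112, 112)  # batch, channels, frames, height, width

# 2D convolution on a single frame: the output has no temporal axis
conv2d = nn.Conv2d(3, 64, kernel_size=3, padding=1)
frame_out = conv2d(clip[:, :, 0])        # -> (1, 64, 112, 112)

# 3D convolution on the whole clip: the output is still a video volume,
# so motion information across frames can be modeled
conv3d = nn.Conv3d(3, 64, kernel_size=(3, 3, 3), padding=1)
clip_out = conv3d(clip)                  # -> (1, 64, 16, 112, 112)
print(frame_out.shape, clip_out.shape)
```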

Methods

Base Information

Requirements for an effective video descriptor

  • Generic: it can represent different types of videos well.
  • Compact: a compact descriptor makes processing, storing, and retrieving millions of videos more scalable.
  • Efficient: it can be computed quickly, since real-world systems need to process thousands of videos every minute.
  • Simple: it needs to be easy to implement.

Generic setting

  • Input: video clips of size c × l × h × w.
    • c is the number of channels, l is the number of frames, and h and w are the height and width of each frame.
    • During the architecture exploration, the input size is 3 × 16 × 128 × 171, roughly half the resolution of UCF-101 frames.
  • Network Structure: 5 convolution layers, 5 pooling layers, 2 fully-connected layers, and a softmax loss layer (a sketch in code follows this list).
    • The numbers of filters for the 5 convolution layers are 64, 128, 256, 256, and 256, and each filter has size d × 3 × 3, where d is the temporal depth under exploration. Appropriate padding and stride 1 are used.
    • The first pooling layer uses kernels of size 1 × 2 × 2 so that temporal information is not collapsed too early; the other pooling layers are max pooling with kernel size 2 × 2 × 2.
    • The two fully connected layers have 2048 outputs each.
  • Training Strategy: The learning rate is initialized to 0.003 and divided by 10 after every 4 epochs; training is stopped after 16 epochs.
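
The following is a minimal PyTorch sketch (not code from the paper) of this exploration architecture; the per-layer temporal depths are a parameter, so the same function covers the homogeneous and varying settings examined next. Dropout and other training details are omitted.

```python
import torch.nn as nn

def make_net(depths=(3, 3, 3, 3, 3), num_classes=101):
    """Five conv+pool stages followed by two 2048-unit fc layers."""
    chans = [3, 64, 128, 256, 256, 256]
    layers = []
    for i, d in enumerate(depths):
        # d x 3 x 3 kernel, stride 1, "same" padding (d is odd here)
        layers.append(nn.Conv3d(chans[i], chans[i + 1],
                                kernel_size=(d, 3, 3),
                                padding=(d // 2, 1, 1)))
        layers.append(nn.ReLU(inplace=True))
        # pool1 keeps the temporal axis (1x2x2); later pools are 2x2x2
        k = (1, 2, 2) if i == 0 else (2, 2, 2)
        layers.append(nn.MaxPool3d(kernel_size=k, stride=k))
    return nn.Sequential(*layers, nn.Flatten(),
                         nn.LazyLinear(2048), nn.ReLU(inplace=True),
                         nn.Linear(2048, 2048), nn.ReLU(inplace=True),
                         nn.Linear(2048, num_classes))  # softmax applied in the loss

depth3 = make_net((3, 3, 3, 3, 3))      # homogeneous depth-3
increasing = make_net((3, 3, 5, 5, 7))  # varying temporal depth
```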

Exploration of the kernel depths

Depth setting: the authors use two strategies to set the temporal depth of the kernels

  • Homogeneous temporal depth: 1, 3, 5, or 7 for all layers
  • Varying temporal depth
    • Increasing: 3-3-5-5-7
    • Decreasing: 7-5-5-3-3

The authors note that these variants differ only slightly in parameter count (the differences at the convolution layers are tiny compared to the fully connected layers), so the number of parameters has no significant influence on the results.

Results: Depth-3 performs best among the kernels tried, while depth-1 (equivalent to applying 2D convolutions frame by frame) is significantly worse than the others. More details are shown in the following figure.

Depth searching results

Thus, 3 × 3 × 3 kernels are the best option for 3D ConvNets.

C3D

Based on this conclusion, the paper proposes a network named C3D (a code sketch follows the list below).

C3D model structures
  • C3D has 8 convolution layers, 5 pooling layers, 2 fully connected layers, and a softmax output layer.
  • All convolution filters are 3 × 3 × 3 with stride 1 × 1 × 1.
  • The first pooling layer uses a 1 × 2 × 2 kernel with stride 1 × 2 × 2; the others use 2 × 2 × 2 kernels with stride 2 × 2 × 2.
  • Each fully connected layer has 4096 output units.
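
A PyTorch sketch of this architecture under the specification above. The channel widths (64-128-256-512-512) come from the paper's architecture figure; the spatial padding on pool5 follows common C3D ports so that fc6's input is 8192-dimensional, and is an implementation assumption rather than something stated in these notes.

```python
import torch.nn as nn

def conv(cin, cout):
    return nn.Sequential(nn.Conv3d(cin, cout, kernel_size=3, stride=1, padding=1),
                         nn.ReLU(inplace=True))

c3d = nn.Sequential(
    conv(3, 64),    nn.MaxPool3d((1, 2, 2), stride=(1, 2, 2)),  # conv1, pool1
    conv(64, 128),  nn.MaxPool3d(2, stride=2),                  # conv2, pool2
    conv(128, 256), conv(256, 256), nn.MaxPool3d(2, stride=2),  # conv3a/b, pool3
    conv(256, 512), conv(512, 512), nn.MaxPool3d(2, stride=2),  # conv4a/b, pool4
    conv(512, 512), conv(512, 512),                             # conv5a/b
    nn.MaxPool3d(2, stride=2, padding=(0, 1, 1)),               # pool5 (spatial pad: common-port assumption)
    nn.Flatten(),
    nn.Linear(8192, 4096), nn.ReLU(inplace=True),               # fc6
    nn.Linear(4096, 4096), nn.ReLU(inplace=True),               # fc7
    nn.Linear(4096, 487),                                       # 487 Sports-1M classes
)
```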

Experiments

Training

Dataset: C3D is trained on the Sports-1M dataset, the largest video classification benchmark at that time. Each video belongs to one of 487 sports categories. During data preparation, 2-second-long clips are extracted randomly from the training videos and resized to a frame size of 128 × 171. The input volumes are then randomly cropped into 3 × 16 × 112 × 112 for jittering.
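
A sketch of this preparation step. The input tensor format and the fps handling are assumptions made for illustration; the 128 × 171 resize and the 3 × 16 × 112 × 112 random crop are the sizes quoted above.

```python
import random
import torch
import torch.nn.functional as F

def sample_clip(video, fps, clip_seconds=2, out_frames=16):
    """video: float tensor (3, T, H, W); returns a jittered 3x16x112x112 crop."""
    c, t, h, w = video.shape
    n = int(clip_seconds * fps)
    start = random.randint(0, max(t - n, 0))       # random 2-second window
    clip = video[:, start:start + n]
    idx = torch.linspace(0, clip.shape[1] - 1, out_frames).long()
    clip = clip[:, idx]                            # keep 16 frames
    clip = F.interpolate(clip, size=(128, 171))    # resize frames to 128 x 171
    i = random.randint(0, 128 - 112)               # random 112 x 112 crop (jitter)
    j = random.randint(0, 171 - 112)
    return clip[:, :, i:i + 112, j:j + 112]
```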

Training strategy: SGD with a mini-batch size of 30. The initial learning rate is 0.003 and is divided by 2 every 150K iterations. Training stops at 1.9M iterations.
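
A sketch of this schedule in PyTorch, reusing the `c3d` model from the earlier sketch; the data `loader` is assumed to yield batches of 30 clips with labels.

```python
import torch

opt = torch.optim.SGD(c3d.parameters(), lr=0.003)   # `c3d` from the sketch above
sched = torch.optim.lr_scheduler.StepLR(opt, step_size=150_000, gamma=0.5)
loss_fn = torch.nn.CrossEntropyLoss()

for step, (clips, labels) in enumerate(loader):     # `loader` is assumed
    opt.zero_grad()
    loss = loss_fn(c3d(clips), labels)              # softmax loss
    loss.backward()
    opt.step()
    sched.step()                                    # halves lr every 150k iterations
    if step + 1 == 1_900_000:                       # stop at 1.9M iterations
        break
```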

Sports-1M classification result

Various application scenarios

C3D can be used as a feature extractor: fc6 activations are extracted for each 16-frame clip, averaged over the video, and then L2-normalized. Some application scenarios are shown below.
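
A sketch of this recipe, assuming the `nn.Sequential` layout of the earlier C3D sketch (the slice index is specific to that layout):

```python
import torch

fc6 = c3d[:15]                       # layers up to and including fc6 in the earlier sketch

@torch.no_grad()
def video_feature(clips):            # clips: (num_clips, 3, 16, 112, 112)
    feats = fc6(clips)               # (num_clips, 4096) fc6 activations
    feat = feats.mean(dim=0)         # average over the video's clips
    return feat / feat.norm()        # L2 normalization
```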

Action Recognition

In this scenario, the authors use C3D features as input to train a multi-class linear SVM on UCF-101. The results compared with other baselines are shown below.
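
A sketch of this evaluation with scikit-learn; the random arrays below merely stand in for real C3D features and UCF-101 labels.

```python
import numpy as np
from sklearn.svm import LinearSVC

# placeholder data standing in for (videos x 4096) C3D features and labels
rng = np.random.default_rng(0)
X_train, y_train = rng.normal(size=(200, 4096)), rng.integers(0, 101, 200)
X_test, y_test = rng.normal(size=(50, 4096)), rng.integers(0, 101, 50)

clf = LinearSVC()                 # multi-class linear SVM (one-vs-rest)
clf.fit(X_train, y_train)
print("accuracy:", clf.score(X_test, y_test))
```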

Action recognition results on UCF101

In this table, Imagenet is the popular deep image feature extractor at that time, and C3D (3 nets) means the network

  • Trained on I380K
  • Trained on Sports-1M
  • Trained on I380K and fine-tuned on Sports-1M

Action Similarity Labeling

This task focuses on predicting whether two videos contain the same action, rather than the actual action label, using the ASLAN dataset. The authors use the prob, fc7, fc6, and pool5 outputs as features for each video clip, compute 12 different distances per feature type, and train a linear SVM on the resulting 48-dimensional vector for each video pair.
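
As one concrete example of such a distance, here is a cosine similarity between two clips' features (a sketch only; the actual 12 measures follow the ASLAN protocol and are not reproduced here).

```python
import torch

def cosine_similarity(f1, f2):
    """Cosine similarity between two feature vectors, e.g. fc6 activations."""
    f1, f2 = f1 / f1.norm(), f2 / f2.norm()
    return float((f1 * f2).sum())
```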

Action similarity labeling result

Scene and Object Recognition

The authors slide a window of 16 frames over all videos to extract C3D features, then train and test a linear SVM on these features and report the recognition accuracy.
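
A sketch of the window extraction; the 8-frame stride here is an assumption made for illustration.

```python
import torch

def sliding_clips(video, length=16, stride=8):
    """video: (3, T, H, W) -> (num_windows, 3, 16, H, W) stack of clips."""
    starts = range(0, video.shape[1] - length + 1, stride)
    return torch.stack([video[:, s:s + length] for s in starts])
```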

Scene recognition accuracy

Runtime Analysis

Runtime analysis on UCF101

Reference

[1] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri, “Learning Spatiotemporal Features with 3D Convolutional Networks,” in 2015 IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, Dec. 2015. doi: 10.1109/iccv.2015.510.