Abstract for this paper: Video-based tasks such as action recognition rely on long-range temporal information. Some methods apply an LSTM or another RNN after feature extraction to exploit this information, but this adds computational cost. Furthermore, a dominant end-to-end CNN model for video-based tasks is still lacking. This paper proposes the Temporal Segment Network (TSN), an effective and efficient video-level framework for learning video representations that capture long-range temporal structure. It then explores a series of good practices for learning ConvNet models from limited training samples.


Temporal Segment Networks

Structure: Given a video $V$, the model divides it into $K$ segments $\{S_1, S_2, \dots, S_K\}$ of equal duration and randomly samples one snippet $T_k$ from each corresponding segment $S_k$. $F(T_k; W)$ is the CNN with parameters $W$, which accepts $T_k$ as input and produces class scores. The segmental consensus function $G$ combines the output scores from all snippets and obtains a consensus on the class hypothesis. The authors test several functions, such as maximum, averaging, and weighted averaging, and finally use averaging as $G$. Finally, the prediction function $H$, normally the softmax, predicts the action class for the whole video. The overall framework, Temporal Segment Networks, is formed as:

$$\text{TSN}(T_1, T_2, \dots, T_K) = H\big(G(F(T_1; W), F(T_2; W), \dots, F(T_K; W))\big)$$
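The sampling and consensus steps can be illustrated with a minimal NumPy sketch. It assumes the video is addressed by frame index and that the per-snippet class scores are already available as an array; the function names `sample_snippet_indices` and `tsn_predict` are hypothetical and not taken from the paper's code.

```python
import numpy as np

def sample_snippet_indices(num_frames, num_segments=3, rng=None):
    """Pick one random frame index from each of K equal-duration segments."""
    if num_frames < num_segments:
        raise ValueError("need at least one frame per segment")
    rng = np.random.default_rng() if rng is None else rng
    # K+1 evenly spaced segment boundaries over [0, num_frames].
    edges = np.linspace(0, num_frames, num_segments + 1).astype(int)
    # rng.integers samples from the half-open interval [lo, hi).
    return [int(rng.integers(lo, hi)) for lo, hi in zip(edges[:-1], edges[1:])]

def softmax(x):
    """Numerically stable softmax over a 1-D score vector."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

def tsn_predict(snippet_scores):
    """snippet_scores: (K, C) array holding the class scores F(T_k; W)."""
    consensus = np.mean(snippet_scores, axis=0)  # G: average consensus
    return softmax(consensus)                    # H: softmax prediction
```

For a 90-frame video and three segments, the sampled indices fall in [0, 30), [30, 60), and [60, 90), so the prediction always draws on the beginning, middle, and end of the video.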

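To train this model, the loss function is formed as:

$$L(y, G) = -\sum_{i=1}^{C} y_i \left( G_i - \log \sum_{j=1}^{C} \exp G_j \right)$$

where $G_i = g(F_i(T_1), \dots, F_i(T_K))$ is the consensus score for class $i$ over the sampled snippets $T_k$, $y_i$ is the groundtruth label for class $i$, and $C$ is the number of action classes. This loss function can be used to derive the corresponding gradient with respect to the CNN parameters $W$ as follows:

$$\frac{\partial L(y, G)}{\partial W} = \frac{\partial L}{\partial G} \sum_{k=1}^{K} \frac{\partial G}{\partial F(T_k)} \frac{\partial F(T_k)}{\partial W}$$

where $K$ is the number of segments, set to 3 in the experiments. This formula reveals that the CNN's parameters are influenced by all video segments rather than by a single short one.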
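As a sanity check on the loss and its gradient, the sketch below computes both for the averaging consensus, assuming a one-hot label vector. The names `tsn_loss` and `grad_wrt_snippet_scores` are illustrative only. With averaging, the derivative of the consensus with respect to each snippet's scores is 1/K, so every snippet receives an equal share of the gradient.

```python
import numpy as np

def tsn_loss(y, G):
    """L(y, G) = -sum_i y_i * (G_i - log sum_j exp G_j)."""
    m = np.max(G)
    log_sum_exp = m + np.log(np.sum(np.exp(G - m)))  # stable log-sum-exp
    return float(-np.sum(y * (G - log_sum_exp)))

def grad_wrt_snippet_scores(y, snippet_scores):
    """Gradient of L w.r.t. each F(T_k) under average consensus.

    Assumes y is one-hot, so dL/dG = softmax(G) - y.
    """
    K = snippet_scores.shape[0]
    G = snippet_scores.mean(axis=0)
    e = np.exp(G - G.max())
    dL_dG = e / e.sum() - y
    # dG/dF(T_k) = 1/K for averaging: each snippet gets the same gradient.
    return np.tile(dL_dG / K, (K, 1))
```

Because every row of the returned gradient is identical, the update to the shared parameters depends on all K snippets at once, which is the point the formula makes.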

Useful Practices

To train TSN well on limited datasets, the authors explore several useful practices that improve performance.

Network Architectures

Several works have shown that deeper structures improve object recognition performance, so this paper employs Inception with Batch Normalization (BN-Inception) as the base of the two-stream network. The spatial stream operates on a single RGB image, and the temporal stream takes a stack of consecutive optical flow fields as input.

Network Inputs

This paper explores additional input modalities, beyond the standard two-stream network inputs, to enhance the power of the models.

  • RGB Difference. A single RGB image contains only spatial information, but temporal context is still important for video-based tasks. Thus, this paper adds stacked RGB differences between consecutive frames as another input modality.
  • Warped optical flow fields. Camera or background motion prevents the optical flow from concentrating on the actors. Thus, the authors extract warped optical flow by first estimating the homography matrix and then compensating for the camera motion.
Examples of four types of input modality: RGB images, RGB difference, optical flow fields (x,y directions), and warped optical flow fields (x,y directions)
Result of different input modalities
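The stacked RGB difference input can be sketched in a few lines of NumPy. This is an illustration under assumptions: the frames are given as an `(N, H, W, 3)` array, and the number of frames to stack is left to the caller. The function name `stacked_rgb_difference` is hypothetical.

```python
import numpy as np

def stacked_rgb_difference(frames):
    """frames: (N, H, W, 3) consecutive RGB frames.

    Returns an (H, W, 3 * (N - 1)) array of stacked frame differences.
    """
    frames = frames.astype(np.float32)   # avoid uint8 wrap-around on subtraction
    diffs = frames[1:] - frames[:-1]     # (N - 1, H, W, 3)
    # Concatenate the per-pair differences along the channel axis.
    return np.concatenate(list(diffs), axis=-1)
```

Static background pixels cancel out in the difference, so the stacked input mostly carries appearance change between neighbouring frames.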

Network Training

Because the datasets for action recognition are relatively small for training large CNNs, which makes over-fitting likely, this paper proposes several strategies for training the TSN.

  • Cross Modality Pre-training

    • Spatial Networks: this stream takes RGB images as input, so it can be initialized with a model pre-trained on ImageNet.
    • Temporal Networks: this stream can be initialized from the RGB models with the following strategy: 1) discretize the optical flow fields into the interval from 0 to 255 by a linear transformation, so that their range matches that of RGB images; 2) modify the weights of the first convolution layer by averaging them across the RGB channels and replicating this average for each input channel of the temporal network.
  • Regularization Techniques

    • Freeze the mean and variance parameters of all Batch Normalization layers except the first one after initialization, and re-estimate the mean and variance of the first one.
    • Add an extra dropout layer after the global pooling layer to reduce over-fitting. The dropout ratio is set to 0.8 for the spatial stream and 0.7 for the temporal stream.
  • Data Augmentation

    Besides the random cropping and horizontal flipping already used in two-stream ConvNets, this paper exploits two new data augmentation techniques.

    • Corner cropping: Crop regions only from the corners or the center of the image to avoid implicitly focusing on the center area.
    • Scale jittering: Fix the size of the input images as $256 \times 340$, and randomly select the width and height of the cropped region from $\{256, 224, 192, 168\}$. Finally, the cropped region is resized to $224 \times 224$.
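The cross-modality initialization of the temporal stream, described under Cross Modality Pre-training, can be sketched as follows. This is a minimal NumPy illustration, not the authors' code: the flow clipping bound of 20 and the 10 input channels (5 stacked flow fields with x and y components) are assumptions, and both function names are hypothetical.

```python
import numpy as np

def discretize_flow(flow, bound=20.0):
    """Linearly map flow values from [-bound, bound] to integers in [0, 255].

    `bound` is an assumed clipping range, not a value from the paper.
    """
    clipped = np.clip(flow, -bound, bound)
    return np.round((clipped + bound) * (255.0 / (2 * bound))).astype(np.uint8)

def init_temporal_conv1(rgb_weights, flow_channels=10):
    """rgb_weights: (out_ch, 3, kH, kW) first-layer weights of the RGB model.

    Returns (out_ch, flow_channels, kH, kW) weights for the temporal stream.
    """
    mean_w = rgb_weights.mean(axis=1, keepdims=True)  # average over R, G, B
    return np.repeat(mean_w, flow_channels, axis=1)   # replicate per flow channel
```

Only the first convolution layer needs this treatment, since it is the only layer whose input channel count differs between the two streams.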

Finally, the benefits of these practices are summarized as follows:

Exploration of different training strategies for two-stream ConvNets on the UCF101 dataset



Dataset: HMDB51, UCF101

Data Augmentation: location jittering, horizontal flipping, corner cropping, scale jittering.
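A rough sketch of how corner cropping, scale jittering, and horizontal flipping could be combined for an RGB frame is shown below. The sizes follow the TSN paper [1] (256×340 input, crop sides drawn from {256, 224, 192, 168}, 224×224 output). The nearest-neighbour resize and all function names are simplifying assumptions made to keep the example dependency-free.

```python
import numpy as np

CROP_SIDES = (256, 224, 192, 168)  # candidate crop widths and heights from [1]

def corner_crop_offsets(img_h, img_w, crop_h, crop_w):
    """Top-left offsets of the four corner crops and the center crop."""
    return [
        (0, 0), (0, img_w - crop_w),
        (img_h - crop_h, 0), (img_h - crop_h, img_w - crop_w),
        ((img_h - crop_h) // 2, (img_w - crop_w) // 2),
    ]

def resize_nearest(img, out_h, out_w):
    """Nearest-neighbour resize (a stand-in for a proper image resize)."""
    rows = np.arange(out_h) * img.shape[0] // out_h
    cols = np.arange(out_w) * img.shape[1] // out_w
    return img[rows][:, cols]

def augment(img, rng, out_size=224):
    """img: (256, 340, 3) RGB frame. Returns an (out_size, out_size, 3) crop."""
    # Scale jittering: width and height are drawn independently.
    crop_h = int(rng.choice(CROP_SIDES))
    crop_w = int(rng.choice(CROP_SIDES))
    # Corner cropping: pick one of the five fixed positions.
    offsets = corner_crop_offsets(img.shape[0], img.shape[1], crop_h, crop_w)
    top, left = offsets[int(rng.integers(len(offsets)))]
    crop = img[top:top + crop_h, left:left + crop_w]
    if rng.random() < 0.5:
        crop = crop[:, ::-1]  # horizontal flip
    return resize_nearest(crop, out_size, out_size)
```

Drawing width and height independently also jitters the aspect ratio, not just the scale. For optical flow inputs, a horizontal flip would additionally require negating the x component, which this sketch does not handle.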

Training Strategies: SGD with a batch size of 256 and a momentum of 0.9

  • Spatial Networks: the learning rate is initialized as 0.001 and divided by 10 every 2000 iterations. Training stops at 4500 iterations.
  • Temporal Networks: the learning rate is initialized as 0.005 and divided by 10 after 12000 and 18000 iterations. Training stops at 20000 iterations.
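The two step schedules can be written as small pure-Python functions. This is a sketch of the schedules as described above, using a temporal base rate of 0.005 as reported in [1]; the function names are illustrative.

```python
def spatial_lr(iteration, base_lr=0.001, step=2000):
    """Divide the learning rate by 10 every `step` iterations."""
    return base_lr * (0.1 ** (iteration // step))

def temporal_lr(iteration, base_lr=0.005, milestones=(12000, 18000)):
    """Divide the learning rate by 10 after each milestone iteration."""
    drops = sum(iteration >= m for m in milestones)
    return base_lr * (0.1 ** drops)
```

Under these schedules, the spatial stream ends training at a rate of about 1e-5 and the temporal stream at about 5e-5.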


Exploration of different segmental consensus function
Exploration of different networks
Component analysis of the proposed methods
Comparison with other methods


[1] L. Wang et al., “Temporal Segment Networks: Towards Good Practices for Deep Action Recognition,” in Computer Vision – ECCV 2016, Lecture Notes in Computer Science, 2016, pp. 20–36. doi: 10.1007/978-3-319-46484-8_2.