Abstract for this paper: Action recognition is an important task in video understanding. Before this paper, most approaches relied on hand-crafted features rather than deep learning, and the few attempts to apply CNNs did not perform comparably. The paper proposes a two-stream ConvNet architecture that adds a second stream operating on optical flow. The model is trained and evaluated on UCF-101 and HMDB-51, where it was competitive with the state of the art at the time.


Inspiration: A video naturally decomposes into spatial and temporal components. The spatial component consists of the individual frames, which carry information about the scenes and objects. The temporal component is the motion across frames.

Optical Flow

Optical flow describes the apparent motion between adjacent frames. In this paper, the authors explore several optical-flow-based inputs for the Temporal Stream.

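Optical Flow Stacking: A dense optical flow can be seen as a set of displacement vector fields $\mathbf{d}_t$ between adjacent frames $t$ and $t+1$. The vector $\mathbf{d}_t(u,v)$ at point $(u,v)$ can be decomposed into horizontal and vertical components $d_t^x$ and $d_t^y$, so each flow field can be represented as a two-channel image. For an arbitrary frame $\tau$, the input volume $I_\tau \in \mathbb{R}^{w \times h \times 2L}$, built from $L$ consecutive flow fields, is constructed as follows:

$$I_\tau(u,v,2k-1) = d^x_{\tau+k-1}(u,v), \qquad I_\tau(u,v,2k) = d^y_{\tau+k-1}(u,v), \qquad u = [1;w],\ v = [1;h],\ k = [1;L].$$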
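As a rough illustration of the stacking, the sketch below builds the input volume with NumPy. The function name `stack_optical_flow`, the `(T, h, w, 2)` array layout, and the 0-based frame index are assumptions made for this example and do not come from the paper. Array axes are ordered rows first, so the volume has shape `(h, w, 2L)`.

```python
import numpy as np

def stack_optical_flow(flows, tau, L):
    """Build an (h, w, 2L) input volume from L consecutive flow fields.

    flows: array of shape (T, h, w, 2); flows[t, ..., 0] is the horizontal
           and flows[t, ..., 1] the vertical displacement from frame t to t+1.
    tau:   0-based index of the first flow field to use (hypothetical layout).
    """
    flows = np.asarray(flows)
    if tau < 0 or tau + L > flows.shape[0]:
        raise ValueError("need L flow fields starting at tau")
    window = flows[tau:tau + L]                    # (L, h, w, 2)
    # Put the time axis next to the x/y axis, then merge the two so that
    # channels are interleaved as x_1, y_1, x_2, y_2, ..., x_L, y_L.
    volume = np.transpose(window, (1, 2, 0, 3))    # (h, w, L, 2)
    return volume.reshape(volume.shape[0], volume.shape[1], 2 * L)
```

With `L = 10`, for example, the temporal stream would receive a 20-channel input at every spatial location.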

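Trajectory stacking: In optical flow stacking, the channels at location $(u,v)$ sample the flow at the same image position in every frame. Trajectory stacking instead samples the flow along the motion trajectories. Let $\mathbf{d}_t$ denote the displacement field between frames $t$ and $t+1$, and suppose that $\mathbf{p}_k$ is the $k$-th point along the trajectory that starts at $(u,v)$ in frame $\tau$, with $\mathbf{p}_1 = (u,v)$ and $\mathbf{p}_k = \mathbf{p}_{k-1} + \mathbf{d}_{\tau+k-2}(\mathbf{p}_{k-1})$ for $k > 1$. The input volume $I_\tau$ can then be formed as follows:

$$I_\tau(u,v,2k-1) = d^x_{\tau+k-1}(\mathbf{p}_k), \qquad I_\tau(u,v,2k) = d^y_{\tau+k-1}(\mathbf{p}_k), \qquad u = [1;w],\ v = [1;h],\ k = [1;L].$$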
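A minimal sketch of trajectory stacking follows, assuming the same hypothetical `(T, h, w, 2)` flow layout with 0-based indices. Trajectory points generally fall between pixels, so some lookup rule is needed. Nearest-neighbour sampling clipped to the image border is used here purely as an assumption of this example, not as the authors' choice.

```python
import numpy as np

def stack_trajectories(flows, tau, L):
    """Build an (h, w, 2L) volume by sampling flow along trajectories.

    flows[t, r, c] = (dx, dy): displacement of pixel (row r, col c)
    from frame t to frame t+1 (hypothetical layout for this sketch).
    """
    flows = np.asarray(flows, dtype=float)
    T, h, w, _ = flows.shape
    if tau < 0 or tau + L > T:
        raise ValueError("need L flow fields starting at tau")
    # Current trajectory position for every starting pixel.
    ys, xs = np.meshgrid(np.arange(h, dtype=float),
                         np.arange(w, dtype=float), indexing="ij")
    volume = np.empty((h, w, 2 * L))
    for k in range(L):
        # Nearest-neighbour lookup, clipped to the border (assumption).
        r = np.clip(np.rint(ys), 0, h - 1).astype(int)
        c = np.clip(np.rint(xs), 0, w - 1).astype(int)
        d = flows[tau + k, r, c]            # (h, w, 2): flow at p_k
        volume[..., 2 * k] = d[..., 0]      # horizontal component
        volume[..., 2 * k + 1] = d[..., 1]  # vertical component
        # Advance each trajectory: p_{k+1} = p_k + d(p_k).
        xs = xs + d[..., 0]
        ys = ys + d[..., 1]
    return volume
```

The only difference from plain stacking is where the flow is read: at the moving point of each trajectory rather than at the fixed starting pixel.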

Bi-directional optical flow: To also capture displacement in the opposite direction, the authors stack $L/2$ forward flows computed between frames $\tau$ and $\tau+L/2$ and $L/2$ backward flows computed between frames $\tau-L/2$ and $\tau$. The input volume therefore has the same number of channels ($2L$) as before.

Mean flow subtraction: Global motion, such as camera movement, can dominate the displacement fields. To compensate for it, the mean vector is subtracted from each displacement field.
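The mean subtraction can be sketched in a few lines of NumPy. The function name and the `(h, w, 2)` layout of a single displacement field are assumptions made for this example.

```python
import numpy as np

def subtract_mean_flow(flow):
    """Subtract the mean displacement vector from one flow field.

    flow: array of shape (h, w, 2) holding (dx, dy) per pixel.
    """
    flow = np.asarray(flow, dtype=float)
    # Mean over all pixels, kept as shape (1, 1, 2) so it broadcasts.
    mean_vector = flow.mean(axis=(0, 1), keepdims=True)
    return flow - mean_vector
```

A constant shift added to every pixel, as a simple stand-in for camera translation, is removed entirely, while differences between pixels are left unchanged.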

Two-stream Convolutional Networks

Spatial Stream. Semantic information, such as particular objects and scenes, is an important cue for recognizing an action. For example, a person holding a cup is likely to be drinking. The model therefore needs to learn this semantic information from the input video. Fortunately, deep learning models for recognizing static images already existed at the time, such as AlexNet trained on ImageNet. The Spatial Stream uses an architecture similar to AlexNet, takes single frames as input, and recognizes their semantic content.

Temporal Stream. As explained in the optical flow section above, the input to the temporal stream is a sub-volume sampled from the stacked optical flow volume $I_\tau$. Previously, many hand-crafted features, such as kinematic features, were computed from the optical flow. This work instead uses a CNN to learn the features automatically.

Two-Stream ConvNet Architecture

Fusion Methods. The softmax scores of the two streams can be fused in various ways. The paper tries two: averaging the scores, and training a multi-class linear SVM on the stacked scores.
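Fusion by averaging is simple enough to sketch directly. The function name and the example scores below are made up for illustration. The SVM variant is not shown.

```python
import numpy as np

def fuse_by_averaging(spatial_scores, temporal_scores):
    """Average the class scores of the two streams for one video.

    Returns the fused score vector and the index of the predicted class.
    """
    spatial_scores = np.asarray(spatial_scores, dtype=float)
    temporal_scores = np.asarray(temporal_scores, dtype=float)
    if spatial_scores.shape != temporal_scores.shape:
        raise ValueError("both streams must score the same classes")
    fused = (spatial_scores + temporal_scores) / 2.0
    return fused, int(np.argmax(fused))

# Illustrative scores for three classes (not results from the paper).
fused, predicted = fuse_by_averaging([0.6, 0.3, 0.1], [0.1, 0.7, 0.2])
```

In this made-up example the spatial stream alone would pick class 0, but the averaged scores favour class 1, which shows how the temporal stream can change the final decision.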



Training strategy

  • Mini-batch SGD with momentum (set to 0.9).
  • Spatial Net: a sub-image is randomly cropped from the whole frame and randomly flipped horizontally.
  • Temporal Net: the input volume is randomly cropped and flipped.
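The cropping and flipping used for both networks can be sketched as below. The function name, the `(h, w, channels)` layout, and the `crop_size` parameter are assumptions made for this example. The sketch only mirrors pixel positions and leaves the stored values unchanged.

```python
import numpy as np

def random_crop_and_flip(volume, crop_size, rng):
    """Randomly crop a square patch and flip it horizontally half the time.

    volume: array of shape (h, w, channels), either an RGB frame or a
            stacked optical flow volume.
    rng:    a numpy.random.Generator, e.g. np.random.default_rng(0).
    """
    volume = np.asarray(volume)
    h, w = volume.shape[:2]
    if crop_size > h or crop_size > w:
        raise ValueError("crop_size must fit inside the volume")
    # Top-left corner chosen uniformly among all valid positions.
    top = rng.integers(0, h - crop_size + 1)
    left = rng.integers(0, w - crop_size + 1)
    patch = volume[top:top + crop_size, left:left + crop_size]
    if rng.random() < 0.5:
        patch = patch[:, ::-1]  # mirror along the width axis
    return patch
```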

Learning rate schedule

  • Initially set to $10^{-2}$.

  • When training from scratch, the rate is changed to $10^{-3}$ after 50K iterations, then to $10^{-4}$ after 70K iterations.

  • In the fine-tuning scenario, the rate is changed to $10^{-3}$ after 14K iterations, and training stops after 20K iterations.
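The schedule can be written as a small step function. The rates 1e-2, 1e-3 and 1e-4 are the values reported in [1]. The function name and the choice to apply each new rate starting exactly at the stated iteration are assumptions of this sketch.

```python
def learning_rate(iteration, fine_tuning=False):
    """Step learning-rate schedule, with values as reported in [1].

    The boundary handling (new rate applies from the stated iteration
    onward) is an assumption of this sketch.
    """
    if iteration < 0:
        raise ValueError("iteration must be non-negative")
    if fine_tuning:
        # One drop at 14K iterations.
        return 1e-2 if iteration < 14_000 else 1e-3
    # Training from scratch: drops at 50K and 70K iterations.
    if iteration < 50_000:
        return 1e-2
    if iteration < 70_000:
        return 1e-3
    return 1e-4
```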

Multi-task learning: The network is trained jointly on the UCF-101 and HMDB-51 datasets using multi-task learning. The model has two softmax layers on top of the last fully connected layer, one producing scores for each dataset, and each softmax layer has its own loss function.
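One way to picture the two-headed loss is the NumPy sketch below. It follows the description in [1], where each head's loss is computed only on videos from its own dataset and the two losses are summed. The function names, the per-sample `is_ucf` flag, and the use of raw logits are assumptions made for this example.

```python
import numpy as np

def softmax_cross_entropy(logits, label):
    """Cross-entropy of one sample, computed in a numerically stable way."""
    shifted = logits - np.max(logits)
    log_probs = shifted - np.log(np.sum(np.exp(shifted)))
    return -log_probs[label]

def multitask_loss(ucf_logits, hmdb_logits, labels, is_ucf):
    """Sum of the two heads' losses over a mixed mini-batch.

    ucf_logits:  (N, 101) outputs of the UCF-101 head.
    hmdb_logits: (N, 51) outputs of the HMDB-51 head.
    labels[i]:   class index within the dataset sample i comes from.
    is_ucf[i]:   True if sample i is a UCF-101 video, else HMDB-51.
    """
    total = 0.0
    for i, label in enumerate(labels):
        # Each sample only contributes to the loss of its own dataset.
        head = ucf_logits if is_ucf[i] else hmdb_logits
        total += softmax_cross_entropy(head[i], label)
    return total
```

Because both heads share all the layers below them, gradients from either dataset update the same features, which is the point of training on the two datasets together.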


Table: Two-Stream ConvNet accuracy on UCF-101.

Table: Mean accuracy on UCF-101 and HMDB-51.


[1] K. Simonyan and A. Zisserman, “Two-Stream Convolutional Networks for Action Recognition in Videos,” Neural Information Processing Systems, Dec. 2014.