Abstract for this paper: This paper introduces the Transformer architecture to video recognition as a replacement for 3D CNNs, motivated by several benefits. The authors first derive, with complete formulas, how attention is computed and how the model is built in the video setting. Then, to reduce the computational cost, they try several attention schemes and find that divided space-time attention achieves the best accuracy.


Inspiration: Videos and sentences are both sequential, and both need context to resolve ambiguity. Moreover, previous work such as Non-local networks already applies self-attention to video.


  • Transformers impose less restrictive inductive biases than CNNs, which means they can represent a larger class of functions.
  • Transformers capture global information without the limitation of a receptive field.
  • Transformers are faster to train and faster at inference.


Transformer in Video

Input vectors

The original input is a series of RGB frames $X \in \mathbb{R}^{H \times W \times 3 \times F}$, where $H$, $W$, and $F$ are the height, width, and number of frames. As in ViT, each frame is decomposed into $N$ non-overlapping patches of size $P \times P$, where $N = HW/P^2$. Finally, these patches are flattened into vectors $\mathbf{x}_{(p,t)} \in \mathbb{R}^{3P^2}$, with $p = 1,\dots,N$ indexing spatial locations and $t = 1,\dots,F$ indexing frames.
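The decomposition above can be sketched in numpy. This is a minimal sketch with toy dimensions chosen for illustration (the paper's actual clip and patch sizes are not assumed here):

```python
import numpy as np

# Toy dimensions (illustrative only): H x W frames, F frames, P x P patches.
H, W, F, P = 8, 8, 2, 4
N = (H * W) // (P * P)               # patches per frame

video = np.random.rand(F, H, W, 3)   # F RGB frames

# Cut each frame into non-overlapping P x P patches, then flatten each
# patch into a 3*P^2 vector, giving one vector x_(p,t) per patch/frame.
patches = video.reshape(F, H // P, P, W // P, P, 3)
patches = patches.transpose(0, 1, 3, 2, 4, 5)   # (F, H/P, W/P, P, P, 3)
vectors = patches.reshape(F, N, 3 * P * P)      # (F, N, 3*P^2)

print(vectors.shape)   # (2, 4, 48)
```

Each row `vectors[t, p]` is one flattened patch $\mathbf{x}_{(p,t)}$, ready for the linear embedding step.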

In 2D ViT, only the $N$ patches of a single image are used as input.

Linear embedding and class token

Before the transformer blocks, each vector is linearly mapped to an embedding vector $\mathbf{z}^{(0)}_{(p,t)} = E\,\mathbf{x}_{(p,t)} + \mathbf{e}^{pos}_{(p,t)}$, where $E \in \mathbb{R}^{D \times 3P^2}$ and the positional embeddings $\mathbf{e}^{pos}_{(p,t)} \in \mathbb{R}^{D}$ are learnable parameters. Additionally, a learnable classification token is added as the vector $\mathbf{z}^{(0)}_{(0,0)} \in \mathbb{R}^{D}$.

The superscript of $\mathbf{z}$ indicates how many blocks have processed it; the transformer has $L$ encoding blocks in total.

Query/Key/Value and Self-attention

For block $\ell$, the query/key/value vectors of each head are computed as:

$$\mathbf{q}^{(\ell,a)}_{(p,t)} = W^{(\ell,a)}_{Q}\,\mathrm{LN}\big(\mathbf{z}^{(\ell-1)}_{(p,t)}\big) \in \mathbb{R}^{D_h}$$

$$\mathbf{k}^{(\ell,a)}_{(p,t)} = W^{(\ell,a)}_{K}\,\mathrm{LN}\big(\mathbf{z}^{(\ell-1)}_{(p,t)}\big) \in \mathbb{R}^{D_h}$$

$$\mathbf{v}^{(\ell,a)}_{(p,t)} = W^{(\ell,a)}_{V}\,\mathrm{LN}\big(\mathbf{z}^{(\ell-1)}_{(p,t)}\big) \in \mathbb{R}^{D_h}$$

where $a = 1,\dots,\mathcal{A}$ is an index over the attention heads and $D_h = D/\mathcal{A}$. Thus the self-attention weights over the whole video can be written as:

$$\boldsymbol{\alpha}^{(\ell,a)}_{(p,t)} = \mathrm{SM}\!\left(\frac{\mathbf{q}^{(\ell,a)\top}_{(p,t)}}{\sqrt{D_h}}\left[\mathbf{k}^{(\ell,a)}_{(0,0)}\;\big\{\mathbf{k}^{(\ell,a)}_{(p',t')}\big\}_{p'=1,\dots,N;\;t'=1,\dots,F}\right]\right)$$

where $\mathrm{SM}$ denotes the softmax.

Encoding

The result from each head can be represented as:

$$\mathbf{s}^{(\ell,a)}_{(p,t)} = \alpha^{(\ell,a)}_{(p,t),(0,0)}\,\mathbf{v}^{(\ell,a)}_{(0,0)} + \sum_{p'=1}^{N}\sum_{t'=1}^{F} \alpha^{(\ell,a)}_{(p,t),(p',t')}\,\mathbf{v}^{(\ell,a)}_{(p',t')}$$

> After dropping the class-token entry, $\boldsymbol{\alpha}^{(\ell,a)}_{(p,t)}$ can be reshaped into an $N \times F$ matrix, which can then be indexed by $(p',t')$ to obtain a single scalar weight.
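The per-head computation (projections, softmax weights, and weighted sum of values) can be sketched in numpy. This is a minimal sketch with assumed toy dimensions; LayerNorm is omitted and only a single head is shown:

```python
import numpy as np

rng = np.random.default_rng(0)

S, D, Dh = 9, 16, 8     # tokens (class token + N*F patches), model dim, head dim

z = rng.standard_normal((S, D))          # z^(l-1); token 0 is the class token

# Per-head projections W_Q, W_K, W_V (LayerNorm omitted for brevity)
Wq, Wk, Wv = (rng.standard_normal((Dh, D)) for _ in range(3))
q, k, v = z @ Wq.T, z @ Wk.T, z @ Wv.T   # each (S, Dh)

# alpha_(p,t) = softmax(q^T k / sqrt(Dh)) over all tokens, incl. class token
logits = q @ k.T / np.sqrt(Dh)
alpha = np.exp(logits - logits.max(axis=1, keepdims=True))
alpha /= alpha.sum(axis=1, keepdims=True)

# s_(p,t): attention-weighted sum of the value vectors
s = alpha @ v                            # (S, Dh)

print(alpha.shape, s.shape)   # (9, 9) (9, 8)
```

Each row of `alpha` is one weight vector $\boldsymbol{\alpha}_{(p,t)}$, and the corresponding row of `s` is that token's head output $\mathbf{s}_{(p,t)}$.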

Then, the head outputs are concatenated and passed through the following processing:

$$\mathbf{z}'^{(\ell)}_{(p,t)} = W_O \begin{bmatrix} \mathbf{s}^{(\ell,1)}_{(p,t)} \\ \vdots \\ \mathbf{s}^{(\ell,\mathcal{A})}_{(p,t)} \end{bmatrix} + \mathbf{z}^{(\ell-1)}_{(p,t)}$$

$$\mathbf{z}^{(\ell)}_{(p,t)} = \mathrm{MLP}\big(\mathrm{LN}(\mathbf{z}'^{(\ell)}_{(p,t)})\big) + \mathbf{z}'^{(\ell)}_{(p,t)}$$

Obtain result

Finally, the class scores are obtained by a 1-hidden-layer MLP that takes the final class token $\mathrm{LN}\big(\mathbf{z}^{(L)}_{(0,0)}\big)$ as input and predicts a score for each class.

Space-Time Self-Attention

To reduce the computational cost, this paper proposes Divided Space-Time Attention (T+S), where temporal attention and spatial attention are applied separately, one after the other. The time and space attention weights are computed as:

$$\boldsymbol{\alpha}^{(\ell,a)\,\mathrm{time}}_{(p,t)} = \mathrm{SM}\!\left(\frac{\mathbf{q}^{(\ell,a)\top}_{(p,t)}}{\sqrt{D_h}}\left[\mathbf{k}^{(\ell,a)}_{(0,0)}\;\big\{\mathbf{k}^{(\ell,a)}_{(p,t')}\big\}_{t'=1,\dots,F}\right]\right)$$

$$\boldsymbol{\alpha}^{(\ell,a)\,\mathrm{space}}_{(p,t)} = \mathrm{SM}\!\left(\frac{\mathbf{q}^{(\ell,a)\top}_{(p,t)}}{\sqrt{D_h}}\left[\mathbf{k}^{(\ell,a)}_{(0,0)}\;\big\{\mathbf{k}^{(\ell,a)}_{(p',t)}\big\}_{p'=1,\dots,N}\right]\right)$$

Each patch thus performs $(N + F + 2)$ comparisons instead of the $(NF + 1)$ required by joint space-time attention.
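A sketch of the divided scheme in numpy, with the (class-token-free) tokens arranged as an (N, F, Dh) array: temporal attention compares each patch only with the same patch location across frames, then spatial attention compares it with the other patches of the same frame. In the paper, the spatial step re-projects q/k/v from the temporal output; here the same q/k/v are reused for brevity, and all dimensions are assumed toy values:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
N, F, Dh = 4, 3, 8                     # patches per frame, frames, head dim
q = rng.standard_normal((N, F, Dh))
k = rng.standard_normal((N, F, Dh))
v = rng.standard_normal((N, F, Dh))

# Temporal attention: each (p, t) attends over t' = 1..F at the same p
a_time = softmax(np.einsum('pfd,pgd->pfg', q, k) / np.sqrt(Dh))
z_time = np.einsum('pfg,pgd->pfd', a_time, v)     # (N, F, Dh)

# Spatial attention: each (p, t) attends over p' = 1..N in the same frame
a_space = softmax(np.einsum('pfd,qfd->pfq', q, k) / np.sqrt(Dh))
z_space = np.einsum('pfq,qfd->pfd', a_space, v)   # (N, F, Dh)

# Each query now does N + F comparisons instead of N * F (class token omitted)
print(a_time.shape, a_space.shape)   # (4, 3, 3) (4, 3, 4)
```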


Architectures for the different attention schemes

Furthermore, two other attention schemes are also evaluated:

  • Sparse Local Global (L+G)
    1. Computes a local attention by considering the neighboring patches
    2. Calculates a sparse global attention over the entire clip using a stride of 2 patches along the temporal dimension and also the two spatial dimensions.
  • Axial (T+W+H): attention is computed separately along the time, width, and height dimensions.
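To see why the divided and axial schemes are cheaper than joint space-time attention, one can count the keys each query attends to. The numbers below are illustrative assumptions, not values taken from the text (class token ignored):

```python
# Keys attended per query under each scheme, using illustrative
# values: N = 196 patches per frame (a 14 x 14 grid), F = 8 frames.
N, F = 196, 8

joint   = N * F          # joint space-time: every patch in the clip
divided = N + F          # divided (T+S): same location over time, then same frame
axial   = F + 14 + 14    # axial (T+W+H): one row, one column, one time axis

print(joint, divided, axial)   # 1568 204 36
```

The gap grows with clip length, since `joint` scales as $NF$ while `divided` scales as $N+F$.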
Visualization of the various attention schemes


Input: clips, decomposed into patches of size $P \times P$.

  • Sampling a single temporal clip in the middle of the video.
  • Using 3 spatial crops (top-left, center, bottom-right) from the temporal clip.


Different space-time attention schemes
Comparison to other models
Comparison of training parameter quantity with other models
Accuracy on Kinetics-400
Accuracy on Kinetics-600
Accuracy on SSv2 and Diving-48