Deep-Learning on ChengAo Shen

✏️ Gumbel Softmax

Tue, 22 Jul 2025 00:00:00 +0000

Motivation. In many models we need to select a discrete option inside the computation graph (e.g., pick one branch of a network). A hard argmax is non-differentiable, so gradients can’t flow through it. Gumbel-Softmax provides a continuous, differentiable approximation to this discrete sampling step.

Gumbel-Max Trick

Assume we have the following discrete distribution:

$X$	1	2	3
$p$	0.2	0.3	0.5

We want to draw a sample $X$ that follows this distribution. If we sample from the distribution directly, $X$ is not computed as a function of $p$, so $X$ cannot be differentiated w.r.t. $p$. This means backpropagation cannot pass through the sampling step.
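As a rough illustration of the trick this section is named after, the sketch below draws samples by adding standard Gumbel noise to $\log p$ and taking the argmax, which yields the same distribution as sampling from $p$ directly. The function name `gumbel_max_sample` and the `rng` argument are assumptions made for this example, not part of the original post. Note that the argmax is still non-differentiable; it only moves the randomness into the noise term.

```python
import math
import random

def gumbel_max_sample(p, rng):
    """Draw one index from the categorical distribution p via the Gumbel-Max trick."""
    scores = []
    for p_i in p:
        # u ~ Uniform(0, 1); clamp away from 0 so that log(u) is defined.
        u = max(rng.random(), 1e-12)
        # g = -log(-log(u)) is a standard Gumbel(0, 1) sample.
        g = -math.log(-math.log(u))
        scores.append(math.log(p_i) + g)
    # The index with the largest perturbed log-probability is the sample.
    return max(range(len(p)), key=scores.__getitem__)

p = [0.2, 0.3, 0.5]
rng = random.Random(0)
n_samples = 20000
counts = [0] * len(p)
for _ in range(n_samples):
    counts[gumbel_max_sample(p, rng)] += 1
# Empirical frequencies should be close to p.
freqs = [c / n_samples for c in counts]
```

Index 0 here corresponds to $X = 1$ in the table above, index 1 to $X = 2$, and so on.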

📃 Different Normalization

Mon, 21 Jul 2025 00:00:00 +0000

Introduction

Normalization techniques are fundamental to training deep learning models effectively. They help stabilize and accelerate training, improve generalization, and reduce internal covariate shift. Below is a summary of the most common normalization techniques, their mechanisms, key papers, and differences.

🔑 Summary of Different Types of Normalization

Name	Normalized Over	Key Paper	Common Use Cases	Strength	Weakness
Batch Normalization (BN)	For Conv: per channel across B×H×W; for MLP: per feature across B.	Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift (ICML 2015)	Computer vision tasks such as image classification, detection, and segmentation	Stabilizes activation scale; enables larger learning rates; speeds convergence; adds implicit regularization	Less suited to online / streaming / RNN / small-batch settings; can cause issues under domain shift or micro-batch training
Layer Normalization (LN)	Per sample (token) across its feature (hidden) dimensions (e.g. For shape B×L×D or B×D: normalize over D; for Conv rarely used, would be over C×H×W of that sample)	Layer Normalization (arXiv 2016)	Transformers (NLP & Vision), RNNs, small-batch or batch=1 training	Independent of batch size, identical behavior in training & inference, stable for variable-length sequences, improves gradient flow (esp. Pre-LN Transformers)	Provides less implicit regularization, does not leverage cross-sample statistics
Instance Normalization (IN)	For conv input B×C×H×W: each sample & channel independently over its spatial pixels H×W (no cross-batch, no cross-channel).	Instance Normalization: The Missing Ingredient for Fast Stylization (arXiv 2016)	Image generation (GAN generators), image-to-image translation (e.g., style/appearance adaptation)	Batch size–independent, effectively strips instance-specific style (contrast, color cast), aiding fast stylization	Discards global intensity/contrast cues useful for recognition → poorer performance on classification/detection; lacks batch-level regularization
Group Normalization (GN)	For input BxCxHxW: per sample, split channels into G groups (size C/G); compute mean & var over (C/G)xHxW inside each group.	Group Normalization (ECCV 2018)	Small-/micro-batch CNN training, cases where BN fails with batch sizes 1–4.	Batch-size independent; stable for tiny or variable batches; often better than BN when batch is very small.	Extra hyperparameter (G) to tune; less implicit regularization than BN, grouping may not align with the semantic channel structure
Weight Normalization (WN)	Each weight vector of a neuron/output channel.	Weight Normalization: A Simple Reparameterization to Accelerate Training of Deep Neural Networks (NIPS 2016)	RNN / seq models where BN is hard, small-batch or online / RL training (policy & value nets)	Negligible inference cost (can fold into static weights); works with streaming / RL; complements other norms (can combine with LayerNorm)	Scale may drift (need LR tuning); benefit can vanish with strong adaptive optimizers; less helpful for very deep Transformers (other norms preferred)
Spectral Normalization (SN)	Each weight tensor (e.g. matrix / conv kernel reshaped to 2D)	Spectral Normalization for Generative Adversarial Networks (ICLR 2018)	GAN discriminators, robustness / Lipschitz-constrained models, etc.	Enforces (approx.) 1-Lipschitz per layer (controls gradient explosion)	Extra cost (power iteration each step); only constrains the largest singular value (other singular values can still drift)
RMS Normalization	Per sample (token) feature vector	Root Mean Square Layer Normalization (NeurIPS 2019)	Modern Transformer / LLM blocks; very deep pre-norm architectures, low-precision (FP16/BF16)	Simpler & slightly cheaper than LayerNorm, numerically stable in mixed precision, good for very deep stacks (retains strong gradient path)	Does not re-center activations (the mean is not zeroed), so activations may drift; needs careful init/residual scaling; not fully interchangeable with zero-mean LN methods.
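The "Normalized Over" column can be read as a choice of tensor axes to reduce over. The sketch below is a minimal illustration of that idea for LN, IN, GN, and RMSNorm, leaving out the learnable scale and shift; the helper `normalize` and the tensor sizes are assumptions chosen for the example, not code from the cited papers.

```python
import torch

def normalize(x, dims, eps=1e-5):
    """Subtract the mean and divide by the std, both computed over `dims`."""
    mean = x.mean(dim=dims, keepdim=True)
    var = x.var(dim=dims, keepdim=True, unbiased=False)  # biased variance
    return (x - mean) / torch.sqrt(var + eps)

torch.manual_seed(0)
B, C, H, W, G = 4, 6, 5, 5, 3
eps = 1e-5
x = torch.randn(B, C, H, W)

# Layer norm on a conv-shaped input: per sample, across C x H x W.
ln = normalize(x, dims=(1, 2, 3))
# Instance norm: per sample and per channel, across H x W.
inorm = normalize(x, dims=(2, 3))
# Group norm: split C into G groups, normalize inside each group, reshape back.
gn = normalize(x.reshape(B, G, C // G, H, W), dims=(2, 3, 4)).reshape(B, C, H, W)

# RMSNorm on a token tensor (B x L x D): rescale by the root mean square
# of the feature vector, without subtracting the mean.
tokens = torch.randn(B, 7, 16)
rms = tokens / torch.sqrt(tokens.pow(2).mean(dim=-1, keepdim=True) + eps)
```

Only the `dims` argument changes between LN, IN, and GN, which is the main point of the table: the methods differ in which statistics are shared, not in the arithmetic.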

📘 Explanation of How They Work

Batch Normalization (BN)

Batch Normalization is most commonly used in computer vision, typically in CNNs. The input to BN generally has shape $\text{Batch}(B)\times \text{Channel}(C) \times \text{Height}(H)\times\text{Width}(W)$.
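For an input of that shape, the training-mode computation can be sketched as follows: one mean and one variance per channel, pooled over the batch and spatial axes. This is a minimal illustration without the learnable scale and shift or the running statistics used at inference; the tensor sizes are arbitrary assumptions. It is compared against `nn.BatchNorm2d` with `affine=False` as a sanity check.

```python
import torch
from torch import nn

torch.manual_seed(0)
B, C, H, W = 8, 3, 4, 4
x = torch.randn(B, C, H, W) * 2.0 + 1.0

# One mean and one variance per channel, pooled over batch and spatial axes.
mean = x.mean(dim=(0, 2, 3), keepdim=True)                # shape (1, C, 1, 1)
var = x.var(dim=(0, 2, 3), keepdim=True, unbiased=False)  # biased variance
eps = 1e-5
x_hat = (x - mean) / torch.sqrt(var + eps)

# Reference: PyTorch's layer in training mode, without learnable scale and shift.
bn = nn.BatchNorm2d(C, eps=eps, affine=False)
bn.train()
reference = bn(x)
max_diff = (x_hat - reference).abs().max().item()
```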

🤗 Introduction to Generative Models

Mon, 26 Aug 2024 00:00:00 +0000

Generative models are a family of unsupervised learning models that learn from datasets without any labels. Whereas other unsupervised models are used to manipulate, denoise, interpolate between, or compress examples, generative models focus on generating plausible new samples with properties similar to those in the dataset.

Latent variable models: these map the data examples $\mathbf{x}$ to unobserved latent variables $\mathbf{z}$ that capture the underlying structure of the dataset.

📚 Using Custom Dataset in PyTorch

Thu, 26 Oct 2023 00:00:00 +0000

In order to decouple dataset code and model code, PyTorch provides two data primitives: torch.utils.data.DataLoader and torch.utils.data.Dataset. Dataset stores the samples and their corresponding labels, and DataLoader wraps an iterable around the Dataset to enable easy access to the samples.
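As a minimal sketch of how the two primitives fit together, the example below defines a small in-memory dataset and iterates over it in batches. The class name `ToyDataset` and its random contents are hypothetical and exist only for illustration; a custom `Dataset` needs to implement `__len__` and `__getitem__`.

```python
import torch
from torch.utils.data import Dataset, DataLoader

class ToyDataset(Dataset):
    """Hypothetical in-memory dataset: 10 samples with 3 features each."""
    def __init__(self, n_samples=10, n_features=3):
        g = torch.Generator().manual_seed(0)
        self.features = torch.randn(n_samples, n_features, generator=g)
        self.labels = torch.arange(n_samples) % 2  # alternating 0/1 labels

    def __len__(self):
        # Number of samples in the dataset.
        return len(self.labels)

    def __getitem__(self, idx):
        # Return one (sample, label) pair.
        return self.features[idx], self.labels[idx]

dataset = ToyDataset()
loader = DataLoader(dataset, batch_size=4, shuffle=False)
# The DataLoader stacks individual samples into batches.
batch_shapes = [(tuple(x.shape), tuple(y.shape)) for x, y in loader]
```

With 10 samples and a batch size of 4, the last batch holds the remaining 2 samples unless `drop_last=True` is passed to the `DataLoader`.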

Load a Dataset

To work with data, we first need a dataset. Fortunately, PyTorch domain libraries provide a number of pre-loaded datasets, all of which are subclasses of Dataset. We use one of them, Fashion-MNIST, to show how to load a dataset.

💡 Introduction to Transformer

Mon, 23 Oct 2023 00:00:00 +0000

The Transformer is a very popular architecture in modern neural networks. BERT and GPT use it to process natural language, and ViT applies it to computer vision. This essay explains what the Transformer is and why it works. One caveat: limited by my knowledge, I can’t cover the underlying mathematical theory or show Transformer code.

Why do we need the Transformer?

In the NLP (Natural Language Processing) field, text datasets have some obvious properties that prevent us from using a plain MLP.

📝 Common PyTorch Code Snippets

Tue, 21 Mar 2023 00:00:00 +0000

Notes inspired by d2l’s PyTorch content

Device Switching

Try using GPU:

import torch

def try_gpu(i=0):
    """Return GPU i if it exists, otherwise the CPU"""
    if torch.cuda.device_count() >= i + 1:
        return torch.device(f'cuda:{i}')
    return torch.device('cpu')

def try_all_gpus():
    """Find all available GPUs"""
    devices = [torch.device(f'cuda:{i}')
               for i in range(torch.cuda.device_count())]
    return devices if devices else [torch.device('cpu')]

Accumulator

class Accumulator:
    """Used for accumulating values"""
    def __init__(self, n):
        self.data = [0.0] * n

    def add(self, *args):
        """Add a list of values to the existing data"""
        self.data = [a + float(b) for a, b in zip(self.data, args)]

    def reset(self):
        self.data = [0.0] * len(self.data)

    def __getitem__(self, idx):
        return self.data[idx]

Accuracy

For training accuracy evaluation, you can use the following code
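The original snippet is not reproduced in this excerpt. As a stand-in, here is a minimal sketch in the spirit of the d2l helper, assuming `y_hat` holds per-class scores with one row per sample and `y` holds integer class labels; it may differ from the code in the full post.

```python
import torch

def accuracy(y_hat, y):
    """Return the number of correct predictions in a batch."""
    # If y_hat holds per-class scores, take the highest-scoring class.
    if y_hat.ndim > 1 and y_hat.shape[1] > 1:
        y_hat = y_hat.argmax(dim=1)
    # Compare with the labels and count the matches.
    return float((y_hat.type(y.dtype) == y).sum())

scores = torch.tensor([[0.1, 0.9], [0.8, 0.2], [0.3, 0.7]])
labels = torch.tensor([1, 0, 0])
num_correct = accuracy(scores, labels)
acc = num_correct / len(labels)
```

Returning a count rather than a ratio makes it easy to sum results over batches and divide by the total number of samples at the end of an epoch.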

⚕️ Brief introduction of the Tensor in PyTorch

Wed, 26 Oct 2022 00:00:00 +0000

A tensor is a specialized data structure that is very similar to arrays and matrices. We use tensors to encode the inputs and outputs of a model. Tensors can run on GPUs and other hardware accelerators.

Initializing a Tensor

Tensors can be initialized in various ways:

# Import the library
import torch

# Directly from data
data=[[1,2],
      [2,3]]
t_data = torch.tensor(data)

# From Numpy
import numpy as np
np_data = np.array(data)
t_np =  torch.from_numpy(np_data)

# From other tensors
# The new tensor retains the properties (shape, dtype) of the source tensor
t_ones = torch.ones_like(t_data)
# unless they are explicitly overridden, e.g. the dtype
t_random = torch.rand_like(t_data, dtype=torch.float)

# With a given shape
shape = (2, 3)
rand_t = torch.rand(shape)
ones_t = torch.ones(shape)
zeros_t = torch.zeros(shape)

Careful: a tensor created with torch.from_numpy shares its underlying memory with the NumPy array, which means that changing the NumPy array in place changes the tensor too, and vice versa.
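The memory sharing can be checked with a short experiment. The sketch below contrasts `torch.from_numpy`, which shares memory with the array, with `torch.tensor`, which copies the data; the variable names are chosen for this example only.

```python
import numpy as np
import torch

np_data = np.array([[1, 2], [2, 3]])
shared = torch.from_numpy(np_data)  # shares memory with np_data
copied = torch.tensor(np_data)      # torch.tensor copies the data

# Modify the NumPy array in place.
np_data[0, 0] = 100

shared_value = shared[0, 0].item()  # reflects the change
copied_value = copied[0, 0].item()  # keeps the original value
```

If an independent tensor is needed, `torch.tensor(np_data)` or `torch.from_numpy(np_data).clone()` avoids the shared buffer.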