📃 Different Normalization

Mon, 21 Jul 2025 00:00:00 +0000

Introduction

Normalization techniques are fundamental to training deep learning models effectively. They help stabilize and accelerate training and can improve generalization; Batch Normalization in particular was originally motivated as a way to reduce internal covariate shift. Below is a summary of the most common normalization techniques, their mechanisms, key papers, and differences.

🔑 Summary of Different Types of Normalization

Name	Normalized Over	Key Paper	Common Use Cases	Strength	Weakness
Batch Normalization (BN)	For Conv: per channel across B×H×W; for MLP: per feature across B.	Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift (ICML 2015)	Computer vision tasks such as image classification, detection, and segmentation	Stabilizes activation scale; enables larger learning rates; speeds convergence; adds implicit regularization	Less suited to online / streaming / RNN / small-batch settings; can cause issues under domain shift or in micro-batch training
Layer Normalization (LN)	Per sample (token) across its feature (hidden) dimensions (e.g. For shape B×L×D or B×D: normalize over D; for Conv rarely used, would be over C×H×W of that sample)	Layer Normalization (arXiv 2016)	Transformers (NLP & Vision), RNNs, small-batch or batch=1 training	Independent of batch size, identical behavior in training & inference, stable for variable-length sequences, improves gradient flow (esp. Pre-LN Transformers)	Provides less implicit regularization, does not leverage cross-sample statistics
Instance Normalization (IN)	For conv input B×C×H×W: each sample & channel independently over its spatial pixels H×W (no cross-batch, no cross-channel).	Instance Normalization: The Missing Ingredient for Fast Stylization (arXiv 2016)	Image generation (GAN generators), image-to-image translation (e.g., style/appearance adaptation)	Batch-size independent; effectively strips instance-specific style (contrast, color cast), aiding fast stylization	Discards global intensity/contrast cues useful for recognition, giving poorer performance on classification/detection; lacks batch-level regularization
Group Normalization (GN)	For input B×C×H×W: per sample, split channels into G groups (size C/G); compute mean & var over (C/G)×H×W inside each group.	Group Normalization (ECCV 2018)	Small-/micro-batch CNN training, cases where BN fails with batch sizes 1–4.	Batch-size independent; stable for tiny or variable batches; often better than BN when batch is very small.	Extra hyperparameter (G) to tune; less implicit regularization than BN; grouping may not align with the semantic channel structure
Weight Normalization (WN)	Each weight vector of a neuron/output channel.	Weight Normalization: A Simple Reparameterization to Accelerate Training of Deep Neural Networks (NIPS 2016)	RNN / seq models where BN is hard, small-batch or online / RL training (policy & value nets)	Negligible inference cost (can fold into static weights); works with streaming / RL; complements other norms (can combine with LayerNorm)	Scale may drift (need LR tuning); benefit can vanish with strong adaptive optimizers; less helpful for very deep Transformers (other norms preferred)
Spectral Normalization (SN)	Each weight tensor (e.g. matrix / conv kernel reshaped to 2D)	Spectral Normalization for Generative Adversarial Networks (ICLR 2018)	GAN discriminators, robustness / Lipschitz-constrained models, etc.	Enforces (approx.) 1-Lipschitz per layer (controls gradient explosion)	Extra cost (power iteration each step); only constrains the largest singular value (other singular values can still drift)
RMS Normalization	Per sample (token) feature vector	Root Mean Square Layer Normalization (NeurIPS 2019)	Modern Transformer / LLM blocks; very deep pre-norm architectures; low-precision (FP16/BF16) training	Simpler & slightly cheaper than LayerNorm; numerically stable in mixed precision; good for very deep stacks (retains strong gradient path)	Does not re-center activations (the mean is not zeroed), so the mean can drift; may need careful init/residual scaling; not fully interchangeable with zero-mean LN methods.
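
The "Normalized Over" column for the activation-based methods can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the function names, the `EPS` value, and the tensor shapes are assumptions made for the example, and the learnable scale/shift parameters that real layers carry are left out.

```python
import numpy as np

EPS = 1e-5  # small constant for numerical stability (assumed value)


def normalize(x, axes):
    """Zero-mean, unit-variance normalization over the given axes."""
    mean = x.mean(axis=axes, keepdims=True)
    var = x.var(axis=axes, keepdims=True)
    return (x - mean) / np.sqrt(var + EPS)


def batch_norm(x):
    # Per channel, across batch and spatial positions (B, H, W).
    return normalize(x, axes=(0, 2, 3))


def layer_norm(x):
    # Per sample, across all of its features (C, H, W).
    return normalize(x, axes=(1, 2, 3))


def instance_norm(x):
    # Per sample and per channel, across spatial positions (H, W).
    return normalize(x, axes=(2, 3))


def group_norm(x, num_groups):
    # Per sample, across the (C/G, H, W) entries of each channel group.
    b, c, h, w = x.shape
    grouped = x.reshape(b, num_groups, c // num_groups, h, w)
    return normalize(grouped, axes=(2, 3, 4)).reshape(b, c, h, w)


def rms_norm(x):
    # Per token feature vector (last axis); rescales by the RMS, no centering.
    rms = np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + EPS)
    return x / rms


rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=(4, 8, 5, 5))  # B x C x H x W
tokens = rng.normal(size=(4, 10, 16))                  # B x L x D

bn_out = batch_norm(x)
gn_out = group_norm(x, num_groups=4)
rms_out = rms_norm(tokens)
```

Under these definitions, `group_norm(x, 1)` reduces to `layer_norm(x)` and `group_norm(x, 8)` (one channel per group) reduces to `instance_norm(x)`, which shows how GN sits between the two.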

📘 Explanation of How They Work

Batch Normalization (BN)

Batch Normalization is most commonly used in computer vision, typically in CNNs. In that setting, the input to BN has shape $\text{Batch}(B)\times \text{Channel}(C) \times \text{Height}(H)\times\text{Width}(W)$.
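
As a rough illustration of how BN treats an input of that shape, the sketch below computes per-channel statistics over the $B$, $H$, and $W$ axes during training and falls back to running averages at inference. The class name `BatchNorm2dSketch`, the `momentum` and `eps` defaults, and the use of the biased variance for the running estimate are assumptions chosen for simplicity; only the forward pass is shown.

```python
import numpy as np


class BatchNorm2dSketch:
    """Minimal forward-only batch normalization for B x C x H x W inputs."""

    def __init__(self, num_channels, eps=1e-5, momentum=0.1):
        self.eps = eps
        self.momentum = momentum  # update rate of the running averages (assumed)
        shape = (1, num_channels, 1, 1)
        self.gamma = np.ones(shape)         # learnable per-channel scale
        self.beta = np.zeros(shape)         # learnable per-channel shift
        self.running_mean = np.zeros(shape)
        self.running_var = np.ones(shape)

    def __call__(self, x, training=True):
        if training:
            # One mean/variance per channel, pooled over batch and space.
            mean = x.mean(axis=(0, 2, 3), keepdims=True)
            var = x.var(axis=(0, 2, 3), keepdims=True)
            m = self.momentum
            self.running_mean = (1 - m) * self.running_mean + m * mean
            self.running_var = (1 - m) * self.running_var + m * var
        else:
            # Inference uses the accumulated statistics, not the current batch.
            mean, var = self.running_mean, self.running_var
        x_hat = (x - mean) / np.sqrt(var + self.eps)
        return self.gamma * x_hat + self.beta


rng = np.random.default_rng(0)
images = rng.normal(loc=5.0, scale=2.0, size=(16, 3, 4, 4))  # B x C x H x W

bn = BatchNorm2dSketch(num_channels=3)
train_out = bn(images, training=True)   # normalized with batch statistics
eval_out = bn(images, training=False)   # normalized with running statistics
```

The gap between `train_out` and `eval_out` after a single update reflects BN's dependence on batch statistics, which is the root of the small-batch and domain-shift weaknesses listed in the summary table.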

Normalization on ChengAo Shen
