Batch Normalization, Instance Normalization, Layer Normalization: Structural Nuances

This short post highlights the structural nuances between popular normalization techniques employed while training deep neural networks.

I hope that a quick two-minute glance at this will refresh my memory on the concept sometime in the not-so-distant future.

Let us establish some notation that will make the rest of the content easy to follow. We assume that the activations at any layer have dimensions N×C×H×W (in the real number space, of course), where N = batch size, C = number of channels (filters) in that layer, H = height of each activation map, and W = width of each activation map.

Feature Map Dimensions


Generally, normalizing activations requires shifting and scaling them by their mean and standard deviation, respectively. Batch Normalization, Instance Normalization, and Layer Normalization differ in how these statistics are computed.

Normalization

Batch Normalization

In “Batch Normalization”, the mean and variance are calculated for each individual channel across all samples and both spatial dimensions — that is, C pairs of statistics, each computed over N·H·W values.
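As a minimal sketch of this rule in NumPy (the function name is mine, and the learnable scale/shift parameters γ and β used in practice are omitted):

```python
import numpy as np

def batch_norm(x, eps=1e-5):
    """Normalize (N, C, H, W) activations with per-channel batch statistics."""
    # One mean and variance per channel, pooled over the batch (N)
    # and both spatial dimensions (H, W) -> shape (1, C, 1, 1).
    mean = x.mean(axis=(0, 2, 3), keepdims=True)
    var = x.var(axis=(0, 2, 3), keepdims=True)
    # eps guards against division by zero for near-constant channels.
    return (x - mean) / np.sqrt(var + eps)
```

After normalization, each channel has (approximately) zero mean and unit variance when measured across the whole batch.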


Instance Normalization

In “Instance Normalization”, the mean and variance are calculated for each individual channel of each individual sample across both spatial dimensions — that is, N·C pairs of statistics, each computed over H·W values.
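The only change from the batch-norm sketch above is the axes over which the statistics are pooled (again, a NumPy illustration without γ and β):

```python
import numpy as np

def instance_norm(x, eps=1e-5):
    """Normalize (N, C, H, W) activations with per-sample, per-channel statistics."""
    # One mean and variance for each (sample, channel) pair, pooled only
    # over the spatial dimensions (H, W) -> shape (N, C, 1, 1).
    mean = x.mean(axis=(2, 3), keepdims=True)
    var = x.var(axis=(2, 3), keepdims=True)
    return (x - mean) / np.sqrt(var + eps)
```

Because no statistics cross the batch dimension, each sample is normalized independently — which is why this variant behaves identically at train and test time.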

Layer Normalization

In “Layer Normalization”, the mean and variance are calculated for each individual sample across all channels and both spatial dimensions — that is, N pairs of statistics, each computed over C·H·W values.
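Again only the pooling axes change relative to the sketches above (NumPy illustration, γ and β omitted):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize (N, C, H, W) activations with per-sample statistics."""
    # One mean and variance per sample, pooled over channels (C)
    # and both spatial dimensions (H, W) -> shape (N, 1, 1, 1).
    mean = x.mean(axis=(1, 2, 3), keepdims=True)
    var = x.var(axis=(1, 2, 3), keepdims=True)
    return (x - mean) / np.sqrt(var + eps)
```

Like Instance Normalization, this is batch-independent, but it pools all channels of a sample together, so the result for each sample has zero mean and unit variance over its entire C×H×W activation volume.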

I firmly believe that pictures speak louder than words, and I hope this post brings out the subtle distinctions between these popular normalization techniques.



Batch Normalization, Instance Normalization, Layer Normalization: Structural Nuances was originally published in Becoming Human: Artificial Intelligence Magazine on Medium, where people are continuing the conversation by highlighting and responding to this story.

