Convolutional Neural Networks are one of the most exciting advancement in AI. Originally proposed by Yann LeCun et. al. in 1998
CNNs or ConvNets are a variant of regular neural network that take advantage of the spatial nature of the input by making explicit assumption that the input are images. Instead of a linear list of neurons that make up a layer in regular neural network, they have a series of layers that take 3D input volume, perform a differentiable function with or without learnable parameters and transform the input to 3D output volume.
Convolutional neural network benefit from these powerful concepts:
Unlike in a fully connected neural network, CNNs don’t have every neuron in one layer connected to every neuron in the next layer. Instead, we only make connections in small 2D localized regions of the input image called the local receptive field. For each receptive field, there is a different hidden neuron in its first hidden layer. This greatly reduces the number of weights to train and the computational complexity of the network.
We can pad the input matrix around the border by symmetrically adding zeroes. Padding the input makes computation convenient by preserving input volume and prevents fast data loss at border. This padding process is controlled by a hyperparameter called zero padding.
The sliding of the local receptive field can be controlled by another hyperparameter called stride length, which is the number of pixels the field is moved at a time. Both these hyperparameters control the spatial size of the output volume.
Another peculiarity of CNNs is that it uses the same weights and biases for each of the hidden neurons. By sharing the weights and biases, the network is compelled to learn the weights to detect a feature at different parts of the image. This means that all the neurons in the layer detect exactly the same feature, just at different locations in the input.
This makes CNNs well adapted to translation invariance of images, which means the network tries to detect a feature in the image, and once it detects the feature, the location of the feature becomes irrelevant.
These weights that define the feature map are also called kernel or filter. To perform image recognition, we need more than one feature map. So a convolutional layer consists of several different feature maps.
Another big advantage of sharing weights and biases is that it also greatly reduces number of parameters that the network needs to learn, helping in speeding up the learning process. Less number of parameters also means less chance of overfitting the input data.
Pooling layers are another type of layers typically used after convolutional layers. They simplify or summarize the information from the convolution layer by performing a statistical aggregate like average or max by taking each feature map and producing a down sampled feature map.
Most common form of pooling layer used is maxpooling, which simply outputs the maximum activation of region of the convolution layer’s output.
Deep fully connected networks face the problem of learning slowdown because of early saturation of sigmoid and tanh activation functions. To avoid this problem, CNNs are mostly used with rectified linear units or ReLU activation function. It is defined as \(f(z) = max(0,Z)\), which basically means all the negative activations are clamped at 0. ReLU units have shown to outperform networks based on sigmoid and tanh activation functions and are generally implemented as a layer which performs the element wise activation function to the input.
A plot from Krizhevsky et al.
CNNs takes the advantage of spatial information of the input. The Convolutional and Pooling layers learn about local spatial structure in the input image, while the latter fully connected layer learns at more abstract level, integrating the knowledge from these other layers to perform classification.
The architecture is laid out as a series of layers of Convolution, ReLU and pooling layers with fully connected layer/s at the end which produce output scores.
All the layers implement a forward and backward API. Forward propagation calculates the activations, and backward propagation uses the gradient from above and the local gradient to calculate gradients for the layer’s learnable parameters, and the input gradient which is propagated back to the earlier layer.
In deep neural networks, gradients are unstable, leading to either vanishing or exploding in earlier layers. How do CNNs overcome this problem?