Neural Networks without the Math

When was the last time you heard someone use the buzzwords Artificial Intelligence, Deep Learning, or Neural Network? Probably not long ago.

The mathematics behind neural networks can be challenging, but the training process itself is easy to follow by analogy. Neural networks automatically learn feature representations from data for classification, regression, or generation tasks, and mathematical functions are used to supervise that learning process.

In this post, we aim to give a brief understanding of neural networks without mathematical equations.

The neural networks discussed here combine feature-extractor layers with an output layer that produces a vector or scalar activation. The feature extractors consist of multiple kernels that perform convolutions in the hidden layers. Because the input image is decomposed hierarchically during feature extraction, the lower layers extract edges and corners, while the top layers extract higher-level structural representations.

An Input Image

Photo: Unsplash, A.G

Visualizations of the feature maps extracted by a convolution layer in the ResNet50 and VGG16 architectures, respectively. Each block represents a feature pattern captured by the neural network.

ResNet50 conv1 Layer feature map

VGG16 block2_conv1 Layer feature map

Convolution uses kernels (filters) to extract representations from an image. Different kernels extract different features, so tasks such as sharpening, edge detection, and blurring each use a different kernel. Neural networks remove the burden of designing these kernels by hand. A network initializes its kernel weights randomly and predicts an output. The difference between the expected and predicted output is then calculated, and backpropagation updates all the kernel weights. This cycle repeats until the model reaches good accuracy. One full pass over the training images is called an epoch, and the number of images processed before each weight update is the batch size.
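To make the two terms concrete, here is a minimal sketch in plain Python. It assumes the common convention that an epoch is one full pass over the training images and that the batch size is the number of images used per weight update. The `batches` helper and the image names are hypothetical placeholders, and no real training happens inside the loop.

```python
def batches(items, batch_size):
    """Yield consecutive chunks of `items`, each with at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


images = [f"img_{i}" for i in range(10)]  # placeholder names, not real data
batch_size = 4
epochs = 3

updates = 0
for epoch in range(epochs):
    for batch in batches(images, batch_size):
        # A real network would predict on `batch`, measure the loss,
        # backpropagate, and update the kernel weights here.
        updates += 1

# 3 batches per epoch (4 + 4 + 2 images) over 3 epochs
print(updates)
```

With 10 images and a batch size of 4, each epoch performs three weight updates, so a smaller batch size means more (and noisier) updates per epoch.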

A convolution kernel is a small matrix that slides over the image to extract representations from local neighborhoods of pixels; as the layers get deeper, these captured representations are transformed further. A raw image also contains noise, and convolution can be used to smooth it: with an appropriate kernel matrix, we obtain a noise-reduced output image.
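The sliding operation can be sketched in a few lines of plain Python. This is an illustration under simplifying assumptions (one grayscale channel, no padding, a stride of one pixel), not how a deep learning library implements it. The function name `convolve2d`, the toy image, and the two hand-designed kernels are all made up for this example.

```python
def convolve2d(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1) and return the feature map.

    Like most deep learning libraries, this does not flip the kernel,
    so strictly speaking it computes a cross-correlation.
    """
    k_h, k_w = len(kernel), len(kernel[0])
    out_h = len(image) - k_h + 1
    out_w = len(image[0]) - k_w + 1
    output = []
    for row in range(out_h):
        out_row = []
        for col in range(out_w):
            # Weighted sum of the image patch currently under the kernel.
            total = 0.0
            for i in range(k_h):
                for j in range(k_w):
                    total += image[row + i][col + j] * kernel[i][j]
            out_row.append(total)
        output.append(out_row)
    return output


# A tiny 4x5 grayscale "image": dark on the left, bright on the right.
image = [
    [0, 0, 0, 10, 10],
    [0, 0, 0, 10, 10],
    [0, 0, 0, 10, 10],
    [0, 0, 0, 10, 10],
]

# Hand-designed kernels: one responds to vertical edges, one averages (blurs).
vertical_edge = [[-1, 0, 1], [-1, 0, 1], [-1, 0, 1]]
box_blur = [[1 / 9] * 3 for _ in range(3)]

edges = convolve2d(image, vertical_edge)
blurred = convolve2d(image, box_blur)

print(edges)    # zero in the flat dark region, large where dark meets bright
print(blurred)  # the sharp step is spread out into a gradual ramp
```

The same sliding loop produces very different outputs depending only on the numbers in the kernel, which is why learning those numbers matters so much.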

A good analogy

It is impossible to read an image and understand its contents with our eyes closed. But imagine an image surface on which the edges of the content are raised, so they can be felt against the rest of the surface. By sliding a hand across the surface, we can trace the edges and try to decode the contents without seeing the image. Convolution does something similar using kernels. Optimizing the kernel weights is therefore essential for extracting representations efficiently.

Forward Propagation

The pixels of the input image are fed to the input layer. Convolution is applied to the pixels of each channel. Edges are extracted from the pixels, patterns are extracted from the edges, and class labels are predicted from the patterns.

The input image is a grid of pixel values. Pixels come together to form representations in an image, such as the wheels of a car or the ears of a dog. The neural network absorbs these representations. The more varied the images the network sees, the more varied the representations it learns.

Through convolution, the pixel representations are captured by the kernel parameters (the elements of the kernel matrix). The pixel values from the input image are flattened into a pixel sequence. Each pixel in this sequence is multiplied by a weight, and the products of the pixel values (input activations) and weights are added together to give the weighted sum.
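As a rough illustration of that step, the sketch below flattens a 3x3 patch and a 3x3 kernel into sequences and computes one weighted sum. The patch values, the kernel, and the helper name `weighted_sum` are assumptions chosen for the example.

```python
def weighted_sum(pixels, weights):
    """Multiply each pixel by its weight and add up the products."""
    return sum(p * w for p, w in zip(pixels, weights))


# A 3x3 patch of pixel intensities (0.0 = dark, 1.0 = bright).
patch = [
    [0.0, 0.5, 1.0],
    [0.0, 0.5, 1.0],
    [0.0, 0.5, 1.0],
]

# A 3x3 kernel that responds to brightness increasing from left to right.
kernel = [
    [-1, 0, 1],
    [-1, 0, 1],
    [-1, 0, 1],
]

# Flatten both grids row by row into sequences of equal length.
pixel_sequence = [p for row in patch for p in row]
weight_sequence = [w for row in kernel for w in row]

z = weighted_sum(pixel_sequence, weight_sequence)
print(z)  # a single scalar summarizing how well the patch matches the kernel
```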

The weighted sum is a scalar value. To keep it within a useful range, we pass it through an activation function such as ReLU or Sigmoid. Sigmoid squashes the sum into a value between 0 and 1, while ReLU sets negative sums to zero and leaves positive sums unchanged. A non-linear activation lets the network model non-linear relationships between the image features and the output.
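A small sketch of the two activation functions named above, using only Python's standard library. The sample inputs are arbitrary values picked to show a negative, a zero, and a positive weighted sum.

```python
import math


def relu(z):
    """Return z for positive inputs and 0 otherwise."""
    return max(0.0, z)


def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


for z in (-3.0, 0.0, 3.0):
    print(z, relu(z), round(sigmoid(z), 3))
```

Note that only Sigmoid confines its output to a fixed range; ReLU is unbounded for positive inputs and simply removes the negative part.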

The activation output is a measure of how positive the weighted sum is.

In the case of Sigmoid, if we want the activation to light up only beyond some threshold value rather than beyond zero, we add an extra parameter called the bias to the weighted sum. The bias is a measure of inactivity: the activation lights up only when the weighted sum crosses the threshold it sets.
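The effect of the bias can be seen in a short sketch. The function name `neuron` and the bias value of -5.0 are hypothetical choices for illustration: with that bias, the weighted sum has to exceed 5 before the Sigmoid output passes its midpoint of 0.5.

```python
import math


def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def neuron(weighted_sum, bias=0.0):
    """Add the bias to the weighted sum, then apply Sigmoid."""
    return sigmoid(weighted_sum + bias)


# Without a bias, a weighted sum of 3 already lights the neuron up.
print(neuron(3.0))

# With a bias of -5, the same weighted sum leaves the neuron mostly inactive.
print(neuron(3.0, bias=-5.0))

# Only a weighted sum above 5 pushes the activation past 0.5.
print(neuron(7.0, bias=-5.0))
```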

In neural networks, the kernel values are randomly initialized before the first forward propagation. Over multiple epochs of training, these kernel values (weights) are updated with the aim of reducing the prediction loss. The size of the steps taken to reduce the loss is controlled by the learning rate.
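To see how the learning rate scales each update, consider a toy sketch with a single weight. The loss function here is invented for illustration (it is smallest when the weight equals 3), and the three learning rates are arbitrary; a real network would have many weights and a loss computed from data.

```python
def loss(weight):
    """Toy loss, assumed for illustration: smallest when weight == 3."""
    return (weight - 3.0) ** 2


def slope(weight):
    """Slope of the toy loss at `weight`; its sign says which way is downhill."""
    return 2.0 * (weight - 3.0)


def train(learning_rate, steps=20, weight=0.0):
    """Repeatedly nudge the weight downhill; the learning rate scales each nudge."""
    for _ in range(steps):
        weight -= learning_rate * slope(weight)
    return weight


small = train(0.01)     # tiny steps: still far from 3 after 20 updates
moderate = train(0.1)   # reasonable steps: close to 3
too_large = train(1.1)  # oversized steps: overshoots and moves further away

print(small, moderate, too_large)
```

A learning rate that is too small wastes updates, while one that is too large can make the loss grow instead of shrink.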

Back Propagation

When the kernels are randomly initialized, the output predicted in the first forward pass will most probably be inaccurate. A bad weight initialization also makes the model take longer to train. The difference between the predicted and actual output is then calculated. This difference is called the loss. Our objective in training a neural network is to reduce this loss so that the network predicts the actual label. The network predicts each label with a confidence value, and over time we train the model to be more confident in predicting the actual label.
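One common way to turn confidence values into a loss for classification is cross-entropy, used here as an assumption since the text does not name a specific loss. The sketch below scores hypothetical predictions for a two-class problem; the labels and confidence values are made up.

```python
import math


def cross_entropy(predicted_probs, true_label):
    """Loss is small when the network is confident in the true label."""
    return -math.log(predicted_probs[true_label])


# Hypothetical confidence values from a network early and late in training.
unsure = {"cat": 0.4, "dog": 0.6}
confident = {"cat": 0.05, "dog": 0.95}

# The true label is "dog": more confidence in "dog" means a lower loss.
print(cross_entropy(unsure, "dog"))
print(cross_entropy(confident, "dog"))

# If the true label were "cat", the same confident prediction is heavily penalized.
print(cross_entropy(confident, "cat"))
```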

Since the network is connected from the output layer back to the input layer through the hidden layers, it is straightforward to adjust the hidden layers in proportion to the loss incurred at the output layer. For instance, the output layer is connected to its immediate predecessor (hidden) layer, so the loss is reflected in that predecessor’s activations. We update the weights that influence those activations. This pattern continues back to the first hidden layer. These weight updates are repeated until training ends.
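The layer-by-layer flow can be sketched with the smallest possible network: one input, one hidden value, and one output, connected by two weights. Everything here is a simplifying assumption made for illustration (no activation function, a squared-error loss, a single training example, and made-up starting weights), but the order of the updates mirrors the description above.

```python
def train_step(x, target, w1, w2, learning_rate=0.1):
    """One forward pass and one backward pass through a two-weight chain."""
    # Forward pass: input -> hidden -> output.
    hidden = w1 * x
    output = w2 * hidden
    loss = (output - target) ** 2

    # Backward pass: start at the output and walk back toward the input.
    d_output = 2.0 * (output - target)  # how the loss reacts to the output
    d_w2 = d_output * hidden            # blame assigned to the output weight
    d_hidden = d_output * w2            # loss reflected in the hidden activation
    d_w1 = d_hidden * x                 # blame assigned to the first weight

    # Move each weight a small step against its share of the blame.
    return w1 - learning_rate * d_w1, w2 - learning_rate * d_w2, loss


w1, w2 = 0.5, 0.5
losses = []
for _ in range(50):
    w1, w2, loss = train_step(x=1.0, target=2.0, w1=w1, w2=w2)
    losses.append(loss)

print(losses[0], losses[-1])  # the loss shrinks as the updates accumulate
```

The hidden activation is never edited directly; it changes only because the weights feeding into it are updated.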

Karthik R