Basic Convolutional Neural Network (CNN) Architecture

Updated: Apr 21

In my last article, which you can find here, we defined neural networks as the heart of deep learning. Artificial neural networks give machines a form of human-like intelligence by loosely imitating a central part of the human brain (which you can read more about in Joanne’s post here if you are interested): the network of neurons that pass signals throughout the brain to produce a response to a given stimulus. However, there are many types of artificial neural networks. One key type, my personal favorite and specialty, is the convolutional neural network, or CNN. CNNs are primarily used for image classification, one of the most widely useful applications of the technology today. This article gives a conceptual overview of CNNs.

The main reason researchers and experts all around the world use CNNs is image detection. To a computer, an image is essentially a large, multidimensional array of numerical values, one set for each pixel in the image. These values, which encode color and intensity and are usually represented in RGB, are the only inputs passed into an image detection AI. To process them effectively, CNNs follow a certain structure.
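
To make the idea of an image as an array of numbers concrete, here is a minimal sketch in plain Python. The 2 x 2 image and its pixel values are made up for illustration, and scaling the values to the range 0 to 1 is a common preprocessing step assumed here rather than something this article requires.

```python
# A tiny 2 x 2 RGB image as a nested list: image[row][column] = [R, G, B].
# The pixel values are invented for illustration; each channel runs from 0 to 255.
image = [
    [[255, 0, 0], [0, 255, 0]],      # red pixel, green pixel
    [[0, 0, 255], [255, 255, 255]],  # blue pixel, white pixel
]

height = len(image)           # number of rows
width = len(image[0])         # number of columns
channels = len(image[0][0])   # 3 values per pixel: red, green, blue

# Assumed preprocessing step: scale every value from 0..255 down to 0..1.
normalized = [[[value / 255 for value in pixel] for pixel in row] for row in image]

print(height, width, channels)
print(normalized[0][0])
```

Even this tiny image holds 2 x 2 x 3 = 12 numbers, which hints at how quickly the input grows for a real photograph.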

Better image detection is useful and impactful in fields such as healthcare, through medical imaging, and autonomous vehicles, through better object detection.

The structure we will be going into is the basic and most popular CNN architecture. A CNN first takes the image as its input data, which is necessary to build a model. The image then passes through the meat of the model (the convolutional, nonlinear, downsampling, and fully connected layers) to produce an output, which is the detection result.

The convolutional layer of the neural network creates a feature map that holds the features detected by a filter that scans the image a few pixels at a time. These features are later used to predict, label, or identify the image. Adit Deshpande of UCLA explains it with a simple analogy: “Imagine a flashlight that is shining over the top left of the image. Let’s say that the light this flashlight shines covers a 5 x 5 area. And now, let’s imagine this flashlight sliding across all the areas of the input image.” One important thing to note is that each feature has weights attached to it. These are the numbers in the filter, and they determine how important each feature is.
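
The sliding "flashlight" can be sketched in a few lines of plain Python. This is a minimal illustration, not a production implementation: it assumes a single-channel image, a stride of one pixel, no padding, and a hand-picked 2 x 2 filter (a real network learns its filter weights during training). The function name `convolve2d` is hypothetical.

```python
def convolve2d(image, kernel):
    """Slide `kernel` over `image` one pixel at a time (stride 1, no padding)."""
    k_h, k_w = len(kernel), len(kernel[0])
    out_h = len(image) - k_h + 1
    out_w = len(image[0]) - k_w + 1
    feature_map = []
    for top in range(out_h):
        row = []
        for left in range(out_w):
            # Multiply each pixel under the filter by its weight and add them up.
            total = 0
            for i in range(k_h):
                for j in range(k_w):
                    total += image[top + i][left + j] * kernel[i][j]
            row.append(total)
        feature_map.append(row)
    return feature_map


# Made-up 4 x 4 grayscale image: dark on the left, bright on the right.
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]

# Hand-picked filter weights that respond to a dark-to-bright vertical edge.
kernel = [
    [-1, 1],
    [-1, 1],
]

feature_map = convolve2d(image, kernel)
print(feature_map)  # [[0, 2, 0], [0, 2, 0], [0, 2, 0]]
```

The feature map is largest in the middle column, exactly where the image changes from dark to bright, so this filter has "detected" a vertical edge.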

The next layer is the downsampling, or pooling, layer. Now that you have lots of features collected from the convolutional layer, you need to cut down the amount of information you have and keep the features that matter most for your output. Reducing the number of features speeds up computation, since there are fewer features for the algorithm to scan when identifying the image.
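
One common way to downsample is max pooling, which keeps only the largest value in each small window of the feature map. The sketch below assumes 2 x 2 windows that do not overlap. The article does not name a specific pooling method, so treat max pooling as an assumption here; the function name `max_pool2d` and the feature map values are made up for illustration.

```python
def max_pool2d(feature_map, size=2):
    """Keep the largest value in each non-overlapping size x size window."""
    pooled = []
    for top in range(0, len(feature_map) - size + 1, size):
        row = []
        for left in range(0, len(feature_map[0]) - size + 1, size):
            window = [
                feature_map[top + i][left + j]
                for i in range(size)
                for j in range(size)
            ]
            row.append(max(window))
        pooled.append(row)
    return pooled


# Made-up 4 x 4 feature map, as might come out of a convolutional layer.
feature_map = [
    [1, 3, 2, 0],
    [4, 2, 1, 1],
    [0, 1, 5, 6],
    [2, 2, 7, 8],
]

pooled = max_pool2d(feature_map)
print(pooled)  # [[4, 2], [2, 8]]
```

The 16 values shrink to 4, so every later layer has a quarter as much data to process while the strongest responses are preserved.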

The last three layers of the CNN are the fully connected input layer, the fully connected layer, and the fully connected output layer. The fully connected input layer “flattens” the output of the previous layers into a single vector that acts as the input to the next layer. The fully connected layer takes that vector and applies a weight to each feature in order to predict the correct label. The final layer, the fully connected output layer, generates the final probabilities used to classify the image and predict its label, concluding the network.
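
These three steps can be sketched in plain Python as follows. The weights, biases, and two-class setup are invented for illustration (a trained network would have learned them), and softmax is assumed as the way to turn scores into probabilities since the article does not name one. The helper names `flatten`, `dense`, and `softmax` are hypothetical.

```python
import math


def flatten(feature_map):
    """Fully connected input layer: turn a 2D feature map into one vector."""
    return [value for row in feature_map for value in row]


def dense(inputs, weights, biases):
    """Fully connected layer: one weighted sum of all inputs per output."""
    return [
        sum(w * x for w, x in zip(weight_row, inputs)) + bias
        for weight_row, bias in zip(weights, biases)
    ]


def softmax(scores):
    """Output layer: convert raw scores into probabilities that sum to 1."""
    peak = max(scores)  # subtracting the peak keeps exp() from overflowing
    exps = [math.exp(score - peak) for score in scores]
    total = sum(exps)
    return [value / total for value in exps]


pooled = [[4, 2], [2, 8]]      # made-up output of the pooling layer
vector = flatten(pooled)       # [4, 2, 2, 8]

# Invented weights and biases for two classes (index 0 and index 1).
weights = [
    [0.1, 0.0, 0.0, 0.2],
    [0.0, 0.3, 0.1, 0.0],
]
biases = [0.0, 0.5]

scores = dense(vector, weights, biases)
probabilities = softmax(scores)
predicted = probabilities.index(max(probabilities))

print(vector)
print(probabilities)
print(predicted)
```

The class with the highest probability becomes the predicted label; with these made-up numbers that is class 0.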

There are countless tutorials, articles, and videos available online. If you are interested in seeing this applied, feel free to explore programming CNNs using TensorFlow and Keras.


Yash Gupta is the head programmer for Helyx and has programmed a lot, working mainly in deep learning, but also in robotics. If you have any questions about this or anything deep learning related, please feel free to contact him @

© 2020 by The Helyx Initiative