By: Natalie Oulikhanian
As humans, we take our visual cortex for granted every day. Without a second thought, we not only see our environment immediately but, more importantly, process the objects that surround us, such as a tree or a friend walking by. Our other senses follow the same rule: hearing someone speak is not the same as listening to and understanding what they are saying, just as seeing an object is not the same as classifying what it is. Researchers have tried, and continue to try, to give computers a similar ability: their own interpretation of the visual data they are given. This area of study is referred to as Computer Vision (CV), an interdisciplinary field under the larger subject of Artificial Intelligence (AI), which tries to replicate narrow aspects of human behaviour within the limits of how computers work.
To understand how a computer can interpret the world around us in a way similar to humans, we must first establish the differences between how humans and computers detect, classify, and recognize objects. When humans look at a scene or a single object, we immediately match its distinctive features to an object we have previously been taught to recognize. This is why, in some cases, we might detect an object such as a vase in places that merely resemble a vase’s shape. Similarly, we commonly see faces in objects that are clearly not faces, since we recognize the properties that make up a face (such as the two eyes, nose, and mouth in the image above). Computers, however, are not as advanced in this respect and work differently. The human skills of quickly recognizing patterns in our environment, generalising from previous experience, and adapting our recognition to variations in the image are not shared with machines. Computers cannot interpret images or data the way a human understands a visual scene. Instead, they view each piece of data as a collection of numbers; for an image, this is an array of numbers corresponding to the colour values of each pixel. Depending on the data given, the array can either be in colour, where each pixel is a combination of red, green, and blue values (an array of 1 x 1 x 3 for one pixel), or in greyscale, where each pixel is described only by its brightness (an array of 1 x 1 x 1 for one pixel).
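To make the idea of an image as an array of numbers concrete, here is a minimal sketch in plain Python. The tiny 2 x 2 image, the helper names `shape` and `to_greyscale`, and the choice to average the three colour values into one brightness value are all assumptions made for illustration; real systems usually use a numerical library and a weighted average.

```python
# A tiny 2 x 2 colour image. Each pixel is [red, green, blue], with values 0-255.
colour_image = [
    [[255, 0, 0], [0, 255, 0]],      # a red pixel, then a green pixel
    [[0, 0, 255], [255, 255, 255]],  # a blue pixel, then a white pixel
]


def shape(image):
    """Return (height, width, channels) of a nested-list image."""
    return (len(image), len(image[0]), len(image[0][0]))


def to_greyscale(image):
    """Replace each pixel's three colour values with one brightness value.

    Assumption: brightness is the plain average of red, green, and blue.
    """
    return [[[sum(pixel) // 3] for pixel in row] for row in image]


grey_image = to_greyscale(colour_image)

print(shape(colour_image))  # (2, 2, 3): three numbers per pixel
print(shape(grey_image))    # (2, 2, 1): one number per pixel
print(grey_image[0][0])     # [85]: the red pixel as a single brightness
```

A single pixel taken from either image has exactly the 1 x 1 x 3 or 1 x 1 x 1 layout described above.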
For a computer to recognize people by matching them to an image or video, the first step is to detect whether the data it is given includes a face, and where exactly the face or faces are located. To do this, computers adopt the same concept of identifying distinctive features that humans use to interpret images. The identifying characteristics of a face may be two eyes, a nose, a mouth, or a jaw. Although we could list out the features we think are necessary to make up a face, a computer is able to find them itself and ends up with a variety of features to look for. Early face detection methods were built on human knowledge and rules that humans wrote into programs; however, this technique was quickly overtaken as more suitable technology was introduced. Instead of handpicking every rule to build an incredibly complicated program, researchers found a better solution in letting the computer learn to solve the problem itself by discovering its own distinctive features.
This is where Machine Learning (ML), a subfield of AI concerned with giving machines the ability to learn from data, becomes involved in the task of detection. More specifically, a technology within ML called Deep Learning (DL) uses artificial neural networks to solve the detection problem. As the name suggests, artificial neural networks are loosely analogous to the biological networks in the visual cortex, which process information by letting neurons influence each other in a layered reaction. The type of artificial neural network most commonly used for detecting objects in images is the Convolutional Neural Network (CNN). During training, thousands of images labelled as a face (or any other object, depending on the detection task) are fed into the CNN so that the machine learns for itself the distinctive features that make up that class of image. The learning process uses the layered system of artificial neurons to detect small characteristics first, such as edges, curves, or simple colours, and then combines them to detect larger, more identifiable features, such as eyes, noses, and mouths. Once trained, when a new and unseen array of pixels is sent to the machine, it applies what it has learned to detect whether a face is present and marks its location on the image with a bounding box. By simulating biological neural networks, a computer is able to detect faces by identifying facial features and characteristics, much like a human. If properly trained, the technology can even give machines the possibility of being more accurate than humans at telling apart similar-looking objects such as a Hungarian sheepdog and a mop. This is because CNNs allow computers to recognize characteristics that we might not pick up on due to our bias towards generalising from experience.
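The way the first layer of a CNN picks out edges can be sketched with a single filter slid across a small greyscale image. This is only an illustration in plain Python: the function name `convolve2d`, the image values, and the `vertical_edge` filter are assumptions chosen by hand, whereas a real CNN learns the numbers in its filters during training and stacks many such layers.

```python
def convolve2d(image, kernel):
    """Slide the kernel over the image (no padding) and sum the products.

    Like the layers in most CNNs, this does not flip the kernel first.
    """
    kernel_h, kernel_w = len(kernel), len(kernel[0])
    out_h = len(image) - kernel_h + 1
    out_w = len(image[0]) - kernel_w + 1
    output = []
    for y in range(out_h):
        row = []
        for x in range(out_w):
            total = 0
            for i in range(kernel_h):
                for j in range(kernel_w):
                    total += image[y + i][x + j] * kernel[i][j]
            row.append(total)
        output.append(row)
    return output


# Greyscale image: dark on the left, bright on the right.
image = [
    [0, 0, 0, 9, 9, 9],
    [0, 0, 0, 9, 9, 9],
    [0, 0, 0, 9, 9, 9],
    [0, 0, 0, 9, 9, 9],
]

# Hand-picked filter that responds where brightness rises from left to right.
vertical_edge = [
    [-1, 0, 1],
    [-1, 0, 1],
    [-1, 0, 1],
]

response = convolve2d(image, vertical_edge)
for row in response:
    print(row)  # [0, 27, 27, 0] for both rows
```

The response is zero over the flat dark and flat bright regions and large only where the filter overlaps the boundary between them, which is the sense in which the filter "detects" an edge.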
Once a face’s location has been detected, facial recognition identifies who the detected face belongs to in two steps. The first is preparing the face so that all faces share a common baseline, which simplifies comparisons. That means trying to minimize the effect of things like perspective and face orientation: by identifying standard landmarks on a face such as the eyes, nose bridge, and jawline, the machine can try its best to make the face look like it is facing straight ahead, using affine transformations such as shearing, rotation, and scaling.
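As a rough sketch of this preparation step, the code below applies one kind of affine transformation, a rotation, to a few landmark points so that the two eyes end up level. The function name `align_eyes`, the landmark coordinates, and the choice to rotate about the midpoint between the eyes are assumptions for illustration; a full system would also scale and shear, and would transform every pixel rather than only a few points.

```python
import math


def align_eyes(landmarks, left_eye, right_eye):
    """Rotate landmark points about the eye midpoint so the eyes sit level."""
    # Angle of the line joining the eyes, relative to horizontal.
    dx = right_eye[0] - left_eye[0]
    dy = right_eye[1] - left_eye[1]
    angle = math.atan2(dy, dx)

    # Centre of rotation: the point halfway between the eyes.
    cx = (left_eye[0] + right_eye[0]) / 2
    cy = (left_eye[1] + right_eye[1]) / 2

    # Rotating by the opposite angle cancels the tilt.
    cos_a, sin_a = math.cos(-angle), math.sin(-angle)
    aligned = []
    for x, y in landmarks:
        tx, ty = x - cx, y - cy
        aligned.append((cx + tx * cos_a - ty * sin_a,
                        cy + tx * sin_a + ty * cos_a))
    return aligned


# Hypothetical landmark positions (x, y) for a tilted face.
left_eye, right_eye, nose = (30.0, 40.0), (70.0, 60.0), (45.0, 70.0)
aligned = align_eyes([left_eye, right_eye, nose], left_eye, right_eye)

# After alignment both eyes share (almost exactly) the same y value.
print(round(aligned[0][1], 6), round(aligned[1][1], 6))
```

Because a rotation preserves distances, the spacing between the landmarks is unchanged; only the orientation of the face is normalised.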
The second step is analysing the face for key features. There are many successful methods for this task, and today most or all of them rely significantly on CNNs and DL for training. In one example, the key features established during face detection are discarded. This is because those features are not an accurate representation of how we can, or computers should, recognize faces. For example, if you try to remember and imagine someone’s face, it is often their distinguishing features that stand out, such as their freckles, rather than the typical features that make something a face in the first place.
Similarly to the use of DL in object detection, this method of facial recognition uses a CNN to decide which measurements of a face are most useful, through another training process. Each training step uses three images: two photos of the same person (photos A and B) and a photo of someone different (photo C). The machine adjusts its measurements so that those of A and B are as similar as possible while those of A and C are as different as possible. This is repeated over many such triplets until the machine produces a set of measurements that is distinctive for each person. Then, when an unknown image is fed through the system (such as a person using facial recognition on a phone), the computer only needs to find which known person has the nearest measurements to the unidentified face to identify them.
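The three-photo training idea and the nearest-measurements lookup can be sketched as follows. One common way to score a triplet is a so-called triplet loss, used here as an assumption rather than as the specific method behind any particular system. The three-number measurement lists, the `margin` value, and the names `triplet_loss` and `identify` are all made up for illustration; real systems use measurements produced by a trained CNN, typically with many more numbers per face.

```python
def squared_distance(a, b):
    """Squared straight-line distance between two lists of measurements."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def triplet_loss(photo_a, photo_b, photo_c, margin=0.2):
    """Score one triplet: zero when A is closer to B than to C by the margin.

    Training would adjust the network to push this value towards zero.
    """
    same = squared_distance(photo_a, photo_b)       # same person
    different = squared_distance(photo_a, photo_c)  # different people
    return max(0.0, same - different + margin)


def identify(unknown, known_faces):
    """Return the name whose stored measurements are nearest to the unknown face."""
    return min(known_faces,
               key=lambda name: squared_distance(unknown, known_faces[name]))


# Hypothetical measurements for photos A and B (same person) and C (someone else).
photo_a = [0.1, 0.9, 0.3]
photo_b = [0.2, 0.8, 0.3]
photo_c = [0.9, 0.1, 0.7]

print(triplet_loss(photo_a, photo_b, photo_c))        # 0.0: already well separated
print(triplet_loss(photo_a, photo_c, photo_b) > 1.0)  # True: a badly arranged triplet

known_faces = {"person_1": photo_b, "person_2": photo_c}
print(identify(photo_a, known_faces))  # person_1
```

Once the measurements are good, identification needs no further learning: it reduces to the simple nearest-match search in `identify`.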
Having understood how object detection and recognition systems work, it is necessary to bring up how recognizing faces and people differs from recognizing other objects. With faces come the problems of privacy, consent, and the rising concern over inadequate data about people. To recognize faces, the training data must be accurate and must represent, without bias, the variety of differences in race, skin colour, facial features, accessories, makeup, and much more. If the data is not carefully selected, training will be inaccurate, and the entire identification or detection process will therefore be flawed. To prevent false allegations and misidentifications, especially given how often this system is proposed for public security, the data must be reliable and relevant, and those controlling the technology must have the right intentions. It is also important to note, however, that the technology is not only used to identify faces; it has a wide variety of applications involving objects that do not raise questions of consent or privacy. For example, the same techniques have led to major advances in healthcare, where object detection and recognition are used to help diagnose skin disease from medical images. Deep Learning and related fields in Artificial Intelligence promise new approaches to computer programming built on the concept of having machines learn for themselves.
1. How do computers replicate the visual cortex to classify images?
When we detect and classify objects, we look for the features that represent an object. For example, in a face, we may be able to list facial features such as a mouth or eyes, but we unconsciously detect these by interpreting smaller characteristics such as edges or curves. To classify images, computers use the same technique: artificial neural networks filter an image (to the computer, an array of numbers given by pixel values) for the features of an object and output what type of object the computer detects.
2. How harmful are facial recognition systems?
A significant difference between applying object recognition technology to faces and applying it to other objects is the problem of the privacy of the individuals being identified, and whether or not the technology should be used for security purposes, such as identifying criminals. The recognition process is not capable of bias by itself; however, factors such as inadequate data fed to the system, or the intentions of the individual using the technology, can reduce the accuracy of these systems and lead to improper and damaging uses. Addressing these factors can help reduce the harm that facial recognition systems can do to the individuals affected by their applications.
No changes were made, Wikimedia Commons, Magnus Manske: https://commons.wikimedia.org/wiki/File:Faces_in_places_(3205402229).jpg, License: Creative Commons Legal Code
Image made by Natalie Oulikhanian