This article was written by Akvelon Data Scientist/Machine Learning Researcher Daniel Diroff. It was originally published on Medium.
In November 2019, Akvelon, Inc. attended the North American AI & Big Data Expo in Santa Clara, California to share some of the exciting projects that the team has been working on.
One such project was a computer vision task aiming to analyze the attitude conveyed by a person’s facial expression in real time. We emphasize the word attitude here because in this context it is fairly common to ask the model to predict emotion instead, i.e. one of: happiness, sadness, fear, disgust, anger, surprise and neutral. These 7 emotions stem from the basic emotions proposed by the 20th-century psychologist Paul Ekman. For us, attitude will be a cruder descriptor: simply positive, neutral or negative.
The idea behind developing a real-time attitude classifier was twofold: first, to gain experience working with and demoing a deep convolutional neural network model, and second, to build a full working application that can aid in conducting an interview of a prospective employee. The latter is indeed being used internally here at Akvelon. The hope is that the model can help the interviewer gauge how the discussion is going, and possibly adjust the approach if things seem to be going in the wrong direction.
Demoed at the AI and Big Data Expo was a glimpse at this internal tool. In real time, the model would process a video stream of the individual and make frequent predictions at various points in time. The history of the results is logged on the screen, allowing the user to see, for example, the shift between positive and negative as one goes back and forth between smiles and angry expressions. Anyone can try it for free here. Feedback on the model was very positive, and the demo acted as an attention grabber, attracting people to come and stay near the Akvelon booth.
The Model: As with many computer vision tasks, we utilized a deep convolutional neural network to make our predictions. The model was trained to accept a standardized image and output 3 real values, each of which is thought of as being the “score” or “confidence level” associated with each of the possible 3 attitudes. The final prediction is then the attitude with maximal score.
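As a minimal sketch of that last step, the snippet below picks the attitude with the maximal score. The class order in `ATTITUDES` and the example scores are made up for illustration; the article does not state how the three outputs are ordered.

```python
import torch

# Hypothetical class order; the article does not specify it.
ATTITUDES = ["negative", "neutral", "positive"]

def predict_attitude(scores: torch.Tensor) -> str:
    """Return the attitude whose score is largest."""
    return ATTITUDES[int(torch.argmax(scores))]

# Made-up scores standing in for the network's three outputs.
example_scores = torch.tensor([0.2, 1.1, 3.4])
print(predict_attitude(example_scores))
```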
Also, as with many computer vision or advanced machine learning tasks, we utilized the transfer learning technique, as the amount of data (and research) required to develop and train a good model from scratch is impractical to attain in many situations. We researched and tested several different pre-trained models with varying architectures but eventually landed on a variant of a ResNet, Resnet50_ferplus_dag. This model and its pre-trained weights are due to Samuel Albanie, Arsha Nagrani, Andrea Vedaldi and Andrew Zisserman. More information on the model and many others can be found here. The relevant academic papers [R1], [R2] by these authors can be found in the references section at the end of this article. We are very appreciative of their great work.
The basic emotions mentioned above vary slightly depending on the source. Specifically, our pre-trained model was designed to output predictions for a variant with 8 basic emotions, and thus the final layer of the network was an 8-output linear layer. To align with our desired predictions, this final layer was replaced with a linear layer with 3 outputs.
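The head swap can be sketched as follows. This is only an illustration: `StandInNet` is a tiny hypothetical network used in place of the real pre-trained model, and the attribute name `classifier` is an assumption, since the actual model may name its final layer differently.

```python
import torch
from torch import nn

class StandInNet(nn.Module):
    """Tiny hypothetical stand-in for the pre-trained network (8-output head)."""

    def __init__(self, num_features: int = 16, num_outputs: int = 8):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Flatten(),
            nn.Linear(3 * 8 * 8, num_features),
            nn.ReLU(),
        )
        # "classifier" is an assumed name for the final linear layer.
        self.classifier = nn.Linear(num_features, num_outputs)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.classifier(self.backbone(x))

model = StandInNet()

# Replace the 8-output head with a fresh 3-output head of the same input width.
in_features = model.classifier.in_features
model.classifier = nn.Linear(in_features, 3)

batch = torch.randn(4, 3, 8, 8)   # made-up input batch
scores = model(batch)
print(scores.shape)               # torch.Size([4, 3])
```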
All fine-tuning of the model was done with Facebook’s deep learning framework PyTorch. PyTorch allowed us to easily take advantage of the free GPU access available via Google’s Colab. This was very significant, as the dataset we gathered had in excess of 49,000 examples. For a great overview and introduction to PyTorch, see the talk by Stefan Otte given at PyData Berlin 2018, available on YouTube.
The dataset gathered and used for our task was a combination of a couple of Kaggle datasets:
- [R4] fer2013 — Challenges in Representation Learning: Facial Expression Recognition Challenge
- [R5] Emotion Detection From Facial Expressions
These were initially labeled for the more standard task of predicting emotion rather than attitude. To address this, we mapped emotions to attitudes: “happiness” went to “positive”, “neutral” went to “neutral” and the rest went to “negative”. Of course, this decision can be scrutinized, but it seemed the most natural choice for at least the first iteration of the project. Perhaps this could be revisited in the future.
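The mapping just described is small enough to write down directly. The sketch below assumes the labels are available as lowercase-able strings such as "happiness"; the label spellings in the actual Kaggle files may differ.

```python
# Emotions that do not map to "negative"; everything else falls through.
EMOTION_TO_ATTITUDE = {
    "happiness": "positive",
    "neutral": "neutral",
}

def to_attitude(emotion: str) -> str:
    """Map an emotion label to positive / neutral / negative."""
    return EMOTION_TO_ATTITUDE.get(emotion.lower(), "negative")

for emotion in ["happiness", "neutral", "sadness", "fear", "surprise"]:
    print(emotion, "->", to_attitude(emotion))
```

Note that under this rule "surprise" lands in "negative", which is one of the choices that could be revisited.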
Data Preprocessing: Most of the examples from the above two datasets consisted of grayscale images of size 48 by 48. To prepare them to be fed into the neural network, the images were rescaled to 224 by 224, as required by the pre-trained model, before being transformed into PyTorch tensors and standardized (subtracting the mean and dividing by the standard deviation).
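A rough sketch of these preprocessing steps, using only core PyTorch, might look like the following. The `MEAN` and `STD` values are placeholders because the article does not give the statistics that were used, and copying the gray channel to three channels is an assumption about what the pre-trained network expects.

```python
import torch
import torch.nn.functional as F

# Placeholder statistics; the values actually used are not stated in the article.
MEAN, STD = 0.5, 0.5

def preprocess(gray_image: torch.Tensor) -> torch.Tensor:
    """Turn a 48x48 uint8 grayscale image into a standardized 3x224x224 tensor."""
    x = gray_image.float().div(255.0)        # scale pixel values to [0, 1]
    x = x.unsqueeze(0).unsqueeze(0)          # shape (1, 1, 48, 48) for interpolate
    x = F.interpolate(x, size=(224, 224), mode="bilinear", align_corners=False)
    x = (x - MEAN) / STD                     # standardize
    # Copy the single gray channel to three channels (assumption).
    return x.squeeze(0).repeat(3, 1, 1)

# Random pixels standing in for a real face image.
image = torch.randint(0, 256, (48, 48), dtype=torch.uint8)
print(preprocess(image).shape)               # torch.Size([3, 224, 224])
```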
We split our data randomly into train and test sets with a 70%/30% split, and for the training set specifically we applied some data augmentation by means of random cropping and horizontal flipping. Data augmentation with PyTorch need not mean literally enlarging your training set; rather, it effectively enlarges it by means of prescribed transforms. A good explanation of this can be found in this Stack Overflow post. Essentially, one can assign transformations to a PyTorch dataset so that when a data loader iterates over it, these transformations are applied to each example as it is fetched. Thus, if there is a random element to these transformations (random cropping, for example), they will effectively augment the dataset, since the same image can look different from one epoch to the next.
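To make the idea concrete, here is a minimal sketch of a 70%/30% split with an augmentation applied only to the training set. `FaceDataset` and `random_flip` are hypothetical helpers written for this illustration, the tensors are random stand-ins for real images, and only the horizontal flip is shown (random cropping would be attached the same way).

```python
import torch
from torch.utils.data import Dataset, DataLoader

class FaceDataset(Dataset):
    """Hypothetical dataset that applies an optional transform on each access."""

    def __init__(self, images, labels, transform=None):
        self.images, self.labels, self.transform = images, labels, transform

    def __len__(self):
        return len(self.images)

    def __getitem__(self, idx):
        image = self.images[idx]
        if self.transform is not None:
            image = self.transform(image)   # re-drawn randomly on every access
        return image, self.labels[idx]

def random_flip(image: torch.Tensor) -> torch.Tensor:
    """Flip the image horizontally with probability 0.5."""
    if torch.rand(1).item() < 0.5:
        return torch.flip(image, dims=[-1])
    return image

# Random stand-ins for 100 grayscale 48x48 images with 3 attitude labels.
images = torch.rand(100, 1, 48, 48)
labels = torch.randint(0, 3, (100,))

# Random 70%/30% split of the indices.
perm = torch.randperm(len(images))
n_train = (7 * len(images)) // 10
train_idx, test_idx = perm[:n_train], perm[n_train:]

train_set = FaceDataset(images[train_idx], labels[train_idx], transform=random_flip)
test_set = FaceDataset(images[test_idx], labels[test_idx])  # no augmentation

train_loader = DataLoader(train_set, batch_size=16, shuffle=True)
batch_images, batch_labels = next(iter(train_loader))
print(batch_images.shape, batch_labels.shape)
```

The stored training set still has 70 images, but each pass over `train_loader` may see a different mix of flipped and unflipped versions.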
Training: Our first attempt consisted of loading the pre-trained model, replacing the final layer with an appropriate one and retraining only this final layer. While the results were encouraging, we ultimately improved the model by retraining not only the new output layer but the previous 3 layers as well. PyTorch makes this easy to accomplish: you can “freeze” all parameters with two simple lines of code, namely a loop over model.parameters() that sets requires_grad = False on each parameter.
Thus, to allow retraining of only the final four layers, we first froze all parameters and then immediately overwrote the final four layers. As PyTorch by default creates newly defined parameters with requires_grad=True, this was a quick and easy way to accomplish what we wanted.
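The freezing pattern can be sketched on a small stand-in network as below. The `nn.Sequential` model is hypothetical and much smaller than the real one. Note also one deliberate difference: the new 3-output head is a freshly defined layer, but for the three preceding layers this sketch re-enables gradients on the existing layers so that their pre-trained weights are kept, which is an assumption about the intent rather than a confirmed detail of the original code.

```python
from torch import nn

# Hypothetical stand-in for the pre-trained network: five linear layers,
# ending in an 8-output head (indices 0, 2, 4, 6 and 8 are the linear layers).
model = nn.Sequential(
    nn.Linear(32, 32), nn.ReLU(),
    nn.Linear(32, 32), nn.ReLU(),
    nn.Linear(32, 16), nn.ReLU(),
    nn.Linear(16, 16), nn.ReLU(),
    nn.Linear(16, 8),
)

# Freeze every parameter of the pre-trained network.
for param in model.parameters():
    param.requires_grad = False

# A newly defined layer has requires_grad=True by default, so the new
# 3-output head is trainable as soon as it replaces the old one.
model[8] = nn.Linear(16, 3)

# Make the three preceding linear layers trainable again (assumption:
# their pre-trained weights are kept rather than re-initialized).
for layer in (model[2], model[4], model[6]):
    for param in layer.parameters():
        param.requires_grad = True

trainable = [name for name, p in model.named_parameters() if p.requires_grad]
print(trainable)
```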
Furthermore, after tweaking the various hyperparameters, we settled on the SGD optimizer and a learning rate scheduler with step size = 2, gamma = 0.1 and an initial learning rate of 0.01. These values were found by trial and error; perhaps more fine-tuning could be done to make further improvements.
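For readers unfamiliar with step schedulers, the sketch below shows how these settings play out over 6 epochs using PyTorch's `StepLR`. The one-layer model is a placeholder, the training loop body is omitted, and other optimizer options such as momentum are left at their defaults because the article does not mention them.

```python
import torch
from torch import nn

model = nn.Linear(4, 3)  # placeholder model, just to have parameters

optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=2, gamma=0.1)

learning_rates = []
for epoch in range(6):
    learning_rates.append(optimizer.param_groups[0]["lr"])
    # ... one epoch of training would run here ...
    optimizer.step()   # called before scheduler.step(), as PyTorch expects
    scheduler.step()   # multiplies the learning rate by gamma every 2 epochs

print(learning_rates)
```

With these settings the learning rate is 0.01 for epochs 1 and 2, drops to roughly 0.001 for epochs 3 and 4, and to roughly 0.0001 for epochs 5 and 6.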
In the end, with a training set of approximately 34,000 examples and a freely available GPU via Google Colab, training took just under 40 minutes for 6 epochs. Beyond 6 epochs, we saw the effects of overfitting.
Results: A goal of ours at this stage was not only to compute some relevant metrics for our own model, but also to compare them with those of other established computer vision products. One such product is Microsoft’s Face API, which, among many other things, performs the more classical task of assigning one of the 7 basic emotions (or rather a stochastic vector of emotion probabilities or confidence scores) to each image. By applying our crude emotion-to-attitude bucketing technique, we can tweak their model to align more closely with ours.
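One way to bucket a vector of emotion scores into attitudes is sketched below. The article does not say whether the scores were summed per bucket or whether only the top emotion was mapped, so the summing shown here is an assumption, and the score dictionary is made up rather than taken from an actual API response.

```python
def bucket_scores(emotion_scores: dict) -> dict:
    """Sum per-emotion scores into positive / neutral / negative buckets."""
    attitude_scores = {"positive": 0.0, "neutral": 0.0, "negative": 0.0}
    for emotion, score in emotion_scores.items():
        if emotion == "happiness":
            attitude_scores["positive"] += score
        elif emotion == "neutral":
            attitude_scores["neutral"] += score
        else:
            attitude_scores["negative"] += score
    return attitude_scores

def predict_from_emotions(emotion_scores: dict) -> str:
    """Return the attitude bucket with the largest summed score."""
    buckets = bucket_scores(emotion_scores)
    return max(buckets, key=buckets.get)

# Made-up scores for illustration only.
scores = {"happiness": 0.4, "neutral": 0.1, "sadness": 0.3, "anger": 0.2}
print(bucket_scores(scores))
print(predict_from_emotions(scores))
```

In this made-up example the single most likely emotion is happiness, yet the summed negative bucket wins, which shows that the two bucketing variants can disagree.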
We admit that this approach is not completely kosher as in the end we would be comparing our model, which was trained specifically to classify by attitude, to a somewhat tweaked version of Microsoft’s model which was not trained to specifically classify by our categories. Nevertheless, the results were the following:
The most basic metric to highlight is simply the accuracy, with our model achieving over 84%.
The other two metrics, the Matthews Correlation Coefficient and Cohen’s Kappa Coefficient, are somewhat more sophisticated ways of measuring the quality of a multiclass classifier. More info on these can be found here and here. The results above were computed over a common test set of just over 1,000 images because of the limits of the free version of Microsoft’s Face API that we used.
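For reference, all three metrics can be computed from a confusion matrix with the standard multiclass formulas, as in the plain-Python sketch below. The labels are a toy example, not the article's data, and the helper names are made up for this illustration.

```python
import math

def confusion_matrix(y_true, y_pred, num_classes):
    """matrix[t][p] counts examples with true class t and predicted class p."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix

def classification_metrics(matrix):
    """Return (accuracy, Matthews correlation coefficient, Cohen's kappa)."""
    k = len(matrix)
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(k))
    true_counts = [sum(matrix[i]) for i in range(k)]
    pred_counts = [sum(matrix[i][j] for i in range(k)) for j in range(k)]
    agreement = sum(t * p for t, p in zip(true_counts, pred_counts))

    accuracy = correct / total

    # Cohen's kappa: observed agreement corrected for chance agreement.
    expected = agreement / total ** 2
    kappa = (accuracy - expected) / (1 - expected) if expected < 1 else 0.0

    # Multiclass Matthews correlation coefficient.
    denom = math.sqrt(
        (total ** 2 - sum(p * p for p in pred_counts))
        * (total ** 2 - sum(t * t for t in true_counts))
    )
    mcc = (correct * total - agreement) / denom if denom else 0.0
    return accuracy, mcc, kappa

# Toy labels: 0 = negative, 1 = neutral, 2 = positive.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 1, 2, 2, 2]
accuracy, mcc, kappa = classification_metrics(confusion_matrix(y_true, y_pred, 3))
print(round(accuracy, 3), round(mcc, 3), round(kappa, 3))  # 0.833 0.783 0.75
```

Unlike plain accuracy, both coefficients account for how often predictions and labels would agree by chance, which matters when one class (here likely "negative") dominates the data.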
In the end, we were happy with this result and with how it was received at the expo. We absolutely encourage users to try the demo out themselves here. Akvelon has been internally focusing its resources on machine learning education and acumen amongst its employees, and this project was one of several that came out of that initiative. With this, Akvelon is determined to be a player in the recent wave of data science and machine learning.
[R1]: Samuel Albanie, Arsha Nagrani, Andrea Vedaldi and Andrew Zisserman. “Emotion Recognition in Speech using Cross-Modal Transfer in the Wild.” ACM Multimedia, 2018.
[R2]: Jie Hu, Li Shen, Samuel Albanie, Gang Sun and Enhua Wu. “Squeeze-and-Excitation Networks.” IEEE transactions on pattern analysis and machine intelligence (2019).
[R5]: (Data Source) Emotion Detection From Facial Expressions, Kaggle Competition.
Daniel Diroff is a data scientist/machine learning researcher at Akvelon. Daniel has published more papers on Akvelon’s work, including his original paper on Bitcoin Coin Selection with Leverage which you can read here.