The data to compare
We compared the algorithms on the basis of the available test datasets. For this purpose we took the emotion recognition technologies developed by Microsoft and Affectiva, then added Amazon and, of course, Neurodata Lab’s own emotion recognition unit. All of them recognize emotions by analyzing facial expressions. We used affective video data with discrete labels, since all the compared algorithms were trained to work with emotion labels, that is, distinct categories of emotions. We chose SAVEE, AFEW, and RAVDESS, the last of which was released only a year ago.
Unless we have to deal with the matter professionally, we usually take for granted that some technologies are more accurate than others. Yet it is interesting to understand what makes an emotion recognition algorithm perform one way or another.
In machine learning tasks, the usual practice is to create a dataset containing examples of the objects an algorithm has to classify, for instance, cars or chairs. For emotion recognition purposes, it is common to record actors playing out certain conversational situations and imitating emotions; such acted emotional expressions can be unnatural or exaggerated. Algorithms trained on acted data therefore perform with low accuracy on ‘in-the-wild’ data. When an emotion is played by an actor, we cannot be sure it is expressed the way people would really express it. Partly for this reason, natural data are so valuable for developers, but at the same time difficult to work with due to background noise and other limitations, such as large variability in emotional expression and missing or obstructed channels (the face unseen, or the voice not clearly heard).
In any case, the data in these sets are short audiovisual fragments, each expressing one particular emotion. How do we know which one it is? Quite a lot of people, so-called ‘annotators’, watch each fragment and manually indicate what emotion it contains. The results of this procedure may differ depending on the annotators’ cultural background, as the patterns of emotional expression differ across cultures as well.
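The annotation step can be sketched as a simple majority vote per fragment. The function and the sample labels below are hypothetical, just to illustrate the idea:

```python
from collections import Counter

def majority_label(annotations):
    """Pick the emotion most annotators agreed on for one fragment,
    along with the share of annotators who chose it."""
    counts = Counter(annotations)
    label, votes = counts.most_common(1)[0]
    return label, votes / len(annotations)

# Five hypothetical annotators watching the same fragment.
label, agreement = majority_label(
    ["sadness", "sadness", "fear", "sadness", "neutral"])
print(label, agreement)  # sadness 0.6
```

Real annotation pipelines also track inter-annotator agreement, which is exactly where the cultural differences mentioned above show up.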
Thus, the accuracy of an emotion recognition algorithm is determined by the agreement between the emotion predicted by the machine and the emotion indicated by the annotators. Today most algorithms have learned to distinguish among 6 emotions — happiness, sadness, fear, disgust, anger, surprise — and a neutral state. These are sometimes called ‘basic’ emotions (which is actually a myth).
Each of the datasets, SAVEE, AFEW, and RAVDESS, includes from 480 to 1440 fragments with these emotions. Since SAVEE and RAVDESS are acted datasets, in this article we will concentrate specifically on the results and samples from AFEW, which contains fragments from famous and well-acted scenes of contemporary cinema. Even though acted, these scenes are not refined and are as close to real life as possible.
The comparison results
We examined each of the 7 affective states: 6 emotions and a neutral state. It turned out that some emotions were relatively easy for all 4 algorithms to detect, while others were quite difficult for most of them. All in all, some algorithms performed better than others.
Again, we took:
These algorithms work on the principle of single-frame analysis: they split the video stream into frames and detect emotions in each of them as if they were dealing with single images.
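A minimal sketch of this single-frame pipeline, assuming a hypothetical per-frame classifier (`detect_emotion` here is a stub; a real system would run a trained model on every frame):

```python
from collections import Counter

def detect_emotion(frame):
    # Stub standing in for a real per-frame model.
    return "happiness" if frame["smile_score"] > 0.5 else "neutral"

def classify_video(frames):
    """Single-frame analysis: label each frame independently,
    then aggregate the per-frame labels (here by majority vote)."""
    per_frame = [detect_emotion(f) for f in frames]
    video_label, _ = Counter(per_frame).most_common(1)[0]
    return per_frame, video_label

frames = [{"smile_score": s} for s in (0.9, 0.8, 0.3, 0.7)]
per_frame, video_label = classify_video(frames)
print(per_frame)    # ['happiness', 'happiness', 'neutral', 'happiness']
print(video_label)  # happiness
```

Because each frame is treated in isolation, such systems ignore temporal context; the aggregation step is what turns frame labels into a verdict for the whole fragment.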
The results for the three datasets are in the table below, for SAVEE, AFEW, and RAVDESS respectively.
Table 1. A plus indicates that in most videos the algorithm detected the right emotion for more than half of the video fragment’s length (F-score > 0.2). A highlighted plus indicates the best result among the four algorithms. The results are the average F-score over all video fragments for each emotion category in the dataset.
Strictly speaking, we didn’t measure detection accuracy. In the emotion recognition task, the algorithms had to classify emotional states into 7 categories, which they did with some precision and recall. Instead of accuracy, we measured the F-score, the harmonic mean of precision and recall, which combines both metrics into one. You can read a very nice Wikipedia entry on that.
Also, if an algorithm guessed the emotions at random, its F-score would be about 0.14 (one out of seven). For our purposes, we set the F-score benchmark at 0.2, comfortably above the random-guess level.
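The metric and the 0.14 baseline can be checked in a few lines (a sketch of the standard formula, not any vendor’s API):

```python
def f_score(precision, recall):
    """F-score: the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# A uniform random guesser over 7 balanced classes gets
# precision = recall = 1/7 per class, so its F-score is 1/7 too.
baseline = f_score(1 / 7, 1 / 7)
print(round(baseline, 2))  # 0.14
```

The harmonic mean punishes imbalance: an algorithm that names one emotion for everything gets high recall on that class but poor precision, and the F-score stays low.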
The top-3: Happiness, Sadness, Surprise
Happiness, Sadness, and Surprise were among the best detectable emotions. This may be because their expressive manifestations are quite intense and clearly distinguishable, such as a smile or an open mouth. On the other hand, these emotional categories are usually better represented in the datasets: there are simply more recordings of happy, sad, or surprised people.
Microsoft definitely takes the palm in these three categories. Neurodata Lab and Amazon keep up with the leader, while Affectiva didn’t perform very well (it managed to detect any emotion at all in only 2/5 of the AFEW files).
Tough nuts: Anger, Neutral, Disgust, Fear
These four were the emotions with the most erroneous performance. Affectiva has done well with the acted expressions of Disgust, with Neurodata Lab, Amazon, and Microsoft performing well on the RAVDESS dataset; at the same time, no algorithm was able to cope with the naturalistic affective data of AFEW. Almost all algorithms coped with Anger, except for Affectiva. Microsoft performed best on the emotionless Neutral faces (Affectiva does not have this affective category at all), but only Neurodata Lab managed to correctly recognize Fear.*
*We should note that Amazon does not detect Fear but Confusion, while Neurodata Lab recognizes Anxiety instead.
The emotional rating
Let’s now have a look at what was confused with what. Since we mostly gave examples from the AFEW dataset, we will now illustrate these results via confusion matrices (a short instruction to understand these better). On the left side of each matrix are the actual emotions expressed by the people; at the bottom are the predicted emotions, the results of the algorithm’s work. The darker the square, the more predictions were made in that particular category.
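For readers who want to build such a matrix from their own labels, here is a minimal sketch; the sample labels are made up for illustration, not real AFEW results:

```python
EMOTIONS = ["anger", "disgust", "fear", "happiness",
            "neutral", "sadness", "surprise"]

def confusion_matrix(actual, predicted):
    """matrix[a][p] counts fragments annotated as `a` (rows, left side)
    that the algorithm predicted as `p` (columns, bottom)."""
    matrix = {a: {p: 0 for p in EMOTIONS} for a in EMOTIONS}
    for a, p in zip(actual, predicted):
        matrix[a][p] += 1
    return matrix

# Made-up labels just to show the shape of the matrix.
actual    = ["happiness", "happiness", "fear", "sadness"]
predicted = ["happiness", "surprise", "fear", "sadness"]
cm = confusion_matrix(actual, predicted)
print(cm["happiness"]["surprise"])  # 1 (one happy fragment misread as surprise)
```

A perfect algorithm would put all the counts on the diagonal; the darkest off-diagonal squares show which pairs of emotions the algorithm confuses most.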