Emotion recognition in motion: why static images lie

Emotion recognition technology has been around for more than a decade. Recently, companies have begun to realize that to detect emotions correctly one needs a multimodal approach and has to analyze human emotions via several channels simultaneously: face, voice, physiology and movements, even the simplest ones.

There are some pitfalls, of course. For example, you cannot say much about a sound at any individual moment, which is why it is better to analyze voice in dynamics. Not surprisingly, the same goes for visual channels. Take the analysis of facial expressions.

Today companies like Affectiva, Amazon and Microsoft analyze videos frame by frame. We also began our technological journey with this classic approach: splitting the video into frames and detecting emotions in each of them, one by one. Yet from a scientific point of view, it is not really a good idea. Why? Because it eliminates the context that is vital for emotion recognition, and that has consequences.

In this article we will look at the most interesting experiments on the nature of facial emotional expression and perception, and try to show why and how we can teach computers to analyze emotions in dynamics.

The science behind emotion dynamics

Historically, facial expressions have been studied out of context: most studies of emotional perception use static faces as stimuli. In other words, scientists ask people to look at static images of emotional expressions, each isolated from the others, and indicate which expressions they see in the images.

In 2013, Krumhuber, Kappas and Manstead reviewed the role that dynamic features play in the perception of facial emotional behavior. Their review showed that facial expressions are easier to distinguish when they are observed dynamically than when subjects are presented with static images. What's more, people tend to react more intensely to dynamic facial expressions than to those frozen in separate images. Electromyographic experiments, for example, showed that dynamic facial expressions provoke stronger facial mimicry in observers and are associated with higher physiological activation. The more dynamic the face, the more intense the reaction.

This result is not really surprising. Our facial expressions are always accompanied by verbal and non-verbal cues that carry meaning of their own and demand their own analysis. A single image cannot reflect these shades of meaning because it is taken out of context.

In most cases, observing dynamic changes in the face contributes to more accurate recognition of emotions. As an experiment by Tobin, Favelle and Palermo (2016) showed, we process both static and dynamic facial expressions holistically, but we also apply so-called analytical strategies, examining facial features separately. In such cases, motion can help by emphasizing or disambiguating the emotion: we interpret specific emotion-related signs on the face better when we can observe how they move.

How long is an emotion?

Even a facial micro-expression lasting less than 500 ms (Yan et al., 2013) goes through the typical phases of onset, apex and offset. Static images most often capture the "apex" of the emotion, its peak moment: it is assumed that the peak of the emotional experience corresponds to the highest point of mimic expression.

However, emotions are tricky, and it is easy to assume one thing based on an isolated image when in fact it means something else entirely. For example, we tend to associate an open mouth and widened eyes with surprise, though in certain contexts it might just as well be fear, or many other things, depending on the situation.

Typical development of a facial expression with onset, apex and offset. Credit: Bernin et al. (2018).

Aren’t dynamics just a series of static images?

It could have been true, but no. Not really.

One might think that, since a video is a series of static images, emotion recognition improves simply because of the increase in static information. This explanation does not hold, however: an experiment by Ambadar, Schooler, and Cohn (2005) showed that identification of facial expressions was significantly better for videos than for series of images.

To test this, they asked people to indicate which emotional expressions they saw in two types of videos. The first type, dynamic videos, were ordinary recordings of facial expressions changing over time. The second type, multi-static videos, contained the same number of frames, but a "mask" inserted between frames disrupted the apparent movement of the facial muscles and therefore its perception. Observers were better at discriminating dynamic videos than multi-static ones, but only when the facial expressions were subtle. The dynamic sequence thus seemed to be a functionally different type of information, one that could not be reduced to additional static signals and existed only as a whole.

Credit: Ambadar, Schooler, and Cohn (2005).

At the same time, when facial expressions were intense, the difference between recognition in videos and in image series disappeared: the expression contained enough information by itself, and there was no need for anything extra. For less expressive faces, for example the relatively neutral faces typical of someone browsing the internet, the role of observing motion went beyond the simple detection of changes in the face.

One more study is illustrative in this respect. In 2008, Bould, Morris and Wink conducted an experiment in which participants had to recognize emotions in videos where only the first (neutral) and the last (peak) frames were shown, even if the faces were not very expressive. Subjects recognized emotions much worse in such videos than when the same emotions were shown in the full video fragments. Perceiving the way a facial expression changes gives the observer an advantage: we are highly sensitive to temporal changes that carry motion signals, and this is particularly important for the early recognition of facial expressions.

That is why in real life we can often tell what a person might feel or what he or she is going to say next: we read their non-verbal cues and the puzzle comes together. Three static images of different phases of an emotion may point us in a certain direction, but they do not provide the essential information that lets us identify the emotion for sure.

Can AI understand emotions in natural dynamics?

As we have already mentioned, emotional expression is not a one-dimensional notion but manifests itself through several channels. Though computers can analyze these channels and extract ever more information, they still lack an understanding of the overall context. Unlike humans, they do not have the abstract grasp of the situation that might give them the key. However, it is possible to teach it to them.

One way to do that is to teach machines to track emotions as they gradually change over time. For this, researchers use so-called recurrent neural networks. In such networks, "neurons" do not simply pass information forward but also exchange it with each other: in addition to a new piece of incoming data, a neuron receives information about the previous state of the network.

For visual analysis of facial expressions, that is, for a sequence of frames (a video), such an algorithm predicts which emotions are present at a given moment depending on what it has seen before, taking the context into account. Unfortunately, simple recurrent networks tend to remember too much of the past, and since emotions change rapidly, this can distort the results. Today, emotion recognition algorithms designed to process dynamic information solve this problem with LSTM (long short-term memory) networks, which gate the incoming information about a facial expression so that only relevant data is kept.
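To give a rough idea of how this works, here is a minimal sketch in PyTorch of an LSTM running over pre-extracted per-frame facial features. The feature size, number of emotion classes and layer sizes are assumptions made for the example; this is not Neurodata Lab's actual architecture.

```python
# A minimal sketch, assuming per-frame facial features have already been
# extracted (here: 128-dimensional embeddings) and there are 6 emotion
# classes. All sizes and names below are illustrative, not a real system.
import torch
import torch.nn as nn

class DynamicEmotionClassifier(nn.Module):
    def __init__(self, feature_dim=128, hidden_dim=64, num_emotions=6):
        super().__init__()
        # The LSTM keeps a gated memory of previous frames, so each new
        # prediction is made "in context" rather than from a single image.
        self.lstm = nn.LSTM(feature_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, num_emotions)

    def forward(self, frame_features):
        # frame_features: (batch, num_frames, feature_dim)
        outputs, _ = self.lstm(frame_features)
        # One prediction per frame, each conditioned on everything seen so far.
        return self.head(outputs)  # (batch, num_frames, num_emotions)

# Example: 2 clips, 75 frames each (~3 seconds at 25 fps), 128-d features.
model = DynamicEmotionClassifier()
clips = torch.randn(2, 75, 128)
per_frame_logits = model(clips)
print(per_frame_logits.shape)  # torch.Size([2, 75, 6])
```

Because each output depends on the hidden state accumulated over the previous frames, the prediction for a given frame is made in the context of the preceding seconds rather than in isolation.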


Today at Neurodata Lab we use dynamic analysis of facial expressions: the conclusion the system draws about which emotions people express is based not on the analysis of each separate frame, but on the combined analysis of several seconds of video. Single-frame analysis may work for pronounced emotions, but it easily produces frames where a lifted corner of the lips is not a happy smile but simply a mouth opening during speech, and widened eyes are not necessarily a signal of fear. Have a look at these videos and see for yourself how both algorithms work: single-frame (top) vs. multi-frame (bottom). Credit: Neurodata Lab.
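To make the contrast concrete, here is a toy sketch of the two decision strategies: one label per frame versus one label per window of several seconds, with evidence pooled over the window before deciding. The data layout and the window length are assumptions for the example, not the actual Neurodata Lab pipeline.

```python
# A toy illustration, assuming one logits vector per frame produced by any
# per-frame or recurrent model; the 75-frame window (~3 s at 25 fps) is an
# assumption made for the example.
import torch

def single_frame_decision(per_frame_logits):
    # Independent decision for every frame: context-free and jittery.
    return per_frame_logits.argmax(dim=-1)          # (num_frames,)

def multi_frame_decision(per_frame_logits, window=75):
    # One decision per window of several seconds: evidence is pooled
    # over time before an emotion label is chosen.
    num_frames = per_frame_logits.shape[0]
    decisions = []
    for start in range(0, num_frames, window):
        chunk = per_frame_logits[start:start + window]
        decisions.append(chunk.mean(dim=0).argmax().item())
    return decisions

logits = torch.randn(150, 6)                        # 150 frames, 6 emotions
print(single_frame_decision(logits)[:10])           # noisy per-frame labels
print(multi_frame_decision(logits))                 # two window-level labels
```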

Dynamic analysis gives better, more accurate results and, more importantly, brings us closer to understanding how human emotions work. With more advanced analysis and an understanding of context, complex cognitive states as well as hidden, mixed and faked emotions would stop being a problem for computers.

This will bring human-computer interaction to a fundamentally new level. Sure, it will take some time, but it is worth it. An AI that fully understands how and why we express particular emotions can be useful in every area from banking to healthcare. It can be life-changing.


References

Ambadar, Z., Schooler, J., & Cohn, J. (2005). Deciphering the enigmatic face: The importance of facial dynamics in interpreting subtle facial expressions. Psychological Science, 16, 403–410.

Bernin, A. et al. (2018). Automatic Classification and Shift Detection of Facial Expressions in Event-Aware Smart Environments. PETRA ’18 June 23–26, 2018, Corfu Island, Greece.

Bould, E., Morris, N., & Wink, B. (2008). Recognising subtle emotional expressions: The role of facial movements. Cognition & Emotion, 22, 1569–1587.

Krumhuber, E. G., Kappas, A., & Manstead, A. S. R. (2013). Effects of Dynamic Aspects of Facial Expressions: A Review. Emotion Review, 5(1), 41–46. doi:10.1177/1754073912451349

Tobin, A., Favelle, S., & Palermo, R. (2016). Dynamic facial expressions are processed holistically, but not more holistically than static facial expressions. Cognition and emotion, 30(6), 1208–1221.

Yan, W.-J., et al. (2013). How Fast Are the Leaked Facial Expressions: The Duration of Micro-Expressions. Journal of Nonverbal Behavior, 37(4).


***

Authors: Elizaveta Zaitseva, SMM Specialist at Neurodata Lab; Mariya Malygina, Junior Research Scientist at Neurodata Lab.


You are welcome to comment on this article in our blog on Medium.