Computer captures individual attention

Originally published at Industry of Things (in German) on November 18, 2018

Facial expressions and gestures are important. But whether one person is listening to another can be recognized above all from speech itself. Researchers from Ulm and Moscow reached this conclusion, which is by no means self-evident, in experiments on automated attention recognition (Affective Computing).

People have an unmistakable sense of whether the person they are talking to is listening attentively or not. This is because facial expressions, gestures and body language are quite revealing, at least to other humans. Researchers from Ulm and Moscow have investigated which features a computer can use to capture a person’s attention during a conversation. The system was trained with more than 26,000 video fragments. The result: listeners reveal the most about their “engagement” through their speech.

Affective Computing can be used in many ways

Automatic emotion recognition is a field of computer science that is as innovative as it is lucrative. It can be applied in highly automated driving, the advertising industry, digital medicine and many other areas of human-computer interaction. The programs in use today are already more or less able to analyze human emotional states.

These include not only the parameters of emotional well-being, but also those of attention and involvement. “We have now investigated which features and methods tell the computer the most about whether people are actively involved in a listening situation or not,” explains Dmitrii Fedotov. The systems analyst is pursuing his doctorate under Professor Wolfgang Minker at the Institute of Communications Engineering at Ulm University. The 25-year-old was born and raised in Krasnoyarsk (Siberia), where he studied at the Reshetnev Siberian State University of Science and Technology. Two years ago, Fedotov came to Ulm from this highly renowned Russian university, which is one of Ulm University’s five strategic partner universities (U5).

For this research project, Dmitrii Fedotov cooperated closely with three Moscow scientists from Neurodata Lab. The young company, with offices in Italy, Switzerland, Russia and the USA, specializes in Artificial Intelligence research, Affective Computing and Data Mining.

For the project, Neurodata Lab assembled a huge dataset of video material on the so-called EmotionMiner platform. Scene after scene was systematically annotated “by hand” according to fixed criteria. What emotions do speakers and listeners show? Is the listener attentive or unfocused? In total, more than 26,000 fragments from 981 videos were processed. The short sequences, each around four seconds long, show people in communicative situations and come from publicly accessible video recordings of conversations, interviews, debates and talk shows in English. Each video sequence was assessed by ten human annotators. About 1,500 people were involved in the annotation.
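As an illustration of how ten annotator judgments per fragment could be collapsed into a single reference label, here is a minimal sketch using a simple majority vote. The article does not say how the annotations were actually aggregated, so the voting rule, the label names and the function name majority_label are all assumptions made for this example.

```python
from collections import Counter


def majority_label(votes):
    """Return the most frequent label among the votes, or None on a tie.

    Hypothetical helper: the aggregation rule used in the study is not
    described in the article; majority voting is an assumption.
    """
    if not votes:
        raise ValueError("at least one vote is required")
    ranked = Counter(votes).most_common()
    # A tie between the two most frequent labels gives no clear majority.
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None
    return ranked[0][0]


# Ten made-up annotator judgments for one four-second fragment.
votes = ["engaged"] * 7 + ["disengaged"] * 3
reference = majority_label(votes)
print(reference)
```

Fragments without a clear majority return None here and could be set aside or reviewed again; that, too, is a design choice of this sketch rather than something reported in the article.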

Software recognizes and analyzes emotions in video sequences

And why all the effort? “You need these data collected by humans as reference data to later find out how well the computer is able to capture human emotions and mental states,” explains Olga Perepelkina. The psychologist is Chief Research Officer at Neurodata Lab and was involved in this German-Russian joint project together with Evdokia Kazimirova and Maria Konstantinova. All three scientists hold degrees in psychology from Lomonosov Moscow State University (MSU).

The real challenge in Affective Computing lies in the technical implementation of the automatic emotion and attention recognition itself. How do you get the computer to judge, on the basis of the video material, whether the person shown is an active listener or largely uninvolved? The scientists use the term pair ‘engagement – disengagement’ to describe the extent of mental involvement.

In recent years, several methods have become established in automatic attention recognition (Affective Computing) for capturing facial and gestural cues as well as postures. In simple terms, they examine lip or eye movements, facial expressions or the emotional colouring of spoken language (the “audio” factor).

More precisely, these are software tools that can, for example, automatically analyze the emotions of speakers and listeners in video sequences. Others are algorithms that calculate, from the movement of the lips, the probability that someone will begin to speak in the next moment. For face recognition alone, the researchers trained a neural network on the image data of more than 10,000 faces.

“Lips” and “audio” factors provide some good correlations

“We wanted to find out which combination of modalities is the most effective in automatic attention recognition,” says Fedotov. The scientist from Ulm statistically evaluated all possible pairs and triples of five different recognition modalities (eyes, lips, face, body and audio). The result: the pairing of “lips” and “audio” proved to be the most effective in relation to the effort involved.
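To give a sense of the size of that search, the following minimal sketch enumerates every pair and triple that can be formed from the five modalities named above. It only lists the candidate combinations; how each one was scored in the study is not described in the article, and the function name modality_subsets is made up for this example.

```python
from itertools import combinations

# The five recognition modalities named in the article.
MODALITIES = ("eyes", "lips", "face", "body", "audio")


def modality_subsets(modalities, sizes=(2, 3)):
    """List every combination of the given sizes, in input order."""
    return [combo for k in sizes for combo in combinations(modalities, k)]


subsets = modality_subsets(MODALITIES)
pairs = [s for s in subsets if len(s) == 2]
triples = [s for s in subsets if len(s) == 3]

print(len(pairs), len(triples), len(subsets))  # 10 10 20
print(("lips", "audio") in subsets)            # True
```

With five modalities there are 10 pairs and 10 triples, so 20 candidate combinations would have to be compared in total.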

A good 70 percent of all cases can be assigned correctly, a result that is really good for automated attention recognition (Affective Computing). “Both characteristics are directly related to the act of speaking.” The explanatory power of the “audio” factor alone was already considerable. In practice, this means that automatic attention recognition that focuses on the auditory characteristics of spoken language (voice quality, tone spectrum, voice energy, speech flow and pitch) is sufficient to reliably tell whether the listener is attentive. When the listener is silent, other features such as facial and body movements help to identify “engagement” or “disengagement”.

The study was presented in autumn 2018 at a major international conference (ICMI 2018) in Boulder, Colorado.