Interest in emotion recognition has been growing since the beginning of the 21st century and peaked around 2016. Around 2015-2016 a number of Russian and international startups and labs began using neural networks to build entertainment and research apps (Affectiva, Eyeris, Beyond Verbal Communications, etc.). We witnessed the rise of various ER apps and platforms that analyse your photo or video, detect emotions, and return the probability of each possible expression on your face.

But all these apps share a common problem: in most cases they rely on facial expressions alone, failing to capture the full range of users' emotional states. Cues such as body posture, gestures, gaze, voice, and heartbeat have to be considered across a much larger number of channels, so we cannot yet handle emotion recognition at the highest possible perception levels. That is why Neurodata Lab focuses on a multimodal approach: multichannel data collected, processed, and analyzed simultaneously, fully in sync.

Right now the market for emotion recognition technologies is experiencing rapid growth. According to various analyst estimates, it will reach $19 to $37 billion by 2021. Emotion Detection and Recognition Systems (EDRS) in particular are valued very optimistically. For instance, the research firm Markets&Markets estimates that the worldwide market reached $6.72 billion in 2016 and will grow to $36.06 billion by 2021, a compound annual growth rate of 39.9%.

The most attractive regions for investment remain Asia-Pacific, North America (the US and Canada), and the EU. The two most promising directions in EDRS development are facial microexpressions and biosensors embedded in portable devices. Last but not least come voice and speech recognition and eye tracking.

Neurodata Lab has created a multimodal dataset of play-acted affective dyadic interactions in Russian (RAMAS). It includes audio, video, skeleton data, electrodermal activity, and photoplethysmogram data recorded from 10 professional actors.

The actors were given scenarios covering six basic emotions: anger, happiness, fear, sadness, disgust, and surprise. All recordings were annotated to select fragments containing clearly expressed emotions.

Among the preliminary speech-signal analyses we performed is speaker-independent support vector machine (SVM) classification distinguishing two emotional states (happiness and sadness). The classification was run with both gender-dependent and gender-independent models. We achieved 93% accuracy for female voices, 92% for male voices, and 89% for mixed-gender data. The gender-dependent approach thus improved classification accuracy by 3% (male) and 4% (female) over the gender-independent one.
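To illustrate the setup, here is a minimal sketch of speaker-independent binary emotion classification with an SVM. It is an assumption-laden toy, not our pipeline: the RAMAS recordings and the actual feature extraction are not included, so randomly generated vectors stand in for acoustic descriptors (e.g. MFCC, pitch, and energy statistics), and "speaker independence" is enforced by holding out all clips of one speaker per test fold.

```python
# Toy sketch of speaker-independent SVM classification of two emotional
# states (happiness vs. sadness). Synthetic features stand in for real
# acoustic descriptors; speaker labels drive the cross-validation split.
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n_speakers, clips_per_speaker, n_features = 10, 20, 32

X, y, speakers = [], [], []
for spk in range(n_speakers):
    for _ in range(clips_per_speaker):
        label = int(rng.integers(0, 2))        # 0 = sadness, 1 = happiness
        # A class-dependent mean shift makes the toy task learnable.
        X.append(rng.normal(loc=label * 0.8, size=n_features))
        y.append(label)
        speakers.append(spk)
X, y, speakers = np.array(X), np.array(y), np.array(speakers)

# Speaker independence: each test fold contains all clips of exactly one
# speaker, so the model is never evaluated on a voice it has seen.
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))
scores = cross_val_score(clf, X, y, groups=speakers, cv=LeaveOneGroupOut())
print(f"mean accuracy over held-out speakers: {scores.mean():.2f}")
```

A gender-dependent variant, as in the experiment above, would simply fit two such pipelines on the male and female subsets separately.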

The RAMAS dataset can be used, alongside existing datasets, for automatic emotion recognition and for the analysis of gestures, poses, and facial expressions.