Multimodal Emotion Recognition Challenge

Abstract
Most studies of emotion recognition focus on single-channel recognition or on multimodal approaches that assume data is available for the whole dataset. In many practical cases, however, data sources can be missing, noisy, or broken. In this challenge we present the first machine learning competition on multimodal emotion recognition with missing data. The main goal of the challenge is to find a solution for reliable recognition of emotional behavior when some of the data is unavailable. We describe the problem, present an emotion dataset, and suggest a baseline solution for the given data. The baseline is based on naive decision-level data fusion via recurrent neural networks (Long Short-Term Memory, LSTM). We classify 4-second intervals, in which features for missing data are replaced with zeros. This naive approach achieves around 52.5% weighted accuracy on a 6-class problem (angry, sad, disgusted, happy, scared, and neutral). We also compare the performance of different sets of modalities.

Problem description
In recent years, human-computer interaction has received more and more attention from researchers. Today most speech processing applications can understand what is said, but information about how it is said is often also necessary. There are many ways to describe global and local properties of speech, and one of the most informative is emotion.

There is no universally accepted definition of human emotions. Most papers consider a set of basic states such as angry or scared [see, for example, 1]. Each emotion from this set can be expressed via voice intonation, gestures, facial expressions, and gaze. In spontaneous, natural behavior the intensity of these expressions varies, and in the most complex cases different modalities point at different emotions.

However, only two modalities can be considered primary for emotion recognition: most of the emotional information is usually carried by voice and facial expressions. Other modalities play only a complementary role in emotion expression.

Emotion expression is a dynamic process. In low-intensity cases, emotionality may be contained in just a few phonemes of a sentence or in facial microexpressions, while the rest appears neutral. Moreover, a human affective state is sometimes more complex than a single basic state.

These facts determine how video fragments should be annotated. A careful annotation procedure is needed to obtain a validly labeled dataset suitable for machine learning techniques.

Annotation
Valid labeling of a video fragment requires many annotators and flexible emotion expression boundaries. Each annotator marks the timestamps in a video at which an emotion is expressed; more than one emotion can be marked. The intensity threshold is chosen by each annotator subjectively. As a result, we obtain labels averaged over several annotators.

Instead of continuous annotation, sentence-level labeling is frequently used: videos are divided into semantically complete parts, and each annotator chooses the set of emotions expressed in each sentence. This type of labeling was used in the IEMOCAP emotion corpus [9], which is similar to the annotation used in the proposed challenge.

Solving the emotion recognition problem means predicting one or more emotion labels for a given video fragment, consistent with the annotation process. A review of emotion recognition from speech can be found, for example, in a 2012 survey [4]. There are also several publications focused on facial expressions [5, 6], eye gaze [7], and multimodal problem statements [8, 10].

Objective
In this challenge you need to predict exactly one emotion label for each frame of each video. The predictive model should be trained on precomputed features for 4 modalities: voice, body gestures, face, and eyes. The expected labels should be produced at a rate of 100 frames per second.

Data description
Our dataset consists of about 4 hours of video material. It is divided into 6 sessions, each with 2 actors playing different scenarios from social and corporate life, conversations between friends, etc. Each scenario has an average length of 40 seconds and was acted in several takes.

There are two actors in each take, and we collect features for each actor separately. The video stream, Kinect data, and audio feature sequences have different frame rates, which is why we provide several tables with feature sequences, one per modality. Eye gaze and face features are both extracted from video, but we treat them as separate modalities.

For audio data, two types of features were estimated:

  1. The Geneva Minimalistic Acoustic Parameter Set (GeMAPS) for Voice Research and Affective Computing (see open-access paper http://ieeexplore.ieee.org/document/7160715/ for more information);
  2. 13 MFCC coefficients, including the 0-th.

Features were obtained with the openSMILE feature extraction tool (http://audeering.com/technology/opensmile/) using the default config files for eGeMAPS and for MFCC. If an actor kept silent, the features were replaced with zeros.
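
As an illustration, a comparable set of 13 MFCCs (including the 0-th) can be computed with librosa at the 100 fps rate used for the audio features. This is only a sketch: the challenge features themselves come from openSMILE's default configs, so exact values will differ, and the file name below is hypothetical.

    import librosa

    # Minimal sketch: 13 MFCCs including the 0-th coefficient at ~100 frames per second.
    # Values will not match the openSMILE features exactly; this only illustrates the feature type.
    y, sr = librosa.load("take_001_actor_A.wav", sr=16000)  # hypothetical file name
    hop = sr // 100                                          # 10 ms hop -> 100 fps
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, hop_length=hop)
    print(mfcc.shape)  # (13, n_frames)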

There are 27 features extracted from the skeleton data. Motion feature extraction was performed using R 3.4.0 and EyesWeb XMI 5.7.0.0 (http://www.infomus.org/eyesweb_ita.php). Features were computed from the 3D coordinates of the body’s joints captured with Kinect v2. The movement features are as follows: distances between selected joints (hands, hip, spine, head); velocities, accelerations, and jerks of certain joints; smoothness, curvature, and density indexes; symmetry of the hand joints; and kinetic energy (for more information see, e.g., http://geniiz.com/wp-content/uploads/2016/08/D4.2_-_Implementation_of_EyesWeb_low-level_features.pdf).

Eye-based features consist of 6 numbers extracted from video: the X and Y axis offsets of the left and right eyes, and boolean flags for eye openness and fixation.

We extract facial (mimic) features from each video frame using the following pipeline: detect the face -> extract 4096 features from the VGG16 neural network [3] -> reduce to 100 features with PCA (principal component analysis) [2]. The number 100 was chosen experimentally so that the feature vector length is comparable with the other modalities while not losing too much information.
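
A minimal sketch of this pipeline, assuming face detection and cropping have already produced 224x224 RGB crops (the placeholder array below stands in for them) and fitting the PCA on the fly for illustration, whereas the organizers fit it on their own data:

    import numpy as np
    from tensorflow.keras.applications.vgg16 import VGG16, preprocess_input
    from tensorflow.keras.models import Model
    from sklearn.decomposition import PCA

    # Face crops -> VGG16 fc2 activations (4096-d) -> PCA to 100 features.
    base = VGG16(weights="imagenet", include_top=True)
    fc2 = Model(inputs=base.input, outputs=base.get_layer("fc2").output)  # 4096-d output

    face_crops = np.random.rand(128, 224, 224, 3).astype("float32") * 255  # placeholder frames
    feats_4096 = fc2.predict(preprocess_input(face_crops), verbose=0)

    pca = PCA(n_components=100)                # fit on real training frames in practice
    feats_100 = pca.fit_transform(feats_4096)  # (n_frames, 100)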

The frame rates differ across modalities: one second of video is described by 100 audio, 15 Kinect, and 50 face and eye feature vectors.
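
Under these rates, a fixed-length window maps to a different number of feature vectors per modality. A minimal sketch (the 4-second window length matches the baseline described below; the helper name is ours):

    # Frames per modality for a window of `seconds` length, given the stated rates.
    FPS = {"audio": 100, "kinect": 15, "face": 50, "eyes": 50}

    def window_lengths(seconds=4):
        """Number of feature vectors each modality contributes to one window."""
        return {modality: fps * seconds for modality, fps in FPS.items()}

    print(window_lengths())  # {'audio': 400, 'kinect': 60, 'face': 200, 'eyes': 200}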

The whole dataset is divided into train, public test, and private test sets in the following proportions: 2/1/1 for the first 5 sessions and 0/1/2 for the last session. This means that the test sets contain data on actors who are not present in the train set, which allows algorithms to be validated in a subject-independent setting.

For the train set, annotator agreement scores are also available. We estimate the score as the fraction of annotators who agree on the dominant emotion over the total number of labels assigned to the frame. For example, if the annotator labels are [angry], [angry], [angry], [angry, scared], [scared], [angry, disgusted], then the agreement score is 5 / 8 = 0.625. This helps us separate obvious emotions from more complex cases.
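
A minimal sketch of that computation, assuming per-annotator label lists as in the example (the function name is ours):

    from collections import Counter

    def agreement_score(annotations):
        """Fraction of annotators agreeing on the most frequent label,
        relative to the total number of labels assigned to the frame."""
        counts = Counter(label for labels in annotations for label in labels)
        return counts.most_common(1)[0][1] / sum(counts.values())

    labels = [["angry"], ["angry"], ["angry"], ["angry", "scared"], ["scared"], ["angry", "disgusted"]]
    print(agreement_score(labels))  # 5 / 8 = 0.625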

Missing Data
In the wild, some modalities may be unavailable. For example, gesture information is not available for close-up videos, and, more frequently, there is no voice while an actor is listening to the other. In some cases the data can be broken: the Kinect sensor does not always work correctly. To emulate more complex cases, we manually replace features with zeros for segments of 10 seconds on average in 40% of the files. In addition, about 5% of the files are missing entirely.

Thus the train, public test, and private test sets consist of 312, 143, and 126 takes, respectively. In total, about 24% of the data is replaced with zeros. The main goal of this challenge is to learn how to train machine learning models on multimodal streams with different frame rates and missing data.
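
One simple way to bring such takes to a uniform representation, and the one used by the baseline below, is to substitute zero arrays of the expected shape for absent or broken modalities. A minimal sketch, with dictionary keys and the helper name chosen by us (feature dimensions follow the data description: 36 audio, 27 Kinect, 100 face, and 6 eye features per frame, for a 4-second window):

    import numpy as np

    # Expected (frames, feature_dim) shape of one 4-second window per modality.
    EXPECTED = {"audio": (400, 36), "kinect": (60, 27), "face": (200, 100), "eyes": (200, 6)}

    def fill_missing(window):
        """Replace absent or malformed modality arrays with zeros of the expected shape.
        `window` maps modality names to feature arrays (or None when the stream is missing)."""
        filled = {}
        for modality, shape in EXPECTED.items():
            data = window.get(modality)
            if data is None or data.shape != shape:
                data = np.zeros(shape, dtype=np.float32)
            filled[modality] = np.nan_to_num(data)
        return filled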

To exclude the possibility of cheating, the raw data is not made available.

Baseline solution for multimodal emotion recognition
The main goal of the proposed solution is to show how to handle missing data in the multimodal emotion recognition problem. It also defines a lower bound on accuracy: a competing solution should exceed it to be considered effective.

To make a prediction, we consider each 4-second interval of a video and make one prediction for all frames in it. Frame-wise emotion labels for the whole video are then obtained by covering it with the predictions for these 4-second intervals. One could use flexible interval lengths, but there is no automatic way to segment short sentences or phrases out of a long video.
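
A minimal sketch of this covering step, expanding one class index per 4-second window into the 100 fps label track expected as output (variable and function names are ours):

    import numpy as np

    LABEL_FPS = 100   # expected output rate
    WINDOW_SEC = 4

    def cover_with_window_predictions(window_classes, total_frames):
        """Expand one class index per 4-second window into a frame-wise label track."""
        labels = np.zeros(total_frames, dtype=np.int64)
        frames_per_window = WINDOW_SEC * LABEL_FPS
        for i, cls in enumerate(window_classes):
            start = i * frames_per_window
            labels[start:start + frames_per_window] = cls
        return labels

    # e.g. three windows predicted as classes [3, 3, 5] cover 1200 output frames
    print(cover_with_window_predictions([3, 3, 5], total_frames=1200)[::400])  # [3 3 5]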

As the basic model we use a stacked bidirectional Long Short-Term Memory (LSTM) recurrent neural network to learn from feature sequences. The modalities are fused at the decision level. For each modality, a separate branch with 2 bidirectional LSTM layers of 200 nodes learns a high-level representation from the sequence of features; for example, for audio data with dimension 36 and 400 frames it produces a single vector of 200 features. After concatenating the features from the 4 branches, two stacked fully-connected dense layers with 100 and 6 nodes make the prediction for the input 4 seconds of data. Missing data is replaced with zeros in the hope that the neural network learns to discount it. The last layer has a softmax nonlinearity to estimate the probability of the input belonging to each of the emotion classes.

Weights are optimized with the RMSprop solver by minimizing the categorical cross-entropy loss.
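
A minimal Keras sketch of this baseline, under our own assumptions about details the text leaves open (how the two LSTM directions are merged so that a branch outputs 200 features, the hidden-layer activation, and the exact input shapes):

    from tensorflow.keras.layers import Input, Bidirectional, LSTM, Concatenate, Dense
    from tensorflow.keras.models import Model

    # (frames per 4-second window, feature dimension) per modality, following the
    # data description: 100/15/50/50 fps and 36/27/100/6 features.
    MODALITIES = {"audio": (400, 36), "kinect": (60, 27), "face": (200, 100), "eyes": (200, 6)}

    def branch(inp):
        """Two stacked bidirectional LSTM layers of 200 units.
        merge_mode='sum' is our assumption to keep the branch output 200-dimensional."""
        x = Bidirectional(LSTM(200, return_sequences=True), merge_mode="sum")(inp)
        return Bidirectional(LSTM(200), merge_mode="sum")(x)

    inputs, branches = [], []
    for name, (frames, dim) in MODALITIES.items():
        inp = Input(shape=(frames, dim), name=name)
        inputs.append(inp)
        branches.append(branch(inp))

    x = Concatenate()(branches)              # decision-level fusion of the 4 branches
    x = Dense(100, activation="relu")(x)     # activation is not specified in the text
    out = Dense(6, activation="softmax")(x)  # 6 emotion classes

    model = Model(inputs=inputs, outputs=out)
    model.compile(optimizer="rmsprop", loss="categorical_crossentropy", metrics=["accuracy"])
    model.summary()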

Trained on the train set, the proposed baseline approach achieves 52.5% weighted accuracy on the public test set. In other words, for each frame the model chooses the right emotion from the set of angry, sad, disgusted, happy, scared, and neutral with 52.5% probability.

If we consider the modalities separately, classification performance is lower: using audio, face, body-motion, and eye features alone we obtain 32%, 48%, 36%, and 24% weighted accuracy, respectively. As expected, the multimodal approach outperforms models that use only one channel.

Submissions
The submission process will be held in two stages.

In the first stage, you need to submit predictions for the public test set. The scores will form a leaderboard, and the top 50 solutions that score above the baseline will be admitted to the second stage.

In the second stage, we will publish a private test set, and the final leaderboard will be formed by scores based on this set.

In addition to the final predictions (the submission), you also need to submit the code used to make the predictions, instructions to run it, and a report describing your solution. The report should cover model and feature selection as well as missing-data handling. To prevent overfitting to the private test set, we require that your public leaderboard score match the score reproduced with the submitted model; the maximum allowed absolute difference between the leaderboard and reproduced scores is 0.01. All solutions with mismatched scores will be discarded, and the users or teams who submitted them will be disqualified.

Submission templates will be available together with the train and public test sets. A template contains .csv files with timestamps and zero-filled fields for the emotions to predict. As in the train set, you should place exactly one label “1” in each row of the files in prediction/*.csv and send the .zip archive via the submission form on our site.
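
A minimal sketch of filling one template file, under our guess at the schema (a timestamp column followed by one zero-filled column per emotion; the real format is defined by the template files themselves, and the file path is hypothetical):

    import pandas as pd

    EMOTIONS = ["angry", "sad", "disgusted", "happy", "scared", "neutral"]

    template = pd.read_csv("prediction/take_001.csv")  # hypothetical template file
    predicted = ["neutral"] * len(template)            # one emotion name per row, e.g. from the baseline

    for emotion in EMOTIONS:
        template[emotion] = [1 if p == emotion else 0 for p in predicted]

    template.to_csv("prediction/take_001.csv", index=False)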

Evaluation metric
To measure the performance of submitted solutions, we use a weighted accuracy metric estimated on all video frames for which the agreement score is over 0.6. Submissions should be in the same format as the train label files, and a template for each video will also be available.
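
A minimal sketch of such an evaluation, assuming frame-wise true and predicted class arrays plus per-frame agreement scores, and reading "weighted accuracy" as class-balanced accuracy (our interpretation; the organizers' exact weighting may differ):

    import numpy as np
    from sklearn.metrics import balanced_accuracy_score

    def weighted_accuracy(y_true, y_pred, agreement, threshold=0.6):
        """Accuracy averaged over classes, computed only on frames whose
        annotator agreement score exceeds the threshold."""
        mask = np.asarray(agreement) > threshold
        return balanced_accuracy_score(np.asarray(y_true)[mask], np.asarray(y_pred)[mask])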

Good luck!

References

  1. P. Ekman, “Basic Emotions”, in T. Dalgleish and M. Power (eds.), Handbook of Cognition and Emotion, Wiley, 1999
  2. C. M. Bishop, “Pattern Recognition and Machine Learning”, Information Science and Statistics, Springer-Verlag New York, 2006
  3. K. Simonyan, A. Zisserman, “Very Deep Convolutional Networks for Large-Scale Image Recognition”, https://arxiv.org/abs/1409.1556
  4. S. G. Koolagudi, K. S. Rao, “Emotion Recognition from Speech: A Review”, International Journal of Speech Technology, June 2012, Volume 15
  5. A. Ruiz-Garcia, et al., “Deep Learning for Emotion Recognition in Faces”, Artificial Neural Networks and Machine Learning – ICANN 2016
  6. G. U. Kharat, S. V. Dudul, “Emotion Recognition from Facial Expression Using Neural Networks”, Human-Computer Systems Interaction, Advances in Intelligent and Soft Computing, vol. 60, Springer, Berlin, Heidelberg
  7. M. W. Schurgin, et al., “Eye Movements during Emotion Recognition in Faces”, Journal of Vision (2014) 14(13):14
  8. G. Caridakis, et al., “Multimodal Emotion Recognition from Expressive Faces, Body Gestures and Speech”, IFIP The International Federation for Information Processing, vol. 247, Springer, Boston, MA
  9. C. Busso, M. Bulut, C. C. Lee, A. Kazemzadeh, E. Mower, S. Kim, J. N. Chang, S. Lee, and S. S. Narayanan, “IEMOCAP: Interactive Emotional Dyadic Motion Capture Database”, Journal of Language Resources and Evaluation, vol. 42, no. 4, pp. 335-359, December 2008
  10. M. Soleymani, et al., “Multimodal Emotion Recognition in Response to Videos”, IEEE Transactions on Affective Computing, vol. 3, no. 2, 2012

This overview is the exclusive and proprietary property of Neurodata Lab LLC. Any unauthorized use or reproduction of this information without the prior written consent of Neurodata Lab is strictly prohibited. (c)2017, Neurodata Lab LLC. All rights reserved.