Machine learning models that can recognize and predict human emotions have become increasingly popular over the past few years. In order for most of these techniques to perform well, however, the data used to train them is first annotated by human subjects. Moreover, emotions continuously change over time, which makes the annotation of videos or voice recordings particularly challenging, often resulting in discrepancies between labels and recordings.

To address this limitation, researchers at the University of Michigan have recently developed a new convolutional neural network that can simultaneously align and predict emotion annotations in an end-to-end fashion. They presented their technique, called a multi-delay sync (MDS) network, in a paper published in IEEE Transactions on Affective Computing.

“Emotion varies continuously in time; it ebbs and flows in our conversations” Emily Mower Provost, one of the researchers who carried out the study, told TechXplore. “In engineering, we often use continuous descriptions of emotion to measure how emotion varies. Our goal then becomes to predict these continuous measures from speech. But there is a catch. One of the biggest challenges in working with continuous descriptions of emotion is that it requires that we have labels that continuously vary in time. This is done by teams of human annotators. However, people aren’t machines.”

As Mower Provost goes on to explain, human annotators can sometimes be more attuned to particular emotional cues (e.g., laughter), but miss the meaning behind other cues (e.g., an exasperated sigh). In addition to this, humans can take some time to process a recording, and thus, their reactions to emotional cues is sometimes delayed. As a result, continuous emotion labels can present a lot of variation and are sometimes misaligned with speech in the data.

Read the full article by clicking on the title link.