Non-acted multi-view audio-visual dyadic interactions. Project non-verbal emotion recognition in dyadic scenarios and speaker segmentation

Lázaro Herrasti, Pablo

Please use this identifier to cite or link to this item: http://hdl.handle.net/2445/159557

Title:	Non-acted multi-view audio-visual dyadic interactions. Project non-verbal emotion recognition in dyadic scenarios and speaker segmentation
Author:	Lázaro Herrasti, Pablo
Director/Tutor:	Escalera Guerrero, Sergio; Palmero, Cristina
Keywords:	Emotions; Machine learning; Master's theses
Issue Date:	2-Sep-2019
Abstract:	[en] This Master's thesis focuses on the development of a baseline emotion recognition system in a dyadic environment, using raw and handcrafted audio features and cropped faces from the videos. The system is analyzed at frame and utterance level without temporal information. In addition, a baseline speaker segmentation system has been developed to facilitate the annotation task. To this end, an exhaustive study of the state of the art in emotion recognition and speaker segmentation techniques has been conducted, paying particular attention to Deep Learning techniques for emotion recognition and clustering for speaker segmentation. Alongside this theoretical study, a dataset consisting of videos of sessions of dyadic interactions between individuals in different scenarios has been recorded. Different attributes were captured and labelled from these videos: body pose, hand pose, emotion, age, gender, etc. Once the architectures for emotion recognition have been trained on another dataset, a proof of concept is carried out with this new database in order to draw conclusions. In addition, this database can help future systems achieve better results. A large number of experiments with audio and video are performed to create the emotion recognition system. The IEMOCAP database is used for the training and evaluation experiments of the emotion recognition system. Once the audio and video models have been trained separately with two different architectures, the two are fused. In this work, the importance of data preprocessing (face detection, analysis window length, handcrafted features, etc.) and of choosing the correct parameters for the architectures (network depth, fusion, etc.) has been studied and demonstrated. The experiments for the speaker segmentation system, in turn, are performed with a piece of audio from the IEMOCAP database.
In this work, the preprocessing steps, the problems of an unsupervised approach such as clustering, and the feature representation are studied and discussed. Finally, the conclusions drawn throughout this work are presented, together with possible lines of future work, including new systems for emotion recognition and experiments with the database recorded in this work.
Note:	Master's final projects, Master in Fundamentals of Data Science, Faculty of Mathematics, Universitat de Barcelona, Year: 2019, Tutors: Sergio Escalera Guerrero and Cristina Palmero
URI:	http://hdl.handle.net/2445/159557
Appears in Collections:	Software - Student works; Official Master's - Fundamentals of Data Science

Files in This Item:

File	Description	Size	Format
159557.pdf	Thesis report	18.14 MB	Adobe PDF
codi_font.zip		1.99 MB	zip

This item is licensed under a Creative Commons License