Cross-Corpus Speech Emotion Recognition Using HuBERT Model, Speaker Embeddings, and Prosodic Features

Document Type: Original Article

Authors

1 Computer Engineering Department, K. N. Toosi University of Technology

2 Faculty of Computer Engineering, K. N. Toosi University of Technology

Abstract

This study investigates the challenges and methodologies of cross-corpus speech emotion recognition (CCSER), focusing on how well speech features generalize across different languages, speakers, and emotional contexts. We propose a novel SER system that combines the transformer blocks of the HuBERT model with speaker embeddings and prosodic features to improve feature extraction for emotion classification across datasets. Our approach addresses dataset variability through transfer learning, in particular unsupervised methods that adapt feature distributions without requiring labeled data from the target domain. Specifically, our transfer learning strategy uses a clustering method to select the most appropriate trained model for transfer from the source domain to the target domain. We evaluate the proposed model on several datasets, with IEMOCAP as the source domain, and extend the validation to emotional speech datasets in other languages, demonstrating the adaptability of the system. The results indicate significant improvements in emotion recognition accuracy over traditional methods, highlighting the effectiveness of integrating advanced self-supervised learning models and transfer learning strategies in CCSER tasks.

Keywords