Paper Key: IRJ************336
Authors: Mahesh Madhukar Thorat, Kunal Ramesh Barthune, Dr. Mahender Kondekar
Date Published: 03 Apr 2025
Abstract
Speech Emotion Recognition (SER) is a crucial aspect of human-computer interaction, enhancing applications in virtual assistants, telemedicine, and mental health monitoring. This study develops a hybrid CNN-LSTM model for detecting emotions from speech, leveraging advanced feature extraction techniques and data augmentation. By integrating the RAVDESS, CREMA-D, TESS, and SAVEE datasets, the model achieves over 90% accuracy, surpassing traditional classifiers such as SVM and Random Forest. Feature engineering using MFCCs, chroma features, zero-crossing rate, and spectral contrast significantly enhances classification performance. Data augmentation techniques, including noise injection, pitch shifting, and time stretching, improve robustness, raising accuracy from 81.2% to 87.6%. However, challenges remain in cross-dataset generalization, necessitating domain adaptation techniques. Future research should focus on multimodal emotion recognition that integrates facial expressions and physiological signals, and should explore Transformer-based architectures and federated learning for secure, scalable SER applications. This study contributes to advancing affective computing and real-world emotion-aware AI systems.
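The abstract names four acoustic features (MFCCs, chroma, zero-crossing rate, spectral contrast) but does not publish implementation details. A minimal sketch of how such a feature vector is commonly built with librosa follows; the helper name, the 40-coefficient MFCC setting, and the mean-pooling step are assumptions for illustration, not the authors' code.

```python
import numpy as np
import librosa

def extract_features(path, sr=22050):
    """Hypothetical helper: load a clip and pool each feature over time
    into a single fixed-length vector suitable for a classifier."""
    y, sr = librosa.load(path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=40)        # spectral envelope
    chroma = librosa.feature.chroma_stft(y=y, sr=sr)          # pitch-class energy
    zcr = librosa.feature.zero_crossing_rate(y)               # noisiness proxy
    contrast = librosa.feature.spectral_contrast(y=y, sr=sr)  # peak/valley spread
    # Mean-pool each (feature x frames) matrix over the time axis, then concatenate.
    return np.concatenate([f.mean(axis=1) for f in (mfcc, chroma, zcr, contrast)])
```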
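The three augmentations credited with the 81.2% to 87.6% accuracy gain (noise injection, pitch shifting, time stretching) can likewise be sketched with librosa and NumPy; the noise factor, semitone count, and stretch rate below are illustrative defaults, not values reported by the paper.

```python
import numpy as np
import librosa

def add_noise(y, noise_factor=0.005):
    """Inject low-amplitude Gaussian noise into the waveform."""
    return y + noise_factor * np.random.normal(size=y.shape)

def shift_pitch(y, sr, n_steps=2):
    """Shift pitch by n_steps semitones without changing duration."""
    return librosa.effects.pitch_shift(y=y, sr=sr, n_steps=n_steps)

def stretch_time(y, rate=1.1):
    """Stretch (rate < 1) or compress (rate > 1) duration without changing pitch."""
    return librosa.effects.time_stretch(y=y, rate=rate)
```

Each transform is typically applied to every training clip, multiplying the effective dataset size while leaving the emotion label unchanged.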
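The abstract describes the classifier only as a "hybrid CNN-LSTM"; a plausible Keras realization of that pattern is sketched below, with convolutional layers learning local spectral patterns and an LSTM modeling their temporal evolution. Every layer size, the input shape, and the eight-class output are assumptions (eight matches the RAVDESS emotion set), not the authors' reported architecture.

```python
from tensorflow.keras import layers, models

n_frames, n_features = 216, 60  # assumed input shape: (time steps, features per step)

model = models.Sequential([
    layers.Input(shape=(n_frames, n_features)),
    layers.Conv1D(64, kernel_size=5, padding="same", activation="relu"),   # local patterns
    layers.MaxPooling1D(pool_size=2),
    layers.Conv1D(128, kernel_size=5, padding="same", activation="relu"),
    layers.MaxPooling1D(pool_size=2),
    layers.LSTM(128),                                   # temporal dynamics of CNN features
    layers.Dropout(0.3),
    layers.Dense(64, activation="relu"),
    layers.Dense(8, activation="softmax"),              # e.g. 8 emotion classes (RAVDESS)
])
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
```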
DOI Requested