Article Information
Abstract (English)
Automated speech emotion recognition (SER) through efficient long-term temporal context modeling is a challenging task in the digital audio signal processing domain. Conventionally, a recurrent neural network (RNN) is employed to capture the temporal dependencies in a sequence and to investigate the relationships among sequences and features. In this study, we design a parallel convolutional neural network (PCNN) for SER that combines a squeeze-and-excitation network (SEnet) with a self-attention module. Additionally, we adopt a residual learning strategy in both modules, SEnet and self-attention, which further improves the performance of the network. Our proposed SER system takes a speech spectrogram as input and extracts utterance-level discrete features using the PCNN model. We experimentally evaluated the proposed system on a standard speech corpus, the interactive emotional dyadic motion capture (IEMOCAP) database. The prediction results demonstrate the significance and robustness of the proposed PCNN system, which achieved a high recognition rate of 72.01%, surpassing state-of-the-art (SOTA) methods.
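The abstract does not give implementation details of the SEnet module or its residual connection, so the following is only a minimal NumPy sketch of a standard squeeze-and-excitation gate with a residual shortcut, in the spirit of what the paper describes. All names (`se_gate`, `w1`, `w2`) and shapes are illustrative assumptions, not the authors' code.

```python
import numpy as np

def se_gate(feature_map, w1, w2):
    """Hypothetical squeeze-and-excitation gate with a residual connection.

    feature_map: (C, F, T) spectrogram feature map (channels, freq bins, frames)
    w1: (C, C // r) bottleneck weights; w2: (C // r, C) expansion weights.
    """
    # Squeeze: global average pooling over frequency and time -> (C,)
    z = feature_map.mean(axis=(1, 2))
    # Excitation: bottleneck MLP, ReLU then sigmoid, giving per-channel weights in (0, 1)
    h = np.maximum(z @ w1, 0.0)
    s = 1.0 / (1.0 + np.exp(-(h @ w2)))
    # Recalibrate each channel, then add the residual shortcut
    return feature_map * s[:, None, None] + feature_map

# Toy usage: 8 channels, reduction ratio r = 2
rng = np.random.default_rng(0)
x = np.abs(rng.standard_normal((8, 16, 32)))          # stand-in spectrogram features
w1 = rng.standard_normal((8, 4))
w2 = rng.standard_normal((4, 8))
y = se_gate(x, w1, w2)                                # same shape as x
```

Because the sigmoid gate lies in (0, 1) and the input is added back, each output element stays between one and two times the corresponding (non-negative) input, so the block reweights channels without discarding them.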
Table of Contents
I. INTRODUCTION
II. PROPOSED PCNN-BASED SER SYSTEM
A. SEnet Module
B. Self-Attention
III. RESULTS & DISCUSSION
IV. CONCLUSION & FUTURE DIRECTION
ACKNOWLEDGEMENT
REFERENCES