Speech Enhancement Method Based on Modified Encoder-Decoder Pyramid Transformer
https://doi.org/10.15514/ISPRAS-2022-34(4)-10
Abstract
The development of new voice communication technologies has created a need for improved speech enhancement methods. Modern users of information systems place high demands on both the intelligibility of the voice signal and its perceptual quality. In this work we propose a new approach to the problem of speech enhancement. For this purpose, a modified pyramidal transformer neural network with an encoder-decoder structure was developed. The encoder compresses the spectrum of the voice signal into a pyramidal series of internal embeddings. The decoder, built on self-attention transformations, reconstructs the mask of the complex ratio of the clean and noisy signals from the embeddings computed by the encoder. Two possible loss functions were considered for training the proposed neural network model. It was shown that mixing a frequency encoding into the input data improved the performance of the proposed approach. The neural network was trained and tested on the DNS Challenge 2021 dataset and showed high performance compared to modern speech enhancement methods. We provide a qualitative analysis of the training process of the implemented neural network: the network gradually moved from simple noise masking in the early training epochs to restoring the missing formant components of the speaker's voice in later epochs. This resulted in high performance metrics and high subjective quality of the enhanced speech.
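The complex ratio mask (cRM) target mentioned in the abstract (see Williamson et al. [12]) can be illustrated with a minimal numpy sketch. This is not the authors' implementation; the function names, shapes, and the small regularizing `eps` are assumptions for illustration only.

```python
import numpy as np

def complex_ratio_mask(clean_stft, noisy_stft, eps=1e-8):
    # cRM: element-wise complex ratio of the clean to the noisy STFT bins.
    # eps is an assumed small constant to avoid division by zero.
    return clean_stft / (noisy_stft + eps)

def apply_mask(mask, noisy_stft):
    # Multiplying the noisy spectrum by the cRM recovers the clean spectrum.
    return mask * noisy_stft

# Toy example with random "clean" and additive "noise" complex spectra
# (257 frequency bins x 100 frames, typical for a 512-point STFT).
rng = np.random.default_rng(0)
clean = rng.normal(size=(257, 100)) + 1j * rng.normal(size=(257, 100))
noise = 0.3 * (rng.normal(size=(257, 100)) + 1j * rng.normal(size=(257, 100)))
noisy = clean + noise

mask = complex_ratio_mask(clean, noisy)
restored = apply_mask(mask, noisy)
print(np.allclose(restored, clean, atol=1e-5))
```

In the paper's setting the network does not see the clean signal, of course; it is trained to predict this mask from the noisy spectrum alone, and the mask is applied at inference time.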
About the Authors
Andrey Aleksandrovich LEPENDIN
Russian Federation
Candidate of Sciences in Physics and Mathematics, Associate Professor at the Department of Information Security of the Institute of Digital Technology, Electronics and Physics
Rauf Salavatovich NASRETDINOV
Russian Federation
Postgraduate student at the Department of Information Security of the Institute of Digital Technology, Electronics and Physics
Ilya Dmitrievich ILYASHENKO
Russian Federation
Postgraduate student at the Department of Information Security of the Institute of Digital Technology, Electronics and Physics
References
1. Loizou P. Speech Enhancement. Theory and Practice, 2nd Edition. CRC Press, 2017. 711 p.
2. Reddy C., Dubey H., Koishida K. Interspeech 2021 Deep Noise Suppression Challenge. arXiv preprint arXiv: 2101.01902, 2021, 5 p.
3. Dubey H., Gopal V., Cutler R. et al. ICASSP 2022 Deep Noise Suppression Challenge. arXiv preprint arXiv: 2202.13288, 2022, 5 p.
4. Borgström B.J., Brandstein M.S., Dunn R.B. Improving Statistical Model-Based Speech Enhancement with Deep Neural Networks. In Proc. of the 16th International Workshop on Acoustic Signal Enhancement (IWAENC), 2018, pp. 471-475.
5. Zheng C., Liu W. et al. Low-latency Monaural Speech Enhancement with Deep Filter-bank Equalizer. The Journal of the Acoustical Society of America, vol. 151, issue 5, 2021, article no. 3291.
6. Isik U., Giri R. et al. PoCoNet: better speech enhancement with frequency-positional embeddings, semi-supervised conversational data, and biased loss. In Proc. of the INTERSPEECH 2020, 2020, pp. 2487-2491.
7. Hao X., Su X. et al. FullSubNet: a full-band and sub-band fusion model for real-time single-channel speech enhancement. In Proc. of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2021, pp. 6633-6637.
8. Xu R., Wu R. et al. Listening to sounds of silence for speech denoising. In Proc. of the Conference on Neural Information Processing Systems (NeurIPS 2020), 2020, pp. 9633-9648.
9. Zheng C., Peng X., Zhang Y. Interactive Speech and Noise Modeling for Speech Enhancement. arXiv preprint arXiv: 2012.09408, 2020, 9 p.
10. Luo Y., Mesgarani N. Conv-TasNet: surpassing ideal time-frequency magnitude masking for speech separation. IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 27, issue 8, 2019, pp. 1256-1266.
11. Liu Y., Zhang H. et al. Supervised speech enhancement with real spectrum approximation. In Proc. of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2019, pp. 5746-5750.
12. Williamson D.S., Wang Y., Wang D. Complex ratio masking for monaural speech separation. IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 24, issue 3, 2016, pp. 483-492.
13. Tan K., Wang D. A convolutional recurrent neural network for real-time speech enhancement. In Proc. of the INTERSPEECH 2018, 2018, pp. 3229-3233.
14. Tallec C., Ollivier Y. Can recurrent neural networks warp time? In Proc. of the International Conference on Learning Representation, 2018, pp. 1-13.
15. Vaswani A., Shazeer N. et al. Attention is all you need. In Proc. of the Conference on Neural Information Processing Systems (NIPS 2017), 2017, 11 p.
16. Wang W., Xie E. et al. Pyramid vision transformer: A versatile backbone for dense prediction without convolutions. In Proc. of the IEEE/CVF International Conference on Computer Vision (ICCV), 2021, pp. 548-558.
17. Cao H., Wang Y. et al. Swin-Unet: Unet-like pure transformer for medical image segmentation. arXiv preprint arXiv: 2105.05537, 2021, 14 p. DOI: 10.48550/arXiv.2105.05537
18. Wei K., Guo P., Jiang N. Improving Transformer-based Conversational ASR by Inter-Sentential Attention Mechanism. arXiv preprint arXiv: 2207.00883, 2022, 5 p.
19. Wang Y., Shen G. et al. Learning Mutual Correlation in Multimodal Transformer for Speech Emotion Recognition. In Proc. of the INTERSPEECH 2021, 2021, pp. 4518-4522.
20. Wu C., Xiu Z. et al. Transformer-based acoustic modeling for streaming speech synthesis. In Proc. of the INTERSPEECH 2021, 2021, pp. 146-150.
21. Ronneberger O., Fischer P., Brox T. U-Net: convolutional networks for biomedical image segmentation. Lecture Notes in Computer Science, vol. 9351, 2015, pp. 234-241.
22. Lin T., Dollár P., Girshick R. Feature Pyramid Networks for Object Detection. In Proc. of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 936-944.
23. Roux J.L., Wisdom S. et al. SDR – Half-baked or well done? In Proc. of the IEEE International Conference on Acoustics, Speech and Signal Processing, 2019, pp. 626-630.
24. Naderi B., Cutler R. An opensource implementation of ITU-T recommendation P.808 with validation. In Proc. of the INTERSPEECH 2020, 2020, pp. 2862-2866.
25. Taal C.H., Hendriks R.C. et al. A short-time objective intelligibility measure for time-frequency weighted noisy speech. In Proc. of the IEEE International Conference on Acoustics, Speech and Signal Processing, 2010, pp. 4214-4217.
26. Nasretdinov R., Ilyashenko I., Lependin A. Two-Stage Method of Speech Denoising by Long Short-Term Memory Neural Network. Communications in Computer and Information Science, vol 1526, 2021, pp. 86-97.
For citations:
LEPENDIN A.A., NASRETDINOV R.S., ILYASHENKO I.D. Speech Enhancement Method Based on Modified Encoder-Decoder Pyramid Transformer. Proceedings of the Institute for System Programming of the RAS (Proceedings of ISP RAS). 2022;34(4):135-152. (In Russ.) https://doi.org/10.15514/ISPRAS-2022-34(4)-10