Deep Learning for an Automatic Transcription System Development
https://doi.org/10.15514/ISPRAS-2025-37(1)-9
Abstract
This paper presents a deep neural network architecture for automatic phoneme recognition in speech signals. The proposed model combines convolutional and recurrent layers with an attention mechanism enriched with reference values of vowel formant frequencies, which allows it to extract both the local and the global acoustic features needed for accurate phoneme sequence recognition. Particular attention is paid to the problem of imbalanced phoneme frequencies in the training dataset and to ways of mitigating it, such as data augmentation and a weighted loss function. The reported results demonstrate the viability of the proposed approach, but also indicate that further refinement of the model is needed to achieve higher accuracy and recall in the speech recognition task.
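The weighted loss function mentioned in the abstract for handling imbalanced phoneme frequencies can be illustrated with the class-balanced weighting of Cui et al. [19], where each class weight is proportional to (1 − β) / (1 − β^n_c) for a class with n_c examples. The sketch below is a minimal illustration, not the paper's implementation; the phoneme labels and the β value are hypothetical.

```python
from collections import Counter

def class_balanced_weights(labels, beta=0.999):
    """Per-class weights w_c = (1 - beta) / (1 - beta**n_c),
    normalized to sum to the number of classes (Cui et al., 2019).
    Rare classes end up with larger weights than frequent ones."""
    counts = Counter(labels)
    raw = {c: (1.0 - beta) / (1.0 - beta ** n) for c, n in counts.items()}
    scale = len(raw) / sum(raw.values())
    return {c: w * scale for c, w in raw.items()}

# Hypothetical imbalanced phoneme label stream: 'a' frequent, 'y' rare.
labels = ['a'] * 900 + ['o'] * 80 + ['y'] * 20
w = class_balanced_weights(labels)
```

These weights would then be passed to a per-class weighted cross-entropy during training, so that errors on rare phonemes contribute more to the loss.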
About the Author
Oksana Vladimirovna GONCHAROVA, Russian Federation
Cand. Sci. (Philology), Associate Professor; Head of the Scientific and Educational Center "Intellectual Data Analysis" at Pyatigorsk State University; Associate Professor of the Department of Russian Language and Teaching Methods at Patrice Lumumba Peoples' Friendship University of Russia; since 2024, Senior Researcher at the Laboratory of Linguistic Platforms of the Ivannikov Institute for System Programming of the Russian Academy of Sciences (technical support for research work). Research interests: acoustic phonetics, prosody, sociolinguistics, natural language processing.
References
1. Shorten C., Khoshgoftaar T. M. A survey on Image Data Augmentation for Deep Learning // Journal of Big Data, 6(1):60, 2019. Available at: https://www.researchgate.net/publication/334279066_A_survey_on_Image_Data_Augmentation_for_Deep_Learning (accessed 23.01.2025).
2. Cucchiarini C. Phonetic transcription: a methodological and empirical study. Ph.D. thesis, University of Nijmegen, Nijmegen, The Netherlands, 1993. Available at: https://repository.ubn.ru.nl/bitstream/handle/2066/145701/mmubn000001_170795853.pdf (accessed 23.01.2025).
3. Kisler T., Schiel F., Sloetjes H. Signal processing via web services: the use case WebMAUS // Digital Humanities Conference 2012. 2012. pp. 30-34. Available at: https://www.researchgate.net/publication/248390251_Signal_processing_via_web_services_the_use_case_WebMAUS (accessed 23.01.2025).
4. McAuliffe M., Socolof M., Mihuc S., Wagner M., Sonderegger M. Montreal Forced Aligner: Trainable text-speech alignment using Kaldi // Proc. Interspeech 2017. 2017. pp. 498-502.
5. Rosenfelder I., Fruehwald J., Evanini K., Yuan J. FAVE (Forced Alignment and Vowel Extraction) program suite. 2011. Available at: http://fave.ling.upenn.edu (accessed 23.01.2025).
6. Povey D., Ghoshal A., Boulianne G., Burget L., Glembek O., Goel N., Hannemann M., Motlíček P., Qian Y., Schwarz P., Silovský J., Stemmer G., Veselý K. The Kaldi Speech Recognition Toolkit // IEEE 2011 Workshop on Automatic Speech Recognition and Understanding. 2011. Available at: https://www.danielpovey.com/files/2011_asru_kaldi.pdf (accessed 23.01.2025).
7. Young S., Evermann G., Kershaw D., Moore G., Odell J., Ollason D., Povey D., Valtchev V., Woodland P. The HTK Book // Cambridge University Engineering Department, vol. 3, no. 175, p. 12. 2002. Available at: https://www.danielpovey.com/files/htkbook.pdf (accessed 23.01.2025).
8. Fromont R., Hay J. LaBB-CAT: an Annotation Store // Proceedings of the Australasian Language Technology Association Workshop 2012. 2012. pp. 113-117. Available at: https://aclanthology.org/U12-1015.pdf (accessed 23.01.2025).
9. Reichel U. PermA and Balloon: Tools for string alignment and text processing // Proc. Interspeech 2012, paper no. 346. 2012. doi: 10.21437/Interspeech.2012-509.
10. Teytaut Y., Roebel A. Phoneme-to-Audio Alignment with Recurrent Neural Networks for Speaking and Singing Voice // Proc. Interspeech 2021, Brno, Czech Republic. 2021. pp. 61-65. doi: 10.21437/interspeech.2021-1676. Available at: https://hal.science/hal-03552964/file/1676anav.pdf (accessed 23.01.2025).
11. Goncharova O.V. Articulatory-acoustic characteristics of unstressed and stressed vowels in place of orthographic 'a' in the speech of speakers of different phonetic variants of Russian // Philology. Theory & Practice. 2024. Vol. 17, issue 5. pp. 1661-1668. (In Russ.) Available at: https://philology-journal.ru/article/phil20240240/fulltext (accessed 23.01.2025).
12. Website: https://lingvodoc.ispras.ru/dictionaries_all (accessed 23.01.2025).
13. Boersma P., Weenink D. Praat: Doing phonetics by computer. 2024. Available at: https://www.fon.hum.uva.nl/praat/ (accessed 23.01.2025).
14. Website: https://github.com/brainteaser-ov/textgrid (accessed 23.01.2025).
15. Graves A., Mohamed A., Hinton G. Speech recognition with deep recurrent neural networks // International Conference on Acoustics, Speech and Signal Processing. 2013. pp. 6645-6649. Available at: https://arxiv.org/abs/1303.5778 (accessed 23.01.2025).
16. Hinton G., Deng L., Yu D., Dahl G. E., Mohamed A., Jaitly N., Kingsbury B. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups // IEEE Signal Processing Magazine, 29(6). 2012. pp. 82-97. Available at: https://www.cs.toronto.edu/~hinton/absps/DNN-2012-proof.pdf (accessed 23.01.2025).
17. Vaswani A., Shazeer N., Parmar N., Uszkoreit J., Jones L., Gomez A. N., Polosukhin I. Attention is all you need // Advances in Neural Information Processing Systems. 2017. pp. 5998-6008. Available at: https://arxiv.org/abs/1706.03762 (accessed 23.01.2025).
18. Devlin J., Chang M. W., Lee K., Toutanova K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding // Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). 2019. pp. 4171-4186. Available at: https://aclanthology.org/N19-1423.pdf (accessed 23.01.2025).
19. Cui Y., Jia M., Lin T. Y., Song Y., Belongie S. Class-balanced loss based on effective number of samples // Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2019. pp. 9268-9277. Available at: https://arxiv.org/abs/1901.05555 (accessed 23.01.2025).
20. Park D. S., Chan W., Zhang Y., Chiu C. C., Zoph B., Cubuk E. D., Le Q. V. SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition // Proc. Interspeech 2019. 2019. pp. 2613-2617. Available at: https://arxiv.org/abs/1904.08779 (accessed 23.01.2025).
21. Sainath T. N., Weiss R. J., Senior A., Wilson K. W., Vinyals O. Learning the speech front-end with raw waveform CLDNNs // Proc. Interspeech 2015. 2015. pp. 1-5. Available at: https://www.ee.columbia.edu/~ronw/pubs/interspeech2015-waveform_cldnn.pdf (accessed 23.01.2025).
22. Graves A., Schmidhuber J. Framewise phoneme classification with bidirectional LSTM and other neural network architectures // Neural Networks, Volume 18, Issues 5-6. 2005. pp. 602-610. doi: 10.1016/j.neunet.2005.06.042.
23. Bahdanau D., Cho K., Bengio Y. Neural Machine Translation by Jointly Learning to Align and Translate // Proc. ICLR 2015. 2015. Available at: https://arxiv.org/abs/1409.0473 (accessed 23.01.2025).
24. Chawla N. V., Bowyer K. W., Hall L. O., Kegelmeyer W. P. SMOTE: Synthetic Minority Over-sampling Technique // Journal of Artificial Intelligence Research, 16. 2002. pp. 321-357. Available at: http://dx.doi.org/10.1613/jair.953 (accessed 23.01.2025).
25. Toshniwal S., Tang H., Lu L., Livescu K. Multitask learning with low-level auxiliary tasks for encoder-decoder based speech recognition // Proc. Interspeech 2017. 2017. pp. 3532-3536. Available at: https://arxiv.org/pdf/1704.01631 (accessed 23.01.2025).
26. Ioffe S., Szegedy C. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift // Proc. ICML 2015. 2015. pp. 448-456. Available at: https://arxiv.org/abs/1502.03167 (accessed 23.01.2025).
27. Powers D. M. W. Evaluation: from precision, recall and F-measure to ROC, informedness, markedness and correlation // Journal of Machine Learning Technologies, 2(1). 2011. pp. 37-63. Available at: https://arxiv.org/abs/2010.16061 (accessed 23.01.2025).
28. He H., Garcia E. A. Learning from Imbalanced Data // IEEE Transactions on Knowledge and Data Engineering, 21(9). 2009. pp. 1263-1284. doi: 10.1109/TKDE.2008.239.
For citations:
GONCHAROVA O.V. Deep Learning for an Automatic Transcription System Development. Proceedings of the Institute for System Programming of the RAS (Proceedings of ISP RAS). 2025;37(1):145-158. (In Russ.) https://doi.org/10.15514/ISPRAS-2025-37(1)-9