№2, 2023

DEVELOPMENT OF A REAL-TIME SPEECH RECOGNITION SYSTEM FOR THE AZERBAIJANI LANGUAGE
Alakbar T. Valizada

This paper investigates the development of a real-time automatic speech recognition (ASR) system for the Azerbaijani language, addressing the prevalent gap in speech recognition systems for underrepresented languages. Our research integrates a hybrid acoustic modeling approach that combines Hidden Markov Models (HMMs) and Deep Neural Networks (DNNs) to model the complexities of Azerbaijani acoustic patterns effectively. Recognizing the agglutinative nature of Azerbaijani, the ASR system employs a syllable-based n-gram model for language modeling, ensuring that the system accurately captures the syntax and semantics of Azerbaijani speech. To enable real-time capabilities, we incorporate WebSocket technology, which provides the efficient bidirectional client-server communication needed to process streaming speech data with low latency. The Kaldi and SRILM toolkits are used to train the acoustic and language models, respectively, contributing to the system's robust performance and adaptability. We conducted comprehensive experiments to test the effectiveness of our system, and the results strongly corroborate the utility of the syllable-based subword modeling approach for Azerbaijani speech recognition. Our proposed ASR system shows superior recognition accuracy and rapid response times, outperforming other systems evaluated on the same language data. The system's success not only benefits Azerbaijani speech recognition but also provides a valuable framework for future applications in other agglutinative languages, thereby contributing to the promotion of linguistic diversity in automatic speech recognition technology (pp. 55–60).
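The syllable-based subword approach described above splits each word into syllable units before n-gram counting, which keeps the vocabulary small despite Azerbaijani's rich agglutinative morphology. The sketch below illustrates the idea only: it uses a deliberately naive syllabifier (cut after each vowel, attach word-final consonants to the last syllable) rather than the paper's actual syllabification rules, and the `syllabify`/`syllable_ngrams` names are our own, not from the system described.

```python
from collections import Counter

# Azerbaijani vowel letters (assumption: lowercase input).
VOWELS = set("aeəıioöuü")

def syllabify(word):
    """Naive syllabifier: close a syllable after each vowel and
    attach any trailing consonants to the final syllable.
    Illustration only; real Azerbaijani syllabification also
    handles medial consonant clusters."""
    syllables, current = [], ""
    for ch in word:
        current += ch
        if ch in VOWELS:
            syllables.append(current)
            current = ""
    if current:  # leftover word-final consonants
        if syllables:
            syllables[-1] += current
        else:
            syllables.append(current)
    return syllables

def syllable_ngrams(sentences, n=3):
    """Count syllable n-grams over a corpus, with sentence
    boundary markers as in standard n-gram language modeling."""
    counts = Counter()
    for sentence in sentences:
        units = ["<s>"]
        for word in sentence.split():
            units.extend(syllabify(word.lower()))
        units.append("</s>")
        for i in range(len(units) - n + 1):
            counts[tuple(units[i:i + n])] += 1
    return counts
```

In the full system these counts would be smoothed (e.g. with the modified Kneser-Ney techniques studied by Chen & Goodman) and exported in ARPA format via SRILM, rather than used raw.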

Keywords: Hybrid Speech Recognition, N-gram, Kaldi, Agglutinative languages, Real-time speech recognition, WebSocket
References

Chen, S. F., & Goodman, J. (1999). An empirical study of smoothing techniques for language modeling. Computer Speech & Language, 13(4), 359–393. https://doi.org/10.1006/csla.1999.0128

Hinton, G., Deng, L., Yu, D., Dahl, G., Mohamed, A. R., Jaitly, N., Senior, A., Vanhoucke, V., Nguyen, P., Sainath, T., & Kingsbury, B. (2012). Deep Neural Networks for Acoustic Modeling in Speech Recognition: The Shared Views of Four Research Groups. IEEE Signal Processing Magazine, 29(6), 82–97. https://doi.org/10.1109/msp.2012.2205597

Jurafsky, D., & Martin, J. H. (2009). Speech and Language Processing (2nd ed., pp. 83-122). Prentice Hall.

MDN contributors. (n.d.). The WebSocket API. MDN Web Docs. Retrieved May 13, 2023, from https://developer.mozilla.org/en-US/docs/Web/API/WebSockets_API

Peddinti, V., Povey, D., & Khudanpur, S. (2015). A time delay neural network architecture for efficient modeling of long temporal contexts. Interspeech 2015. https://doi.org/10.21437/interspeech.2015-647

Povey, D., Ghoshal, A., Boulianne, G., Burget, L., Glembek, O., Goel, N. K., Hannemann, M., Motlíček, P., Qian, Y., Schwarz, P., Silovský, J., Stemmer, G., & Veselý, K. (2011). The Kaldi Speech Recognition Toolkit. IEEE 2011 Workshop on Automatic Speech Recognition and Understanding.

Povey, D., Peddinti, V., Galvez, D., Ghahremani, P., Manohar, V., Na, X., Wang, Y., & Khudanpur, S. (2016). Purely Sequence-Trained Neural Networks for ASR Based on Lattice-Free MMI. Interspeech 2016. https://doi.org/10.21437/interspeech.2016-595

Rabiner, L. (1989). A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2), 257–286. https://doi.org/10.1109/5.18626

Rustamov, S., Akhundova, N., & Valizada, A. (2019). Automatic Speech Recognition in Taxi Call Service Systems. Springer. https://doi.org/10.1007/978-3-030-23943-5_18

Stolcke, A. (2002). SRILM - an extensible language modeling toolkit. 7th International Conference on Spoken Language Processing (ICSLP 2002). https://doi.org/10.21437/icslp.2002-303

The Azerbaijan State News Agency. (2022). [Dataset]. https://azertag.az/

Valizada, A. (2021). Subword Speech Recognition for Agglutinative Languages. 2021 IEEE 15th International Conference on Application of Information and Communication Technologies (AICT). https://doi.org/10.1109/aict52784.2021.96204

Valizada, A., Akhundova, N., & Rustamov, S. (2021). Development of Speech Recognition Systems in Emergency Call Centers. Symmetry, 13(4), 634. https://doi.org/10.3390/sym13040634

Wikimedia. (2023). azwiki dump (20230320) [Dataset]. https://dumps.wikimedia.org/azwiki/20230320/