This paper presents the development of a real-time automatic speech recognition (ASR) system for the Azerbaijani language, addressing the persistent lack of speech recognition systems for underrepresented languages. Our approach combines Hidden Markov Models and Deep Neural Networks in a hybrid acoustic model (HMM-DNN) to capture the complexities of Azerbaijani acoustic patterns. Because Azerbaijani is agglutinative, word-level vocabularies grow rapidly through suffixation; the system therefore employs a syllable-based n-gram language model, which keeps the vocabulary manageable while still capturing the syntax and semantics of Azerbaijani speech. Real-time operation is enabled by WebSocket technology, which provides efficient bidirectional client-server communication for processing streaming speech data with minimal latency. The Kaldi and SRILM toolkits are used to train the acoustic and language models, respectively, contributing to the system's robust performance and adaptability. Comprehensive experiments corroborate the utility of the syllable-based subword modeling approach for Azerbaijani: the proposed system outperforms other systems evaluated on the same language data in both recognition accuracy and response time. Beyond Azerbaijani, the system provides a valuable framework for other agglutinative languages, thereby contributing to linguistic diversity in automatic speech recognition technology.
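To illustrate the idea behind syllable-based subword modeling, the sketch below segments Azerbaijani words into syllables with a simple vowel-anchored heuristic and counts syllable bigrams. The syllabification rule, the function names, and the `AZ_VOWELS` set are illustrative assumptions for this sketch only; the paper's actual language models are trained with SRILM, not this code.

```python
from collections import Counter

# Azerbaijani vowel inventory (9 vowels) -- used to anchor syllables.
AZ_VOWELS = set("aeəıioöuü")

def syllabify(word):
    """Split a word into syllables with a simple heuristic:
    each syllable carries exactly one vowel, and of any
    intervocalic consonant cluster, the last consonant opens
    the next syllable (a CV-onset preference)."""
    word = word.lower()
    vowel_pos = [i for i, ch in enumerate(word) if ch in AZ_VOWELS]
    if not vowel_pos:
        return [word]  # no vowel: leave the token whole
    cuts = []
    for a, b in zip(vowel_pos, vowel_pos[1:]):
        # boundary just before the last consonant between two vowels
        cuts.append(b if b - a == 1 else b - 1)
    return [word[i:j] for i, j in zip([0] + cuts, cuts + [len(word)])]

def syllable_bigrams(sentence):
    """Count syllable bigrams across a whitespace-tokenized sentence,
    the raw statistic underlying a syllable-level n-gram model."""
    syls = [s for w in sentence.split() for s in syllabify(w)]
    return Counter(zip(syls, syls[1:]))
```

For example, `syllabify("kitablar")` yields `["ki", "tab", "lar"]`: three syllable units instead of one whole-word token, so inflected forms of the same stem share subword statistics rather than each requiring its own vocabulary entry.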
Chen, S. F., & Goodman, J. (1999). An empirical study of smoothing techniques for language modeling. Computer Speech & Language, 13(4), 359–393. https://doi.org/10.1006/csla.1999.0128
Hinton, G., Deng, L., Yu, D., Dahl, G., Mohamed, A. R., Jaitly, N., Senior, A., Vanhoucke, V., Nguyen, P., Sainath, T., & Kingsbury, B. (2012). Deep Neural Networks for Acoustic Modeling in Speech Recognition: The Shared Views of Four Research Groups. IEEE Signal Processing Magazine, 29(6), 82–97. https://doi.org/10.1109/msp.2012.2205597
Jurafsky, D., & Martin, J. H. (2009). Speech and Language Processing (2nd ed., pp. 83-122). Prentice Hall.
MDN contributors. (n.d.). The WebSocket API. MDN Web Docs. Retrieved May 13, 2023.
Peddinti, V., Povey, D., & Khudanpur, S. (2015). A time delay neural network architecture for efficient modeling of long temporal contexts. Interspeech 2015.
Povey, D., Ghoshal, A., Boulianne, G., Burget, L., Glembek, O., Goel, N.K., Hannemann, M., Motlícek, P., Qian, Y., Schwarz, P., Silovský, J., Stemmer, G., & Veselý, K. (2011). The Kaldi Speech Recognition Toolkit.
Povey, D., Peddinti, V., Galvez, D., Ghahremani, P., Manohar, V., Na, X., Wang, Y., & Khudanpur, S. (2016). Purely Sequence-Trained Neural Networks for ASR Based on Lattice-Free MMI. Interspeech 2016.
Rabiner, L. (1989). A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2), 257–286. https://doi.org/10.1109/5.18626
Rustamov, S., Akhundova, N., & Valizada, A. (2019). Automatic Speech Recognition in Taxi Call Service Systems. Springer.
Stolcke, A. (2002). SRILM - an extensible language modeling toolkit. 7th International Conference on Spoken Language Processing (ICSLP 2002).
The Azerbaijan State News Agency (2022). [Dataset]. https://azertag.az/
Valizada, A. (2021). Subword Speech Recognition for Agglutinative Languages. 2021 IEEE 15th International Conference on Application of Information and Communication Technologies (AICT).
Valizada, A., Akhundova, N., & Rustamov, S. (2021). Development of Speech Recognition Systems in Emergency Call Centers. MDPI.
Wikimedia Foundation. (2023). azwiki dump (20230320) [Dataset]. https://dumps.wikimedia.org/azwiki/20230320/