CLASSIFICATION OF AUTOMATIC SPEECH RECOGNITION SYSTEMS: FROM TRADITIONAL MODELS TO DEEP NEURAL NETWORKS

Authors

  • O.S. Atykenov, Military Engineering Institute of Radio Electronics and Communications, Almaty, Republic of Kazakhstan
  • A.B. Bakasova, Institute of Mechanical Engineering and Automation of the National Academy of Sciences of the Kyrgyz Republic

Keywords:

automatic speech recognition, deep neural networks, HMM, transformers, end-to-end

Abstract

The article presents a comprehensive analysis of the evolution of architectural approaches in automatic speech recognition (ASR) systems, covering the transition from statistical methods to modern end-to-end solutions. The study aims to develop a multi-level classification based on acoustic modeling principles and decoding strategies. The evolution from HMM-GMM to hybrid and fully neural architectures, including transformers and self-supervised models, is examined. The results confirm the effectiveness of the proposed classification.

References

Rabiner L. R. A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition // Proceedings of the IEEE. — 1989. — Vol. 77, No. 2. — P. 257–286.

Hinton G., Deng L., Yu D., Dahl G. E., Mohamed A., Jaitly N., Senior A., Vanhoucke V., Nguyen P., Sainath T. N., Kingsbury B. Deep Neural Networks for Acoustic Modeling in Speech Recognition // IEEE Signal Processing Magazine. — 2012. — Vol. 29, No. 6. — P. 82–97.

Graves A., Mohamed A., Hinton G. Speech Recognition with Deep Recurrent Neural Networks // Proceedings of ICASSP. — 2013. — P. 6645–6649.

Graves A., Fernández S., Gomez F., Schmidhuber J. Connectionist Temporal Classification: Labelling Unsegmented Sequence Data with Recurrent Neural Networks // Proceedings of ICML. — 2006. — P. 369–376.

Chan W., Jaitly N., Le Q. V., Vinyals O. Listen, Attend and Spell: A Neural Network for Large Vocabulary Conversational Speech Recognition // Proceedings of ICASSP. — 2016. — P. 4960–4964.

He Y., Sainath T. N., Prabhavalkar R., McGraw I., Alvarez R., Zhao D., Rybach D., Kannan A., Wu Y., Pang R. et al. Streaming End-to-End Speech Recognition for Mobile Devices // Proceedings of ICASSP. — 2019. — P. 6381–6385.

Dong L., Xu S., Xu B. Speech-Transformer: A No-Recurrence Sequence-to-Sequence Model for Speech Recognition // Proceedings of ICASSP. — 2018. — P. 5884–5888.

Gulati A., Qin J., Chiu C.-C., Parmar N., Zhang Y., Yu J., Han W., Wang S., Zhang Z., Wu Y., Pang R. Conformer: Convolution-Augmented Transformer for Speech Recognition // Proceedings of Interspeech. — 2020. — P. 5036–5040.

Baevski A., Zhou H., Mohamed A., Auli M. wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations // Advances in Neural Information Processing Systems (NeurIPS). — 2020. — Vol. 33. — P. 12449–12460.

Hsu W.-N., Bolte B., Tsai Y.-H. H., Lakhotia K., Salakhutdinov R., Mohamed A. HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units // IEEE/ACM Transactions on Audio, Speech, and Language Processing. — 2021. — Vol. 29. — P. 3451–3460.

Radford A., Kim J. W., Xu T., Brockman G., McLeavey C., Sutskever I. Robust Speech Recognition via Large-Scale Weak Supervision // arXiv preprint. — 2022. — arXiv:2212.04356.

Prabhavalkar R., Hori T., Sainath T. N., Schlüter R., Watanabe S. End-to-End Speech Recognition: A Survey // IEEE/ACM Transactions on Audio, Speech, and Language Processing. — 2024. — Vol. 32. — P. 325–351.

Nayeem M. et al. Automatic Speech Recognition in the Modern Era: Architectures, Training, and Evaluation // arXiv preprint. — 2025. — arXiv:2510.12827.

Tabrej M. S., Deb K. J., Hakim M. A., Goswami S., Nayeem M. Integrating Speech Recognition into Intelligent Information Systems: From Statistical Models to Deep Learning // Informatics (MDPI). — 2025.

Li B., Chang S.-Y., Sainath T. N., Pang R., He Y., Strohman T., Wu Y. Towards Fast and Accurate Streaming End-to-End ASR // Proceedings of ICASSP. — 2020. — P. 6069–6073.

Zhang Y., Sun L., Watanabe S., Zhang Z., Yu D. WeNet: Production Oriented Streaming and Non-Streaming End-to-End Speech Recognition Toolkit // arXiv preprint. — 2020. — arXiv:2010.16051.

Wang D., Li L. Speech Technology in the Era of Large Models: Progress and Challenges // Acta Automatica Sinica. — 2023. — Vol. 49, No. 1. — P. 1–30 (in Chinese).

Yu D., Deng L. Automatic Speech Recognition: A Deep Learning Approach. — London: Springer, 2015. — 330 p.

Panayotov V., Chen G., Povey D., Khudanpur S. LibriSpeech: An ASR Corpus Based on Public Domain Audio Books // Proceedings of ICASSP. — 2015. — P. 5206–5210.

Baevski A., Auli M., Mohamed A. Effectiveness of Self-Supervised Pre-Training for Speech Recognition // arXiv preprint. — 2019. — arXiv:1911.03912.

Schneider S., Baevski A., Collobert R., Auli M. wav2vec: Unsupervised Pre-Training for Speech Recognition // Proceedings of Interspeech. — 2019. — P. 3465–3469.

Vaswani A., Shazeer N., Parmar N., Uszkoreit J., Jones L., Gomez A. N., Kaiser L., Polosukhin I. Attention Is All You Need // Advances in Neural Information Processing Systems (NeurIPS). — 2017. — Vol. 30. — P. 5998–6008.

Published

2026-05-07

Section

INFORMATION TECHNOLOGY AND INFORMATION PROCESSING