CLASSIFICATION OF AUTOMATIC SPEECH RECOGNITION SYSTEMS: FROM TRADITIONAL MODELS TO DEEP NEURAL NETWORKS

Authors

O.S. Atykenov, Military Engineering Institute of Radio Electronics and Communications, Almaty, Republic of Kazakhstan
A.B. Bakasova, Institute of Machine Science and Automation, National Academy of Sciences of the Kyrgyz Republic

Keywords

automatic speech recognition, neural networks, HMM, transformers, end-to-end

Abstract

The article analyzes the evolution of architectural approaches in automatic speech recognition systems. The study examines the transition from statistical models to modern end-to-end systems. The aim is to construct a multi-level classification based on acoustic modeling. The results confirm the effectiveness of the proposed classification.
