The International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences
Articles | Volume XLIV-2/W1-2021
Int. Arch. Photogramm. Remote Sens. Spatial Inf. Sci., XLIV-2/W1-2021, 85–89, 2021
https://doi.org/10.5194/isprs-archives-XLIV-2-W1-2021-85-2021

15 Apr 2021

A NOVEL TASK-ORIENTED APPROACH TOWARD AUTOMATED LIP-READING SYSTEM IMPLEMENTATION

D. Ivanko and D. Ryumin
  • St. Petersburg Institute for Informatics and Automation of the Russian Academy of Sciences, St. Petersburg Federal Research Center of the Russian Academy of Sciences, SPC RAS, Saint-Petersburg, Russian Federation

Keywords: Automated lip-reading, Deep neural networks, Hidden Markov models, Geometric features, Region-of-interest detection

Abstract. Visual information plays a key role in automatic speech recognition (ASR) when the audio is corrupted by background noise, or even inaccessible. Speech recognition using visual information is called lip-reading. The initial idea of visual speech recognition comes from human experience: we are able to recognize spoken words by observing a speaker's face, with limited or no access to the sound of the voice. Based on the conducted experimental evaluations, as well as on an analysis of the research field, we propose a novel task-oriented approach towards practical lip-reading system implementation. It is intended as a roadmap for researchers who need to build a reliable visual speech recognition system for their task. As a rough approximation, the task of lip-reading can be divided into two parts, depending on the complexity of the problem. The first is recognition of isolated words, numbers or short phrases (e.g., telephone numbers with a strict grammar, or keywords); the second is recognition of continuous speech (phrases or sentences). All these stages are described in detail in this paper. Based on the proposed approach, we implemented from scratch automatic visual speech recognition systems with three different architectures: GMM-CHMM, DNN-HMM and purely end-to-end. The methodology, tools, step-by-step development process and all necessary parameters are described in detail in the current paper. It is worth noting that for Russian speech recognition, such systems were created for the first time.
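To illustrate the kind of geometric features named in the keywords, the following is a minimal sketch of feature extraction from 2D lip landmarks. The landmark names and the feature set here are illustrative assumptions, not the paper's actual landmark scheme or feature vector:

```python
import math

def geometric_lip_features(landmarks):
    """Compute simple geometric features from 2D lip landmark points.

    `landmarks` maps names to (x, y) coordinates. The names used below
    (left_corner, right_corner, upper_mid, lower_mid) are hypothetical;
    a real system would typically take them from a facial landmark
    detector's mouth points.
    """
    left, right = landmarks["left_corner"], landmarks["right_corner"]
    top, bottom = landmarks["upper_mid"], landmarks["lower_mid"]
    width = math.dist(left, right)    # horizontal mouth opening
    height = math.dist(top, bottom)   # vertical mouth opening
    return {
        "width": width,
        "height": height,
        # aspect ratio distinguishes open vs. closed mouth shapes
        "aspect_ratio": height / width if width else 0.0,
    }

# Example: a mouth 4 units wide and 2 units tall
feats = geometric_lip_features({
    "left_corner": (0, 0), "right_corner": (4, 0),
    "upper_mid": (2, -1), "lower_mid": (2, 1),
})
```

Per-frame features of this kind would then feed the GMM-CHMM or DNN-HMM back-ends as observation vectors, whereas an end-to-end architecture would instead consume the mouth-region pixels directly.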