TY - GEN
T1 - Convolutional Grid Long Short-Term Memory Recurrent Neural Network for Automatic Speech Recognition
AU - Xue, Jiabin
AU - Zheng, Tieran
AU - Han, Jiqing
N1 - Publisher Copyright: © Springer Nature Switzerland AG 2019.
PY - 2019
Y1 - 2019
AB - The Grid Long Short-Term Memory (Grid-LSTM), which consists of three steps (two-dimensional grid splitting, local feature projection, and grid sequence modeling), has been widely used in Automatic Speech Recognition (ASR) tasks because of its strong time-frequency modeling ability. However, the network requires heavy computing time. An analysis of its process shows that the cause lies in the last step, where two cross-working LSTMs are employed to model the time-frequency features in the grid. We therefore speed up the Grid-LSTM by using a smaller grid and propose two enhanced Grid-LSTM models, Convolutional Grid-LSTM (ConvGrid-LSTM) and Multichannel ConvGrid-LSTM (MCConvGrid-LSTM), which reduce the grid size along the two dimensions of the Grid-LSTM, respectively. Along the frequency axis, we use a large frequency stride and embed a CNN in the Grid-LSTM to prevent performance loss. Along the time axis, we model several adjacent frames using the multichannel processing ability of the CNN. Our method achieves (formula presented) relative reduction of training time and (formula presented) relative reduction of Word Error Rate (WER) for a character-level End-to-End ASR task.
KW - Automatic Speech Recognition
KW - Convolutional Neural Network
KW - Grid-LSTM
UR - https://www.scopus.com/pages/publications/85078461109
U2 - 10.1007/978-3-030-36802-9_76
DO - 10.1007/978-3-030-36802-9_76
M3 - Conference contribution
AN - SCOPUS:85078461109
SN - 9783030368012
T3 - Communications in Computer and Information Science
SP - 718
EP - 726
BT - Neural Information Processing - 26th International Conference, ICONIP 2019, Proceedings
A2 - Gedeon, Tom
A2 - Wong, Kok Wai
A2 - Lee, Minho
PB - Springer
T2 - 26th International Conference on Neural Information Processing, ICONIP 2019
Y2 - 12 December 2019 through 15 December 2019
ER -