Investigation of Knowledge Transfer Approaches to Improve the Acoustic Modeling of Vietnamese ASR System

Danyang Liu; Ji Xu; Pengyuan Zhang; Yonghong Yan

doi:10.1109/JAS.2019.1911693

Volume 6 Issue 5

Sep. 2019

IEEE/CAA Journal of Automatica Sinica

Danyang Liu, Ji Xu, Pengyuan Zhang and Yonghong Yan, "Investigation of Knowledge Transfer Approaches to Improve the Acoustic Modeling of Vietnamese ASR System," IEEE/CAA J. Autom. Sinica, vol. 6, no. 5, pp. 1187-1195, Sept. 2019. doi: 10.1109/JAS.2019.1911693

Funds: This work was partially supported by the National Natural Science Foundation of China (11590770-4, U1536117), the National Key Research and Development Program of China (2016YFB0801203, 2016YFB0801200), the Key Science and Technology Project of the Xinjiang Uygur Autonomous Region (2016A03007-1), and the Pre-research Project for Equipment of General Information System (JZX2017-0994/Y306)

Author Bio:
Danyang Liu received the B.E. degree in 2015 from the School of Science, Beijing Jiaotong University. She is a Ph.D. candidate at the Key Laboratory of Speech Acoustics and Content Understanding, Chinese Academy of Sciences (CAS). Her research interests include large vocabulary continuous speech recognition and multi-lingual speech recognition.

Ji Xu received the B.E. degree in 2008 from Tsinghua University, and the Ph.D. degree in 2013 from the Key Laboratory of Speech Acoustics and Content Understanding, Chinese Academy of Sciences (CAS). His research interests include large vocabulary continuous speech recognition and multi-lingual speech recognition.

Pengyuan Zhang received the Ph.D. degree in information and signal processing from the Institute of Acoustics, Chinese Academy of Sciences, in 2007. From 2013 to 2014, he was a research scholar at the University of Sheffield. He is currently a Professor at the Speech Acoustics and Content Understanding Laboratory, Chinese Academy of Sciences. His research is focused on spontaneous speech recognition.

Yonghong Yan received the B.E. degree from Tsinghua University in 1990, and the Ph.D. degree in 1995 from the Oregon Graduate Institute (OGI). He worked at OGI as an Assistant Professor (1995), Associate Professor (1998), and Associate Director (1997) of the Center for Spoken Language Understanding. He worked at Intel from 1998 to 2001, where he chaired the Human Computer Interface Research Council and served as Principal Engineer of the Microprocessor Research Laboratory and Director of the Intel China Research Center. Currently he is a Professor and Director of the Key Laboratory of Speech Acoustics and Content Understanding, Institute of Acoustics, Chinese Academy of Sciences. His research interests include speech processing and recognition, language/speaker recognition, and human computer interfaces.
Corresponding author: All the authors are with the Key Laboratory of Speech Acoustics and Content Understanding, Institute of Acoustics, Chinese Academy of Sciences, Beijing 100190, China (e-mail: liudanyang@hccl.ioa.ac.cn; xuji@hccl.ioa.ac.cn; zhangpengyuan@hccl.ioa.ac.cn; yanyonghong@hccl.ioa.ac.cn)
Received Date: 2018-02-09
Accepted Date: 2018-09-14

Available Online: 2019-08-02

Abstract

It is well known that automatic speech recognition (ASR) is a resource-consuming task: a sufficient amount of data is needed to train a state-of-the-art deep neural network acoustic model. For low-resource languages, where scripted speech is difficult to obtain, data sparsity is the main problem limiting the performance of speech recognition systems. In this paper, several knowledge transfer methods are investigated to overcome the data sparsity problem with the help of high-resource languages. The first is a pre-training and fine-tuning (PT/FT) method, in which the parameters of the hidden layers are initialized from a well-trained neural network. Second, progressive neural networks (Prognets) are investigated. Thanks to the lateral connections in their architecture, Prognets are immune to the forgetting effect and are superior in knowledge transfer. Finally, bottleneck features (BNF) are extracted using cross-lingual deep neural networks and serve as enhanced features to improve the performance of the ASR system. Experiments are conducted on a low-resource Vietnamese dataset. The results show that all three methods yield significant gains over the baseline system, and that the Prognets acoustic model performs best. Further improvements can be obtained by combining the Prognets model with bottleneck features.
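
The three transfer approaches named in the abstract can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the helper names (`make_dnn`, `init_from_source`, `prognet_forward`, `enhanced_features`), the layer sizes, the ReLU activation, and the choice to append the bottleneck features to the input features are all assumptions made for illustration, and no training step is shown.

```python
import copy
import random

def make_layer(n_in, n_out, rng):
    """One affine layer: n_out rows of n_in small random weights, zero biases."""
    rows = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return {"W": rows, "b": [0.0] * n_out}

def make_dnn(sizes, rng):
    """Feed-forward net as a list of layers; the last layer is the output layer."""
    return [make_layer(a, b, rng) for a, b in zip(sizes[:-1], sizes[1:])]

def affine(layer, x):
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(layer["W"], layer["b"])]

def relu(v):
    return [max(0.0, a) for a in v]

def hidden_activations(dnn, x):
    """Activations of every hidden layer (all layers except the output layer)."""
    acts, h = [], x
    for layer in dnn[:-1]:
        h = relu(affine(layer, h))
        acts.append(h)
    return acts

# 1) PT/FT: copy the hidden layers of a well-trained source-language net and
#    attach a fresh output layer for the target language (fine-tuning not shown).
def init_from_source(source_dnn, n_target_outputs, rng):
    hidden = copy.deepcopy(source_dnn[:-1])
    n_last_hidden = len(source_dnn[-1]["W"][0])
    return hidden + [make_layer(n_last_hidden, n_target_outputs, rng)]

# 2) Prognets: the source column is only read, never updated; every target
#    layer after the first also receives the previous source hidden activation
#    through a lateral connection.
def prognet_forward(source_dnn, target_dnn, laterals, x):
    src_acts = hidden_activations(source_dnn, x)
    h, last = x, len(target_dnn) - 1
    for i, layer in enumerate(target_dnn):
        z = affine(layer, h)
        if i > 0:
            lateral = affine(laterals[i - 1], src_acts[i - 1])
            z = [a + b for a, b in zip(z, lateral)]
        h = z if i == last else relu(z)
    return h

# 3) BNF: read the activations of a narrow hidden layer of a cross-lingual net
#    and append them to the original features (appending is an assumption).
def enhanced_features(bn_dnn, bn_index, x):
    return x + hidden_activations(bn_dnn, x)[bn_index]
```

In this sketch only the target column and the lateral layers would be trained on the low-resource data, so the source-language parameters cannot be overwritten, which is one way to read the abstract's remark about the forgetting effect.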
