A journal of IEEE and CAA that publishes high-quality papers in English on original theoretical and experimental research and development in all areas of automation
Volume 6 Issue 5
Sep. 2019

IEEE/CAA Journal of Automatica Sinica

Danyang Liu, Ji Xu, Pengyuan Zhang and Yonghong Yan, "Investigation of Knowledge Transfer Approaches to Improve the Acoustic Modeling of Vietnamese ASR System," IEEE/CAA J. Autom. Sinica, vol. 6, no. 5, pp. 1187-1195, Sept. 2019. doi: 10.1109/JAS.2019.1911693

Investigation of Knowledge Transfer Approaches to Improve the Acoustic Modeling of Vietnamese ASR System

doi: 10.1109/JAS.2019.1911693
Funds: This work was partially supported by the National Natural Science Foundation of China (11590770-4, U1536117), the National Key Research and Development Program of China (2016YFB0801203, 2016YFB0801200), the Key Science and Technology Project of the Xinjiang Uygur Autonomous Region (2016A03007-1), and the Pre-research Project for Equipment of General Information System (JZX2017-0994/Y306).
  • Automatic speech recognition (ASR) is well known to be a resource-intensive task: training a state-of-the-art deep neural network acoustic model requires a sufficient amount of data. For low-resource languages, where scripted speech is difficult to obtain, data sparsity is the main problem limiting the performance of speech recognition systems. In this paper, several knowledge transfer methods are investigated to overcome the data sparsity problem with the help of high-resource languages. The first is a pre-training and fine-tuning (PT/FT) method, in which the parameters of the hidden layers are initialized from a well-trained neural network. Second, progressive neural networks (Prognets) are investigated. Thanks to the lateral connections in their architecture, Prognets are immune to the forgetting effect and are superior at transferring knowledge. Finally, bottleneck features (BNF) are extracted using cross-lingual deep neural networks and serve as enhanced features to improve the performance of the ASR system. Experiments are conducted on a low-resource Vietnamese dataset. The results show that all three methods yield significant gains over the baseline system, and that the Prognets acoustic model performs best. Further improvements can be obtained by combining the Prognets model with bottleneck features.
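
The role of the lateral connections can be illustrated with a minimal NumPy sketch of a forward pass through a two-column progressive network. It is only an illustration under stated assumptions, not the architecture used in the paper: the layer sizes, the ReLU activation, the omission of bias terms, and the names `prognet_forward`, `source_W`, `target_W`, and `lateral_U` are all hypothetical. The source column (trained on a high-resource language) is only read, never updated, which is why the forgetting effect does not arise; the target column receives the source activations through the lateral weights.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # shift for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def prognet_forward(x, source_W, target_W, lateral_U):
    """Forward pass of a two-column progressive network (illustrative).

    source_W  : hidden-layer weights of the frozen source-language column
    target_W  : weights of the target-language column (hidden layers + output)
    lateral_U : lateral_U[i - 1] feeds the source activation at depth i
                into layer i of the target column
    """
    h_src, h_tgt = x, x
    last = len(target_W) - 1
    for i, W_t in enumerate(target_W):
        pre = h_tgt @ W_t
        if i > 0:
            # Lateral connection: reuse what the source column has computed.
            pre = pre + h_src @ lateral_U[i - 1]
        if i == last:
            return softmax(pre)
        h_tgt = relu(pre)
        # Advance the source column; its weights are read but never modified.
        h_src = relu(h_src @ source_W[i])

rng = np.random.default_rng(0)

def rand(m, n):
    return rng.standard_normal((m, n)) * 0.1

# Assumed toy sizes: 40-dim input frames, 16 hidden units, 10 output classes.
source_W = [rand(40, 16), rand(16, 16)]
target_W = [rand(40, 16), rand(16, 16), rand(16, 10)]
lateral_U = [rand(16, 16), rand(16, 10)]

x = rng.standard_normal((4, 40))  # a batch of 4 feature frames
posteriors = prognet_forward(x, source_W, target_W, lateral_U)
```

During training of the target column, only `target_W` and `lateral_U` would receive gradient updates in this sketch; `source_W` stays fixed.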

     



    Figures(5)  / Tables(6)


    Highlights

    • Several knowledge transfer methods are investigated to overcome the data sparsity problem with the help of high-resource languages.
    • The first is a pre-training and fine-tuning (PT/FT) method, in which the parameters of the hidden layers are initialized from a well-trained neural network.
    • Second, progressive neural networks (Prognets) are investigated. Thanks to the lateral connections in their architecture, Prognets are immune to the forgetting effect and are superior at transferring knowledge.
    • Finally, bottleneck features (BNF) are extracted using cross-lingual deep neural networks and serve as enhanced features to improve the performance of the ASR system.
    • Experiments are conducted on a low-resource Vietnamese dataset. The results show that all three methods yield significant gains over the baseline system, and that the Prognets acoustic model performs best. Further improvements can be obtained by combining the Prognets model with bottleneck features.
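
    The PT/FT initialization and the bottleneck feature extraction summarized in the highlights can be sketched as follows. This is a minimal NumPy illustration under assumptions, not the paper's implementation: the layer sizes, the ReLU activation, taking the bottleneck output before the nonlinearity, and the names `init_from_pretrained` and `extract_bnf` are all hypothetical.

```python
import numpy as np

def init_from_pretrained(source_layers, num_target_outputs, rng):
    """PT/FT initialization: copy the hidden layers of a trained source
    network and attach a fresh output layer sized for the target language."""
    hidden = [(W.copy(), b.copy()) for W, b in source_layers[:-1]]
    last_hidden_dim = hidden[-1][0].shape[1]
    W_out = rng.standard_normal((last_hidden_dim, num_target_outputs)) * 0.01
    b_out = np.zeros(num_target_outputs)
    return hidden + [(W_out, b_out)]

def extract_bnf(features, layers, bottleneck_index):
    """Run frames through the network and return the linear output of the
    narrow bottleneck layer (assumed here to be taken before the ReLU)."""
    h = features
    for i, (W, b) in enumerate(layers):
        z = h @ W + b
        if i == bottleneck_index:
            return z
        h = np.maximum(z, 0.0)
    raise ValueError("bottleneck_index is out of range")

rng = np.random.default_rng(0)

# Assumed toy source network: 40-dim input, 8-unit bottleneck, 100 outputs.
dims = [40, 64, 8, 64, 100]
source_layers = [
    (rng.standard_normal((m, n)) * 0.1, np.zeros(n))
    for m, n in zip(dims[:-1], dims[1:])
]

# PT/FT: the target network starts from the source hidden layers and would
# then be fine-tuned on the low-resource data (training loop not shown).
target_layers = init_from_pretrained(source_layers, num_target_outputs=30, rng=rng)

# BNF: append bottleneck activations to the original acoustic features.
frames = rng.standard_normal((5, 40))
bnf = extract_bnf(frames, source_layers, bottleneck_index=1)
enhanced = np.concatenate([frames, bnf], axis=1)
```

    In this sketch `enhanced` would serve as the input to a separately trained target-language acoustic model, while `target_layers` is the starting point for fine-tuning.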
