A journal of IEEE and CAA that publishes high-quality papers in English on original theoretical and experimental research and development in all areas of automation
Volume 8 Issue 7
Jul. 2021

IEEE/CAA Journal of Automatica Sinica

Pei Liu, Yingjie Zhou, Dezhong Peng and Dapeng Wu, "Global-Attention-Based Neural Networks for Vision Language Intelligence," IEEE/CAA J. Autom. Sinica, vol. 8, no. 7, pp. 1243-1252, July 2021. doi: 10.1109/JAS.2020.1003402

Global-Attention-Based Neural Networks for Vision Language Intelligence

doi: 10.1109/JAS.2020.1003402
Funds:  This work was supported by the National Natural Science Foundation of China (61971296, U19A2078, 61836011, 61801315), the Ministry of Education and China Mobile Research Foundation Project (MCM20180405), and Sichuan Science and Technology Planning Project (2019YFG0495, 2021YFG0301, 2021YFG0317, 2020YFG0319, 2020YFH0186)
  • In this paper, we develop a novel global-attention-based neural network (GANN) for vision language intelligence, specifically image captioning (the language description of a given image). As in many previous works, our model adopts the encoder-decoder framework. The encoder encodes the region proposal features and extracts a global caption feature using a specially designed module that predicts the caption objects. The decoder generates captions by taking the global caption feature, along with the encoded visual features, as input to each attention head of the decoder layer. The global caption feature is introduced to explore the latent contributions of region proposals to image captioning and to help the decoder focus on the most relevant proposals, so that it extracts more accurate visual features at each time step of caption generation. GANN is implemented by incorporating the global caption feature into the attention weight calculation during word prediction in each head of the decoder layer. In our experiments, we qualitatively analyze the proposed model and quantitatively evaluate GANN alongside several state-of-the-art schemes on the MS-COCO dataset. Experimental results demonstrate the effectiveness of the proposed global attention mechanism for image captioning.
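
As a rough illustration of what it can mean to incorporate a global caption feature into the attention weight calculation of one decoder head, the sketch below adds a second score term to ordinary scaled dot-product attention. The abstract does not give the exact fusion rule, so the additive form, the function name `global_attention_head`, and the toy vectors are all assumptions made for illustration, not the authors' formulation.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def global_attention_head(query, keys, values, global_feat):
    """One attention head whose weights also depend on a global feature.

    ASSUMPTION: the global caption feature contributes an extra additive
    score term (global_feat . key). The paper's actual fusion may differ.
    """
    scale = math.sqrt(len(query))
    scores = [(dot(query, k) + dot(global_feat, k)) / scale for k in keys]
    weights = softmax(scores)
    dim_v = len(values[0])
    # Attended visual feature: weighted sum of the region values.
    context = [sum(w * v[j] for w, v in zip(weights, values))
               for j in range(dim_v)]
    return weights, context


# Toy example: three region proposals with 2-D keys and values (hypothetical).
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
query = [0.5, 0.5]

plain_w, _ = global_attention_head(query, keys, values, [0.0, 0.0])
guided_w, ctx = global_attention_head(query, keys, values, [2.0, 0.0])
print("without global feature:", [round(w, 3) for w in plain_w])
print("with global feature:   ", [round(w, 3) for w in guided_w])
```

With a zero global feature the head reduces to standard scaled dot-product attention. A non-zero global feature shifts weight away from regions whose keys do not align with it, which is the intuition behind helping the decoder focus on the most relevant proposals.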

     



    Figures (8) / Tables (3)


    Highlights

    • We propose a model in which global information is incorporated into the attention weight calculation. Because the number of local regions is larger than the number of objects that actually appear in a sentence, we want to activate as few local regions as possible to avoid noise.
    • Experimental analysis: a qualitative analysis of the proposed model and a quantitative evaluation on the MS-COCO dataset.
    • A multi-task learning approach that covers both the global information extraction and the training strategy.
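
    One way to read the multi-task highlight is as a joint objective: a caption word loss plus an auxiliary loss for predicting which objects appear in the caption. The sketch below is a minimal version under that assumption. The loss forms, the weighting parameter `obj_weight`, and all function names are hypothetical and are not taken from the paper.

```python
import math

EPS = 1e-12  # guards against log(0)


def caption_cross_entropy(word_probs, target_ids):
    """Mean negative log-likelihood of the reference words.

    word_probs[t] is the predicted distribution over the vocabulary at step t.
    """
    nll = [-math.log(max(p[t], EPS)) for p, t in zip(word_probs, target_ids)]
    return sum(nll) / len(nll)


def object_bce(obj_probs, obj_labels):
    """Mean binary cross-entropy for multi-label caption-object prediction."""
    terms = [
        -(y * math.log(max(p, EPS)) + (1 - y) * math.log(max(1 - p, EPS)))
        for p, y in zip(obj_probs, obj_labels)
    ]
    return sum(terms) / len(terms)


def multitask_loss(word_probs, target_ids, obj_probs, obj_labels,
                   obj_weight=1.0):
    """ASSUMED joint objective: caption loss + obj_weight * object loss."""
    return (caption_cross_entropy(word_probs, target_ids)
            + obj_weight * object_bce(obj_probs, obj_labels))


# Toy example: a 2-word caption over a 2-word vocabulary, 2 candidate objects.
word_probs = [[0.5, 0.5], [0.25, 0.75]]
target_ids = [0, 1]
obj_probs = [0.5, 0.5]
obj_labels = [1, 0]
print(multitask_loss(word_probs, target_ids, obj_probs, obj_labels))
```

    Setting `obj_weight` to zero recovers plain caption training, so the auxiliary term can be viewed as a regularizer that pushes the encoder to keep information about which objects the caption mentions.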
