Global-Attention-Based Neural Networks for Vision Language Intelligence

Pei Liu; Yingjie Zhou; Dezhong Peng; Dapeng Wu

doi:10.1109/JAS.2020.1003402

Volume 8 Issue 7

Jul. 2021

IEEE/CAA Journal of Automatica Sinica



P. Liu, Y. J. Zhou, D. Z. Peng, and D. P. Wu, "Global-Attention-Based Neural Networks for Vision Language Intelligence," IEEE/CAA J. Autom. Sinica, vol. 8, no. 7, pp. 1243-1252, Jul. 2021. doi: 10.1109/JAS.2020.1003402


Pei Liu^1, Yingjie Zhou^1, Dezhong Peng^{1, 2, 3}, Dapeng Wu^4

1. College of Computer Science, Sichuan University, Chengdu 610065, China
2. Sichuan Zhiqian Technology Co., Ltd., Chengdu 610041, China
3. Shenzhen Peng Cheng Laboratory, Shenzhen 518052, China
4. Department of Electrical and Computer Engineering, University of Florida, Gainesville, FL 32611, USA

Funds: This work was supported by the National Natural Science Foundation of China (61971296, U19A2078, 61836011, 61801315), the Ministry of Education and China Mobile Research Foundation Project (MCM20180405), and Sichuan Science and Technology Planning Project (2019YFG0495, 2021YFG0301, 2021YFG0317, 2020YFG0319, 2020YFH0186)

Author Bio:
Pei Liu received the M.Sc. degree in applied mathematics in 2015 from Chengdu University of Information Engineering, and is now a Ph.D. candidate at Sichuan University. He was a Visiting Scholar with the Department of Electrical and Computer Engineering, University of Florida, Gainesville, FL, USA. His research interests include natural language processing, computer vision, and machine learning.

Yingjie Zhou (M’14) received the Ph.D. degree from the School of Communication and Information Engineering, University of Electronic Science and Technology of China (UESTC), China, in 2013. He is currently an Assistant Professor at the College of Computer Science at Sichuan University (SCU), China. He was a Visiting Scholar in the Department of Electrical Engineering at Columbia University, New York, USA. His research interests include network management, behavioral data analysis, resource allocation, and neural networks.

Dezhong Peng (M’09) received the B.Sc. degree in applied mathematics, and the M.Sc. and Ph.D. degrees in computer software and theory from the University of Electronic Science and Technology of China, in 1998, 2001, and 2006, respectively. From 2001 to 2007, he was with the University of Electronic Science and Technology of China as an Assistant Lecturer and a Lecturer. He was a Post-Doctoral Research Fellow with the School of Engineering, Deakin University, Burwood, VIC, Australia, from 2007 to 2009. He is currently a Professor with the Machine Intelligence Laboratory, College of Computer Science, Sichuan University, China. His research interests include artificial intelligence and big data.

Dapeng Wu (S’98–M’04–SM’06–F’13) received the Ph.D. degree in electrical and computer engineering from Carnegie Mellon University, Pittsburgh, PA, USA, in 2003. He is a Professor with the Department of Electrical and Computer Engineering, University of Florida, Gainesville, FL, USA. His research interests include networking, communications, signal processing, computer vision, machine learning, smart grid, and information and network security.
Corresponding author: Dezhong Peng, e-mail: pengdz@scu.edu.cn
Received Date: 2020-04-11
Revised Date: 2020-05-23
Accepted Date: 2020-06-18

Available Online: 2020-07-08

Abstract

In this paper, we develop a novel global-attention-based neural network (GANN) for vision language intelligence, specifically, image captioning (language description of a given image). As in many previous works, the encoder-decoder framework is adopted in our proposed model. The encoder is responsible for encoding the region proposal features and extracting a global caption feature based on a specially designed module for predicting the caption objects. The decoder generates captions by taking the obtained global caption feature, along with the encoded visual features, as inputs for each attention head of the decoder layer. The global caption feature is introduced to explore the latent contributions of region proposals to image captioning, and to help the decoder better focus on the most relevant proposals so as to extract more accurate visual features at each time step of caption generation. Our GANN is implemented by incorporating the global caption feature into the attention weight calculation phase of the word prediction process in each head of the decoder layer. In our experiments, we qualitatively analyzed the proposed model, and quantitatively evaluated several state-of-the-art schemes with GANN on the MS-COCO dataset. Experimental results demonstrate the effectiveness of the proposed global attention mechanism for image captioning.
Keywords: Global attention, image captioning, latent contribution
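
The idea of letting a global signal influence the attention weights over region proposals can be illustrated with a small sketch. This is not the paper's formulation: the function name `global_gated_attention`, the per-region relevance vector `region_relevance`, and the choice to add its logarithm to the scaled dot-product scores are all assumptions made for illustration. The sketch only shows how a region judged irrelevant by a global signal ends up with a smaller attention weight in a single head.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def global_gated_attention(query, keys, values, region_relevance, eps=1e-9):
    """Single-head attention over region features (hypothetical sketch).

    region_relevance holds one value in (0, 1] per region and stands in for
    a signal derived from a global caption feature (an assumption, not the
    authors' definition). Adding log(relevance) to the scores multiplies
    each unnormalized attention weight by that region's relevance.
    """
    d = query.shape[-1]
    scores = keys @ query / np.sqrt(d)                # shape: (n_regions,)
    scores = scores + np.log(region_relevance + eps)  # global modulation
    weights = softmax(scores)
    return weights @ values, weights

rng = np.random.default_rng(0)
n_regions, d = 5, 8
query = rng.normal(size=d)
keys = rng.normal(size=(n_regions, d))
values = rng.normal(size=(n_regions, d))

# Region 2 is marked as nearly irrelevant by the (hypothetical) global signal.
relevance = np.array([0.9, 0.9, 0.05, 0.9, 0.9])
context, weights = global_gated_attention(query, keys, values, relevance)

# Baseline without global modulation: every region has relevance 1.
_, plain = global_gated_attention(query, keys, values, np.ones(n_regions))
```

Comparing `weights[2]` with `plain[2]` shows the down-weighted region receiving less attention than in the unmodulated baseline, while the weights still sum to one.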


Highlights

We propose a model in which global information is incorporated into the attention weight calculation. Because the number of local regions is larger than the number of objects that actually appear in a sentence, we want to activate as few local regions as possible to avoid noise.
Experimental analysis.
A multi-task learning approach for the global information extraction and the training strategy.
