Hierarchical Reinforcement Learning With Automatic Sub-Goal Identification

Chenghao Liu; Fei Zhu; Quan Liu; Yuchen Fu

doi:10.1109/JAS.2021.1004141

Volume 8 Issue 10

Oct. 2021

IEEE/CAA Journal of Automatica Sinica



C. H. Liu, F. Zhu, Q. Liu, and Y. C. Fu, "Hierarchical Reinforcement Learning With Automatic Sub-Goal Identification," IEEE/CAA J. Autom. Sinica, vol. 8, no. 10, pp. 1686-1696, Oct. 2021. doi: 10.1109/JAS.2021.1004141


Chenghao Liu^1, Fei Zhu^1, Quan Liu^1, Yuchen Fu^2

1. School of Computer Science and Technology, Soochow University, Suzhou 215006, China
2. School of Computer Science and Engineering, Changshu Institute of Technology, Changshu 215500, China

Funds: This work was supported by the National Natural Science Foundation of China (61303108), the Suzhou Key Industries Technological Innovation-Prospective Applied Research Project (SYG201804), a Project Funded by the Priority Academic Program Development of Jiangsu Higher Education Institutions (PAPD), and the Fundamental Research Funds for the Central Universities, JLU (93K172020K25).

Abstract

In reinforcement learning, an agent may explore ineffectively in sparse-reward tasks, where reward points are difficult to find. To address this problem, we propose hierarchical deep reinforcement learning with automatic sub-goal identification via computer vision (HADS), an algorithm that uses hierarchical reinforcement learning to alleviate the sparse reward problem and a sub-goal mechanism to improve exploration efficiency. HADS uses a computer vision method to identify sub-goals automatically for hierarchical deep reinforcement learning. Because not all sub-goal points are reachable, a mechanism is proposed to remove unreachable sub-goal points and thereby further improve the performance of the algorithm. HADS applies contour recognition to the state image: some salient states in the image may be recognized as sub-goals, while those that are not are removed based on prior knowledge. Experiments verify the effectiveness of the algorithm.
Keywords:
- Hierarchical control
- hierarchical reinforcement learning
- option
- sparse reward
- sub-goal
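
The sub-goal identification and filtering steps described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it uses 4-connected component labelling on a thresholded grid as a simple stand-in for contour recognition, and the function names, the brightness threshold, and the `reachable` mask (standing in for "prior knowledge") are all assumptions made for illustration.

```python
from collections import deque


def find_subgoal_candidates(image, threshold=128):
    """Return the centroid (row, col) of each bright 4-connected region.

    Hypothetical stand-in for contour recognition on a state image.
    """
    rows, cols = len(image), len(image[0])
    seen = [[False] * cols for _ in range(rows)]
    centroids = []
    for r in range(rows):
        for c in range(cols):
            if image[r][c] < threshold or seen[r][c]:
                continue
            # Breadth-first flood fill collects one salient region.
            queue, pixels = deque([(r, c)]), []
            seen[r][c] = True
            while queue:
                y, x = queue.popleft()
                pixels.append((y, x))
                for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    ny, nx = y + dy, x + dx
                    if (0 <= ny < rows and 0 <= nx < cols
                            and not seen[ny][nx]
                            and image[ny][nx] >= threshold):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            cy = sum(p[0] for p in pixels) / len(pixels)
            cx = sum(p[1] for p in pixels) / len(pixels)
            centroids.append((cy, cx))
    return centroids


def remove_unreachable(candidates, reachable_mask):
    """Keep candidates whose truncated centroid lies in a reachable cell."""
    return [(y, x) for y, x in candidates if reachable_mask[int(y)][int(x)]]


# Toy 5x5 "state image" with two bright regions.
frame = [
    [0,   0,   0, 0,   0],
    [0, 255, 255, 0,   0],
    [0, 255, 255, 0,   0],
    [0,   0,   0, 0, 200],
    [0,   0,   0, 0, 200],
]
# Assumed prior knowledge: the rightmost column cannot be reached.
reachable = [[col < 4 for col in range(5)] for _ in range(5)]

candidates = find_subgoal_candidates(frame)
subgoals = remove_unreachable(candidates, reachable)
print(candidates)  # two candidate regions
print(subgoals)    # only the reachable one remains
```

In a real pipeline the candidate step would more likely be a contour-detection routine from an image library applied to the rendered frame; the filtering step is kept separate here so that the source of prior knowledge can be swapped without touching detection.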




Highlights

The sub-goals are detected automatically via computer vision.
The agent receives an inner reward after completing the sub-goal.
The combined input of the sub-goal and the image achieves a better result.
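
The last two highlights can be made concrete with a minimal sketch. The page does not give the reward definition or the input encoding used in the paper, so everything below is an assumption for illustration: the Manhattan-distance completion test, the `tolerance` and `bonus` parameters, and the choice to encode the sub-goal as a one-hot mask stacked with the image as a second channel.

```python
def inner_reward(agent_pos, subgoal, tolerance=1, bonus=1.0):
    """Return `bonus` when the agent is within `tolerance` cells of the sub-goal.

    Hypothetical completion test based on Manhattan distance.
    """
    distance = abs(agent_pos[0] - subgoal[0]) + abs(agent_pos[1] - subgoal[1])
    return bonus if distance <= tolerance else 0.0


def combine_input(image, subgoal):
    """Stack the state image with a one-hot sub-goal mask as a second channel."""
    rows, cols = len(image), len(image[0])
    mask = [[1 if (r, c) == subgoal else 0 for c in range(cols)]
            for r in range(rows)]
    return [image, mask]


state = [[0, 0], [0, 9]]
observation = combine_input(state, (1, 1))
print(inner_reward((2, 2), (2, 3)))  # agent adjacent to the sub-goal
print(inner_reward((0, 0), (2, 3)))  # agent far from the sub-goal
```

Feeding the sub-goal alongside the image lets a single low-level policy condition on whichever sub-goal the high-level controller has selected, rather than training one policy per sub-goal.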
