A journal of IEEE and CAA, publishing high-quality papers in English on original theoretical/experimental research and development in all areas of automation
Volume 1, Issue 3, Jul. 2014

IEEE/CAA Journal of Automatica Sinica

  • JCR Impact Factor: 6.171, Top 11% (SCI Q1)
  • CiteScore: 11.2, Top 5% (Q1)
  • Google Scholar h5-index: 51, Top 8
Aleksandra Faust, Peter Ruymgaart, Molly Salman, Rafael Fierro and Lydia Tapia, "Continuous Action Reinforcement Learning for Control-Affine Systems with Unknown Dynamics," IEEE/CAA J. of Autom. Sinica, vol. 1, no. 3, pp. 323-336, 2014.

Continuous Action Reinforcement Learning for Control-Affine Systems with Unknown Dynamics

Funds:

This work was supported by New Mexico Space Grant, Computing Research Association CRA-W Distributed Research Experience for Undergraduates, NSF (ECCS #1027775), Army Research Laboratory (#W911NF-08-2-0004), National Institutes of Health (NIH) (P20GM110907) to the Center for Evolutionary and Theoretical Immunology.

  • Control of nonlinear systems is challenging in real time. Decision making, performed many times per second, must ensure system safety. Designing input to perform a task often involves solving a nonlinear system of differential equations, a computationally intensive, if not intractable, problem. This article proposes sampling-based task learning for control-affine nonlinear systems through the combined learning of both state and action-value functions in a model-free approximate value iteration setting with continuous inputs. A quadratic negative definite state-value function implies the existence of a unique maximum of the action-value function at any state. This allows the replacement of the standard greedy policy with a computationally efficient policy approximation that guarantees progression to a goal state without knowledge of the system dynamics. The policy approximation is consistent, i.e., it does not depend on the action samples used to calculate it. This method is appropriate for mechanical systems with high-dimensional input spaces and unknown dynamics performing Constraint-Balancing Tasks. We verify it both in simulation and experimentally for an unmanned aerial vehicle (UAV) carrying a suspended load, and in simulation, for the rendezvous of heterogeneous robots.
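The policy approximation the abstract describes exploits the fact that, for a control-affine system with a quadratic negative definite state-value function, the action-value function has a unique maximum in each input coordinate. A minimal sketch of that idea follows; the function names, the per-axis sampling scheme, and the three-point quadratic fit are illustrative assumptions, not the authors' exact algorithm.

```python
import numpy as np

def axial_policy(Q, dim, a_max, samples=(-1.0, 0.0, 1.0)):
    """Sketch of a consistent, sample-based policy approximation.

    Assumes Q(a), the action-value estimate at the current state, is
    concave quadratic in each action coordinate (implied by a quadratic
    negative definite state-value function for a control-affine system).
    For each input axis we evaluate Q at three sample actions, fit a
    parabola, and take its vertex, clamped to the admissible input range
    [-a_max, a_max].  Because an exact quadratic is determined by any
    three distinct points, the result does not depend on which action
    samples are used -- the consistency property noted in the abstract.
    """
    a_star = np.zeros(dim)
    for i in range(dim):
        xs, ys = [], []
        for s in samples:
            a = a_star.copy()
            a[i] = s * a_max            # probe along axis i only
            xs.append(a[i])
            ys.append(Q(a))
        c2, c1, _ = np.polyfit(xs, ys, 2)   # q(a_i) = c2*a_i^2 + c1*a_i + c0
        if c2 < 0:                          # concave: interior maximum at vertex
            a_i = -c1 / (2.0 * c2)
        else:                               # degenerate fit: best sampled action
            a_i = xs[int(np.argmax(ys))]
        a_star[i] = np.clip(a_i, -a_max, a_max)
    return a_star
```

With a separable concave quadratic Q, the axis-by-axis vertex search recovers the exact greedy action in a single pass, at a cost of only a few Q evaluations per input dimension, which is what makes it attractive for high-dimensional input spaces.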

