Please use this identifier to cite or link to this item:
http://hdl.handle.net/10603/426487
Title: Algorithms for Challenges to Practical Reinforcement Learning
Researcher: Sindhu, P R
Guide(s): Bhatnagar, Shalabh
Keywords: Automation and Control Systems; Computer Science; Engineering and Technology
University: Indian Institute of Science Bangalore
Completed Date: 2020
Abstract: Reinforcement learning (RL) in real-world applications faces two major hurdles: the foremost is the safety of the physical system controlled by the learning agent, followed by the varying environmental conditions in which the autonomous agent operates. An RL agent learns to control a system by exploring the available actions. In some operating states, an exploratory action may push the system into unsafe operation, creating hazards both for the system and for the humans supervising it. RL algorithms must therefore respect these safety constraints, and must do so with limited available information. Additionally, RL agents learn optimal decisions under the assumption of a stationary environment. This assumption is very restrictive: in many real-world problems, such as traffic signal control and robotic applications, one often encounters non-stationary environments, and in these scenarios RL algorithms yield sub-optimal decisions. The first part of this thesis develops algorithmic solutions to the challenges of safety and non-stationary environmental conditions. To handle safety restrictions and facilitate safe exploration during learning, the thesis proposes a sample-efficient learning algorithm based on the cross-entropy method. The algorithm is developed within a constrained optimization framework and uses very limited information to learn feasible policies. During the learning iterations, exploration is guided so as to minimize safety violations. The first part also describes an algorithm for the second challenge, whose goal is to maximize the long-term discounted reward accrued when the latent model of the environment changes with time. To achieve this, the algorithm leverages a change point detection procedure to identify changes in the statistics of the environment.
The results from this statistical algorithm are used to reset the learning of policies...
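The constrained cross-entropy idea from the abstract can be illustrated with a minimal sketch. This is not the thesis's actual algorithm: the function `cem_constrained`, the 1-D toy objective, and all parameter values are illustrative assumptions. The sketch samples policy parameters from a Gaussian, keeps the best-scoring samples that satisfy the safety constraint, and refits the Gaussian to those elites, so exploration concentrates on the feasible region.

```python
import random

def cem_constrained(score, cost, cost_limit, iters=30, pop=50,
                    elite_frac=0.2, seed=0):
    """Cross-entropy search over a 1-D policy parameter: sample from a
    Gaussian, keep the best feasible samples (cost <= cost_limit),
    and refit the Gaussian to those elites."""
    rng = random.Random(seed)
    mu, sigma = 0.0, 2.0
    n_elite = max(2, int(pop * elite_frac))
    for _ in range(iters):
        samples = [rng.gauss(mu, sigma) for _ in range(pop)]
        # Prefer feasible samples; fall back to the lowest-cost ones if none.
        feasible = [x for x in samples if cost(x) <= cost_limit]
        pool = feasible if feasible else sorted(samples, key=cost)[:n_elite]
        elites = sorted(pool, key=score, reverse=True)[:n_elite]
        mu = sum(elites) / len(elites)
        sigma = max((sum((x - mu) ** 2 for x in elites)
                     / len(elites)) ** 0.5, 1e-3)
    return mu

# Toy problem: maximize -(x - 3)^2 subject to the safety constraint x <= 2.
best = cem_constrained(score=lambda x: -(x - 3) ** 2,
                       cost=lambda x: x, cost_limit=2.0)
```

Because elites are always drawn from the feasible set, the search converges toward the constrained optimum at the boundary x = 2 rather than the unconstrained optimum at x = 3, mirroring the abstract's point that exploration is steered to minimize safety violations.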
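The change-detection-and-reset idea can likewise be sketched in a few lines. This is a hedged illustration, not the thesis's statistical algorithm: the function `detect_change`, the sliding-window z-score test, and the threshold are all assumptions. The sketch flags a change point when the recent mean reward drifts far from the historical mean, at which point an agent would reset or relearn its policy.

```python
import random

def detect_change(rewards, window=20, threshold=2.0):
    """Flag a change point when the recent reward mean drifts away from
    the historical mean by more than `threshold` historical std units."""
    if len(rewards) < 2 * window:
        return False
    hist, recent = rewards[:-window], rewards[-window:]
    mu = sum(hist) / len(hist)
    var = sum((r - mu) ** 2 for r in hist) / len(hist)
    std = max(var ** 0.5, 1e-8)
    recent_mu = sum(recent) / len(recent)
    return abs(recent_mu - mu) / std > threshold

# Simulate an environment whose mean reward shifts from 0 to 5 mid-stream.
random.seed(0)
rewards = [random.gauss(0, 1) for _ in range(100)]
stationary_flag = detect_change(rewards)      # no shift yet
rewards += [random.gauss(5, 1) for _ in range(20)]
shifted_flag = detect_change(rewards)         # shift -> reset the policy
```

In an RL loop, a detection would trigger the reset of learned policies mentioned in the abstract, so that the agent relearns against the new latent environment model instead of acting on stale estimates.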
Pagination: xv, 125
URI: http://hdl.handle.net/10603/426487
Appears in Departments: Computer Science and Automation
Files in This Item:
File | Description | Size | Format
---|---|---|---
01_title.pdf | Attached File | 168.53 kB | Adobe PDF
02_prelim pages.pdf | | 211.51 kB | Adobe PDF
03_table of content.pdf | | 54.92 kB | Adobe PDF
04_abstract.pdf | | 73.33 kB | Adobe PDF
05_chapter 1.pdf | | 139.44 kB | Adobe PDF
06_chapter 2.pdf | | 737.54 kB | Adobe PDF
07_chapter 3.pdf | | 522.86 kB | Adobe PDF
08_chapter 4.pdf | | 483.26 kB | Adobe PDF
09_chapter 5.pdf | | 386.76 kB | Adobe PDF
10_chapter 6.pdf | | 3.11 MB | Adobe PDF
11_annexure.pdf | | 228.26 kB | Adobe PDF
80_recommendation.pdf | | 259.09 kB | Adobe PDF
Items in Shodhganga are licensed under Creative Commons Licence Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0).