Please use this identifier to cite or link to this item: http://hdl.handle.net/10603/426487
Title: Algorithms for Challenges to Practical Reinforcement Learning
Researcher: Sindhu, P R
Guide(s): Bhatnagar, Shalabh
Keywords: Automation and Control Systems; Computer Science; Engineering and Technology
University: Indian Institute of Science Bangalore
Completed Date: 2020
Abstract: Reinforcement learning (RL) in real-world applications faces major hurdles, the foremost being the safety of the physical system controlled by the learning agent and the varying environmental conditions in which the autonomous agent operates. An RL agent learns to control a system by exploring the available actions. In some operating states, an exploratory action may drive the system into unsafe operation, creating safety hazards both for the system and for the humans supervising it. RL algorithms thus need to respect these safety constraints, and must do so with limited available information. Additionally, RL agents learn optimal decisions under the assumption of a stationary environment, which is very restrictive. In many real-world problems, such as traffic signal control and robotic applications, the environment is non-stationary, and in these scenarios RL algorithms yield sub-optimal decisions. The first part of this thesis develops algorithmic solutions to the challenges of safety and non-stationary environmental conditions. To handle safety restrictions and facilitate safe exploration during learning, the thesis proposes a sample-efficient learning algorithm based on the cross-entropy method. This algorithm is built on a constrained optimization framework and uses very limited information to learn feasible policies. During the learning iterations, exploration is guided in a manner that minimizes safety violations. The first part also describes an algorithm for the second challenge. Its goal is to maximize the long-term discounted reward accrued when the latent model of the environment changes with time. To achieve this, the algorithm uses a change point detection algorithm to find changes in the statistics of the environment. The results from this statistical algorithm are used to reset learning of policies...
Pagination: xv, 125
URI: http://hdl.handle.net/10603/426487
Appears in Departments: Computer Science and Automation

Files in This Item:
File                     Description    Size       Format
01_title.pdf             Attached File  168.53 kB  Adobe PDF
02_prelim pages.pdf                     211.51 kB  Adobe PDF
03_table of content.pdf                 54.92 kB   Adobe PDF
04_abstract.pdf                         73.33 kB   Adobe PDF
05_chapter 1.pdf                        139.44 kB  Adobe PDF
06_chapter 2.pdf                        737.54 kB  Adobe PDF
07_chapter 3.pdf                        522.86 kB  Adobe PDF
08_chapter 4.pdf                        483.26 kB  Adobe PDF
09_chapter 5.pdf                        386.76 kB  Adobe PDF
10_chapter 6.pdf                        3.11 MB    Adobe PDF
11_annexure.pdf                         228.26 kB  Adobe PDF
80_recommendation.pdf                   259.09 kB  Adobe PDF

Items in Shodhganga are licensed under Creative Commons Licence Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0).
