Approximate policy iteration: a survey and some new methods

Source: Journal of Control Theory and Applications
We consider the classical policy iteration method of dynamic programming (DP), where approximations and simulation are used to deal with the curse of dimensionality. We survey a number of issues: convergence and rate of convergence of approximate policy evaluation methods, singularity and susceptibility to simulation noise of policy evaluation, exploration issues, constrained and enhanced policy iteration, policy oscillation and chattering, and optimistic and distributed policy iteration. Our discussion of policy evaluation is couched in general terms and aims to unify the available methods in the light of recent research developments and to compare the two main policy evaluation approaches: projected equations and temporal differences (TD), and aggregation. In the context of these approaches, we survey two different types of simulation-based algorithms: matrix inversion methods, such as least-squares temporal difference (LSTD), and iterative methods, such as least-squares policy evaluation (LSPE) and TD(λ), and their scaled variants. We discuss a recent method, based on regression and regularization, which rectifies the unreliability of LSTD for nearly singular projected Bellman equations. An iterative version of this method belongs to the LSPE class of methods and provides the connecting link between LSTD and LSPE. Our discussion of policy improvement focuses on the role of policy oscillation and its effect on performance guarantees. We illustrate that policy evaluation, when done by the projected equation/TD approach, may lead to policy oscillation, but when done by aggregation it does not. This implies better error bounds and more regular performance for aggregation, at the expense of some loss of generality in cost function representation capability. Hard aggregation provides the connecting link between projected equation/TD-based and aggregation-based policy evaluation, and is characterized by favorable error bounds.
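To make the matrix inversion approach mentioned in the abstract more concrete, the following Python sketch may help. It is only an illustration under stated assumptions, not the paper's algorithm as published: it assumes a discounted problem with discount factor alpha, a linear cost approximation Phi r with one feature row per state, and a single simulated trajectory of a fixed policy. The function names, the identity weighting in the regression, and the default prior guess r_bar = 0 are hypothetical choices made here for illustration. The first function forms the sample-based matrix C and vector d of the projected Bellman equation C r = d, the second solves it directly (LSTD with λ = 0), and the third replaces the direct solve with a Tikhonov-regularized least-squares solve, which stays well defined when C is nearly singular.

```python
import numpy as np

def lstd_estimates(states, costs, phi, alpha):
    """Form sample estimates C, d of the projected equation C r = d.

    states: visited state indices, length k + 1
    costs:  transition costs, length k (costs[t] is for states[t] -> states[t+1])
    phi:    feature matrix with one row per state (assumed linear architecture)
    alpha:  discount factor in (0, 1)
    """
    s = phi.shape[1]
    C = np.zeros((s, s))
    d = np.zeros(s)
    for t in range(len(costs)):
        f, f_next = phi[states[t]], phi[states[t + 1]]
        C += np.outer(f, f - alpha * f_next)  # temporal-difference feature term
        d += f * costs[t]
    k = len(costs)
    return C / k, d / k

def lstd_solve(C, d):
    """Plain LSTD(0): solve C r = d directly (unreliable if C is nearly singular)."""
    return np.linalg.solve(C, d)

def regularized_solve(C, d, beta, r_bar=None):
    """Regularized regression: minimize ||C r - d||^2 + beta * ||r - r_bar||^2.

    beta and r_bar are illustrative assumptions; beta = 0 recovers the
    least-squares solution of C r = d.
    """
    s = C.shape[1]
    r_bar = np.zeros(s) if r_bar is None else r_bar
    return np.linalg.solve(C.T @ C + beta * np.eye(s), C.T @ d + beta * r_bar)
```

With beta = 0 and an invertible C, the regularized solve coincides with plain LSTD. A larger beta pulls the solution toward r_bar, trading some bias for reduced sensitivity to simulation noise.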