

Curriculum-guided Hindsight Experience Replay
Meng Fang¹ Tianyi Zhou² Yali Du³ Lei Han¹ Zhengyou Zhang¹

¹Tencent AI Lab / Robotics X   ²University of Washington   ³University College London

Code (GitHub): mengf1/CHER

Paper (NeurIPS 2019): CHER

Overview

This work addresses the sparse-reward challenge in reinforcement learning (RL) and assumes that not all failed experiences are equally useful at different learning stages.

We adopt a human-like learning strategy that emphasizes curiosity in earlier stages and shifts toward goal proximity later:
1) adaptively select the failed experiences for replay according to their proximity to the true goals and the curiosity of exploring diverse pseudo goals;
2) gradually change the proportion of goal proximity and diversity-based curiosity in the selection criterion.

Our “Goal-and-Curiosity-driven Curriculum Learning” leads to “Curriculum-guided HER (CHER)”, which adaptively and dynamically controls the exploration-exploitation trade-off during learning via hindsight experience selection.
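The curriculum is realized by a single trade-off weight λ (see Methodology) that starts small, so diversity-driven curiosity dominates early, and grows during training, so goal proximity dominates later. A minimal sketch of one plausible schedule is shown below; the geometric-growth form and its constants are illustrative assumptions, not the exact schedule used in the paper.

```python
def curriculum_lambda(epoch: int, lam0: float = 0.1,
                      growth: float = 1.05, lam_max: float = 10.0) -> float:
    """Illustrative curriculum weight for F(A) = lam * F_prox(A) + F_div(A).

    Starts small so diversity (curiosity) dominates early exploration, then
    grows geometrically so proximity to the desired goal dominates later.
    The schedule form and constants are assumptions, not the paper's values.
    """
    return min(lam_max, lam0 * growth ** epoch)

# Example: the weight every 10 epochs during early training.
print([round(curriculum_lambda(e), 3) for e in range(0, 50, 10)])
```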

Robotics Environments

Environment panels: FetchReach (toy example), HandReach, HandManipulate Block, HandManipulate Egg, HandManipulate Pen.

There are the FetchReach environment and four Shadow Dexterous Hand environments: HandReach, Block manipulation, Egg manipulation, and Pen manipulation.
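These tasks correspond to the goal-based robotics environments in OpenAI Gym (which require MuJoCo). A minimal loading sketch is shown below; the environment IDs and version suffixes depend on the installed Gym release and should be treated as assumptions.

```python
import gym  # requires Gym's robotics environments and a working MuJoCo install

# Version suffixes (-v0/-v1) vary across Gym releases; adjust as needed.
ENV_IDS = [
    "FetchReach-v1",
    "HandReach-v0",
    "HandManipulateBlock-v0",
    "HandManipulateEgg-v0",
    "HandManipulatePen-v0",
]

for env_id in ENV_IDS:
    env = gym.make(env_id)
    obs = env.reset()  # older Gym API: reset() returns the observation dict directly
    # Goal-based envs expose observation, achieved_goal, and desired_goal.
    print(env_id, obs["achieved_goal"].shape, obs["desired_goal"].shape)
    env.close()
```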

Methodology

In contrast to uniform sampling, we propose to select a subset of achieved goals $A \subseteq B$ according to

$$\max_{A \subseteq B,\ |A| \le k} F(A) \triangleq \lambda F_{\mathrm{prox}}(A) + F_{\mathrm{div}}(A).$$

• Goal proximity: their proximity to the desired goals, $F_{\mathrm{prox}}(A) \triangleq \sum_{i \in A} \mathrm{sim}(g_i, g)$.

• Diversity-based curiosity: their diversity, which reflects the curiosity of the agent exploring different achieved goals in the environment, $F_{\mathrm{div}}(A) \triangleq \sum_{j \in B} \max_{i \in A} \mathrm{sim}(g_i, g_j)$.

• Utility score:

$$F(i \mid A) = \lambda\, \mathrm{sim}(g_i, g) + \sum_{j \in B} \max\Bigl(0,\ \mathrm{sim}(g_i, g_j) - \max_{l \in A} \mathrm{sim}(g_l, g_j)\Bigr).$$
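The utility score $F(i \mid A)$ is the marginal gain of adding an achieved goal $i$ to the currently selected set $A$, so the subset can be grown greedily, one goal at a time. Below is a minimal NumPy sketch of that greedy loop over a precomputed dense similarity matrix; the function names and the Gaussian-kernel similarity in the toy usage are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def greedy_select(sim_to_desired, sim_between, k, lam):
    """Greedily pick k achieved goals maximizing lam * F_prox(A) + F_div(A).

    sim_to_desired: (n,) array of sim(g_i, g) for each achieved goal i in B.
    sim_between:    (n, n) array of sim(g_i, g_j) for all pairs in B.
    Returns the indices of the selected subset A.
    """
    n = len(sim_to_desired)
    selected = []
    best_cover = np.zeros(n)  # best_cover[j] = max_{l in A} sim(g_l, g_j)
    for _ in range(k):
        # Marginal gain F(i|A) for every candidate i.
        gains = lam * sim_to_desired + np.maximum(0.0, sim_between - best_cover).sum(axis=1)
        gains[selected] = -np.inf  # never pick the same goal twice
        i = int(np.argmax(gains))
        selected.append(i)
        best_cover = np.maximum(best_cover, sim_between[i])
    return selected

# Toy usage with random goals and a Gaussian-style similarity kernel (an assumption).
rng = np.random.default_rng(0)
goals = rng.normal(size=(50, 3))    # achieved goals in B
desired = rng.normal(size=3)        # desired goal g
sim_between = np.exp(-np.linalg.norm(goals[:, None] - goals[None, :], axis=-1))
sim_to_desired = np.exp(-np.linalg.norm(goals - desired, axis=-1))
print(greedy_select(sim_to_desired, sim_between, k=5, lam=1.0))
```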

In practice: a kd-tree is used to build a sparse K-nearest-neighbor graph over the pseudo goals, and the subset is selected with a “lazier than lazy greedy” algorithm.
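A rough sketch of those two speedups is given below: scipy's cKDTree builds the sparse K-nearest-neighbor similarity graph, and a stochastic ("lazier than lazy") greedy step evaluates the marginal gain only on a random subsample of candidates per iteration. The similarity kernel, subsample size, and function names are assumptions for illustration, not the paper's implementation.

```python
import numpy as np
from scipy.spatial import cKDTree

def knn_similarity_graph(goals, K=10, bandwidth=1.0):
    """Sparse similarity graph: sim(g_i, g_j) kept only for the K nearest neighbors."""
    tree = cKDTree(goals)
    dists, nbrs = tree.query(goals, k=K + 1)  # the first neighbor is the point itself
    sims = np.exp(-dists[:, 1:] / bandwidth)  # assumed exponential kernel on distance
    return nbrs[:, 1:], sims                  # (n, K) neighbor indices and similarities

def lazier_greedy_select(gain_fn, n, k, sample_size=32, rng=None):
    """Stochastic ("lazier than lazy") greedy: each step scores only a random subsample.

    gain_fn(i, selected) should return the marginal gain F(i | A), e.g. computed
    from the sparse graph returned by knn_similarity_graph above.
    """
    rng = rng or np.random.default_rng()
    selected, remaining = [], np.arange(n)
    for _ in range(k):
        cand = rng.choice(remaining, size=min(sample_size, len(remaining)), replace=False)
        best = int(max(cand, key=lambda i: gain_fn(i, selected)))
        selected.append(best)
        remaining = remaining[remaining != best]
    return selected
```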

Experiments

Baselines: DDPG, DDPG+HER (uniform sampling), DDPG+HEREBP (energy-based prioritization).

Toy example – FetchReach

- In earlier episodes, the red points (selected achieved goals) compose a diverse and representative subset of the gray points (all achieved goals), but some are not close to any green point (desired goals), since CHER prefers diversity over proximity.
- In later episodes, most red points are close to some green points because proximity carries a large weight in the selection criterion, but some regions densely covered by gray points contain no red point, since CHER now prefers proximity over diversity.

Benchmark results – Hand environments

CHER learns much faster than the other RL methods.

Conclusion

• CHER is the first work that adaptively selects failed experiences for replay according to their compatibility with and usefulness to different learning stages of deep RL.
• Large diversity is beneficial for earlier exploration, while large proximity to the desired goals is essential for effective exploitation in later stages.
• The sample efficiency and learning speed of off-policy RL algorithms can be significantly improved by CHER.
• CHER performs better than other HER-based approaches.
• CHER makes no assumptions about tasks and environments, and can potentially be generalized to other, more complicated tasks, environments, and settings.

References

[Lillicrap et al., 2015] Timothy P. Lillicrap, Jonathan J. Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971.
[Andrychowicz et al., 2017] Marcin Andrychowicz, Filip Wolski, Alex Ray, Jonas Schneider, Rachel Fong, Peter Welinder, Bob McGrew, Josh Tobin, Pieter Abbeel, and Wojciech Zaremba. Hindsight experience replay. In Advances in Neural Information Processing Systems.
[Zhao and Tresp, 2018] Rui Zhao and Volker Tresp. Energy-based hindsight experience prioritization. In Conference on Robot Learning.
[Zhou and Bilmes, 2018] Tianyi Zhou and Jeff Bilmes. Minimax curriculum learning: Machine teaching with desirable difficulties and scheduled diversity. In International Conference on Learning Representations.