
Reinforcement Learning

-DDPG (Deep Deterministic Policy Gradient)

2018.11.19


Policy Gradient


History

• Value function approach

• The approach used in traditional reinforcement learning

• In each state, selects the action that maximizes the value function

• When the value function changes, a different action is selected

• Useful for finding a deterministic policy, but it cannot represent a stochastic one

• Policy Search

• https://icml.cc/2015/tutorials/PolicySearch.pdf

• For high-dimensional problems, value functions are not effective and model-based learning is expensive.

• Measure the performance of randomized policies and look at how changes in the policy parameters affect it.

• Update the policy with a gradient-based algorithm (gradient ascent on the expected return), in the direction that maximizes the expected return.
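As a rough illustration of the last two bullets, the sketch below perturbs the parameters of a toy policy, measures how the return changes, and steps in the direction that increases it. The `expected_return` function is a made-up stand-in for rolling out a policy (it is not from the slides), and the central-difference estimator is just one simple way to measure the effect of parameter changes.

```python
def expected_return(theta):
    """Hypothetical stand-in for evaluating a policy; the return peaks at theta = (1, -2)."""
    return -((theta[0] - 1.0) ** 2) - (theta[1] + 2.0) ** 2


def fd_gradient(f, theta, eps=1e-4):
    """Estimate dJ/dtheta by perturbing one parameter at a time (central differences)."""
    grad = []
    for i in range(len(theta)):
        up = list(theta)
        down = list(theta)
        up[i] += eps
        down[i] -= eps
        grad.append((f(up) - f(down)) / (2.0 * eps))
    return grad


def policy_search(f, theta, lr=0.1, steps=200):
    """Gradient ascent: move the parameters toward higher expected return."""
    theta = list(theta)
    for _ in range(steps):
        grad = fd_gradient(f, theta)
        theta = [t + lr * g for t, g in zip(theta, grad)]
    return theta


theta_star = policy_search(expected_return, [0.0, 0.0])
```

In a real task each call to `expected_return` would be a noisy average over rollouts, which is why this estimator becomes expensive as the number of parameters grows.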


History

• Finite-difference (FD) gradient: approximates the gradient by regression, as in supervised learning

• Likelihood-ratio gradient: differentiates the performance measure J(θ) over trajectories with respect to θ and estimates it by Monte Carlo simulation

• Vanilla policy gradient algorithm: repeats the basic gradient step until convergence, updating the policy as its parameters change

• Natural policy gradient: a steepest-descent method (steepest with respect to the Fisher information metric rather than the Euclidean one)
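For reference, the likelihood-ratio gradient in the list above is usually written as follows. The notation (trajectory τ, return R(τ), policy π_θ, N sampled trajectories of length T) is the common textbook one and is assumed here rather than taken from the slides.

```latex
\begin{aligned}
J(\theta) &= \mathbb{E}_{\tau \sim p_\theta(\tau)}\left[ R(\tau) \right] \\
\nabla_\theta J(\theta) &= \mathbb{E}_{\tau \sim p_\theta(\tau)}\left[ \nabla_\theta \log p_\theta(\tau)\, R(\tau) \right] \\
&\approx \frac{1}{N} \sum_{i=1}^{N} \left( \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta\!\left(a_t^{(i)} \mid s_t^{(i)}\right) \right) R\!\left(\tau^{(i)}\right)
\end{aligned}
```

The environment dynamics drop out of the gradient of log p_θ(τ), so only the policy's log-probabilities need to be differentiated. A vanilla policy gradient step then updates θ ← θ + α ∇_θ J(θ) with a step size α.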


History

• Natural policy gradient
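The original slide content here did not survive extraction. As a hedged reminder, the natural policy gradient is commonly stated as below, where F(θ) is the Fisher information matrix of the policy and α is a step size; the symbols are assumptions and are not recovered from the slide.

```latex
\begin{aligned}
F(\theta) &= \mathbb{E}_{s,\, a \sim \pi_\theta}\left[ \nabla_\theta \log \pi_\theta(a \mid s)\, \nabla_\theta \log \pi_\theta(a \mid s)^{\top} \right] \\
\theta_{k+1} &= \theta_k + \alpha\, F(\theta_k)^{-1}\, \nabla_\theta J(\theta_k)
\end{aligned}
```

Premultiplying by the inverse Fisher matrix makes the step independent of how the policy happens to be parameterized.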


DDPG motivation

• Applies to continuous action spaces, not only discrete ones.

• A discrete one-dimensional action space has only three cases, such as +1/0/-1.

• Even in one dimension, precise control with a real-valued control variable requires a continuous action space.

• The problem gets much harder in high dimensions.

DDPG features

• Uses a neural-network actor and critic so that it works when both the state space and the action space are high-dimensional or continuous.

• Neural-network function approximators introduce training problems of their own, so DDPG uses the replay buffer from DQN. (Training samples should be i.i.d., independently and identically distributed.)

• Keeps the action-value network and the target network separate, and updates the target with a 'soft' target update.

• Also applies batch normalization, which prevents learning from degrading when values differ by orders of magnitude.

• Uses an Ornstein-Uhlenbeck process as the noise process.
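The Ornstein-Uhlenbeck noise mentioned in the last bullet can be sketched in a few lines. This is a minimal discretized version, assuming the parameter names `theta`, `sigma`, and `dt` and their default values, none of which appear on the slides; the class name `OUNoise` is also hypothetical.

```python
import math
import random


class OUNoise:
    """Discretized OU process: x += theta * (mu - x) * dt + sigma * sqrt(dt) * N(0, 1)."""

    def __init__(self, size, mu=0.0, theta=0.15, sigma=0.2, dt=1e-2, seed=None):
        self.size = size
        self.mu = mu          # long-run mean the noise is pulled back to
        self.theta = theta    # strength of the pull toward mu
        self.sigma = sigma    # scale of the random kicks
        self.dt = dt
        self.rng = random.Random(seed)
        self.reset()

    def reset(self):
        """Restart the process at its mean (typically once per episode)."""
        self.state = [self.mu] * self.size

    def sample(self):
        """Advance one step and return the new noise vector."""
        self.state = [
            x
            + self.theta * (self.mu - x) * self.dt
            + self.sigma * math.sqrt(self.dt) * self.rng.gauss(0.0, 1.0)
            for x in self.state
        ]
        return list(self.state)


noise = OUNoise(size=1, seed=0)
samples = [noise.sample()[0] for _ in range(5)]
```

Because each sample depends on the previous one, the noise is temporally correlated. In use, a sample would be added to the actor's output before the action is sent to the environment.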

DDPG

• Proposes a model-free, off-policy, actor-critic algorithm.

• Based on Deterministic Policy Gradient (DPG).

• Combines the actor-critic approach with the parts of DQN that worked well.

• Replay buffer: reduces the correlation between samples.

• Target Q network: keeps the targets stable during updates.
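The two DQN ingredients listed above can be sketched without any deep learning library. In this simplified illustration, network weights are represented as a flat list of floats, and the names `ReplayBuffer`, `soft_update`, and `tau` are assumptions rather than the presenter's code.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-size buffer; uniform sampling breaks up the correlation between consecutive steps."""

    def __init__(self, capacity, seed=None):
        self.memory = deque(maxlen=capacity)  # oldest transitions are dropped first
        self.rng = random.Random(seed)

    def add(self, state, action, reward, next_state, done):
        self.memory.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        """Return batch_size distinct transitions chosen uniformly at random."""
        return self.rng.sample(list(self.memory), batch_size)

    def __len__(self):
        return len(self.memory)


def soft_update(target_weights, online_weights, tau=0.001):
    """Move each target weight a small step (fraction tau) toward the online weight."""
    return [tau * w + (1.0 - tau) * t for t, w in zip(target_weights, online_weights)]


buffer = ReplayBuffer(capacity=3, seed=0)
for step in range(5):
    buffer.add(step, 0.0, 1.0, step + 1, False)
batch = buffer.sample(2)
new_target = soft_update([0.0, 0.0], [1.0, 2.0], tau=0.1)
```

With a small `tau` the target network trails the online network slowly, which is what keeps the regression targets stable between updates.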


Google Colab

• Google Colaboratory service overview

• Google Drive plus Jupyter Notebook, usable online

• Free GPU use for up to 12 hours

• GitHub integration is also supported

• https://colab.research.google.com/

• Machine specs (November 2018)

• Ubuntu 18.04.1 LTS

• Intel(R) Xeon(R) CPU @ 2.30GHz

• MemTotal: 13335212 kB

• overlay 359G 7.6G 333G 3% /

• Tesla K80

• Unix commands are available (prefix them with !)

• OS: !cat /etc/issue.net

• CPU: !cat /proc/cpuinfo

• Memory: !cat /proc/meminfo

• Disk: !df -h

• GPU: !nvidia-smi

• Working with directories

• !ls

• !pwd

• !mkdir test

• %cd test (each ! command runs in its own subshell, so !cd test does not change the notebook's working directory)

• How to enable the GPU runtime

• Menu - Runtime - Change runtime type

• Installing Anaconda

• gym Atari

# Anaconda
!wget https://repo.continuum.io/archive/Anaconda2-5.1.0-Linux-x86_64.sh
!bash Anaconda2-5.1.0-Linux-x86_64.sh -b -p ./anaconda

# Install
!pip install gym
!pip install gym[atari]

# Run
import gym
from IPython import display
import matplotlib.pyplot as plt
%matplotlib inline

env = gym.make('Breakout-v0')
env.reset()
for _ in range(100):
    # Draw the current frame inline, replacing the previous one
    plt.imshow(env.render(mode='rgb_array'))
    display.display(plt.gcf())
    display.clear_output(wait=True)
    # Take a random action
    action = env.action_space.sample()
    env.step(action)

• How to upload local files to the server

from google.colab import files
uploaded = files.upload()

• Other references

• https://zzsza.github.io/data/2018/08/30/google-colab/

• https://medium.com/lean-in-women-in-tech-india/google-colab-the-beginners-guide-5ad3b417dfa

DDPG using Gym in Colab

• https://colab.research.google.com/drive/1ld75VDjf1PUzuNxu40SQjrJzNh8H0NoQ#scrollTo=w17dfCUsfmGu

import gym

env = gym.make('MountainCarContinuous-v0')
# get size of state and action from environment
state_size = env.observation_space.shape[0]
action_size = env.action_space.shape[0]

# make DDPG agent (the DDPGAgent class is not shown on this slide)
agent = DDPGAgent(state_size, action_size)

global_step = 0
scores, episodes = [], []


• Results after 700 training episodes, and test

episode: 0 score: -8.371174799146806 max reward: 4.999999999609565 step: 999 epsilon: 0.9900596848432421

episode: 1 score: 90.5272650517126 max reward: 104.99984818275153 step: 1786 epsilon: 0.9822984568086969






…












• MountainCarContinuous-v0


• https://github.com/openai/gym/blob/master/gym/envs/classic_control/continuous_mountain_car.py

• The env.step() function

• input

• action

• output

• state = position, velocity

• reward

• done
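To make the input and outputs listed above concrete, here is a small pure-Python stand-in with the same `step()` shape. It is not the gym implementation: the class name `ToyMountainCar` is hypothetical, and the constants and reward rule are assumptions modeled on the linked source file, so check them against that file before relying on them.

```python
import math


class ToyMountainCar:
    """Simplified stand-in for MountainCarContinuous-v0 (not the gym implementation)."""

    def __init__(self):
        # All constants below are assumptions modeled on the linked source file.
        self.min_position, self.max_position = -1.2, 0.6
        self.max_speed = 0.07
        self.goal_position = 0.45
        self.power = 0.0015
        self.state = (-0.5, 0.0)

    def reset(self):
        """Return the initial state (fixed here so the example is reproducible)."""
        self.state = (-0.5, 0.0)
        return self.state

    def step(self, action):
        """Take one action and return (state, reward, done, info)."""
        position, velocity = self.state
        force = min(max(action[0], -1.0), 1.0)  # clip the action to [-1, 1]
        velocity += force * self.power - 0.0025 * math.cos(3 * position)
        velocity = min(max(velocity, -self.max_speed), self.max_speed)
        position = min(max(position + velocity, self.min_position), self.max_position)
        if position == self.min_position and velocity < 0:
            velocity = 0.0  # the car stops at the left wall
        done = position >= self.goal_position
        # Bonus for reaching the goal, minus a cost for large actions
        reward = (100.0 if done else 0.0) - 0.1 * action[0] ** 2
        self.state = (position, velocity)
        return self.state, reward, done, {}


env = ToyMountainCar()
state = env.reset()
state, reward, done, info = env.step([1.0])
```

The 2018-era gym `step()` returned the same four values, with the state as an array of position and velocity rather than a tuple.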


Collecting Real data

• [Futures trade executions] Dscbo1.FutureCurOnly

• http://money2.creontrade.com/e5/mboard/ptype_basic/HTS_Plus_Helper/DW_Basic_Read_Page.aspx?boardseq=285&seq=38&page=2&searchString=%EC%84%A0%EB%AC%BC&p=&v=&m=


• python MTS


RealData on Machine Trading System

• MySQL

• Stores historical data

• Backtesting

• Realtime analytics

• Processes data in real time

• Runs trading algorithms in real time

• Sends Telegram messages in real time
