Implementing a Cookie Run AI that plays better than me, with deep learning and reinforcement learning (DEVIEW 2016)
TRANSCRIPT
-
AI
-
EMNLP, DataCom, IJBDI
http://carpedm20.github.io
-
+
-
Playing Atari with Deep Reinforcement Learning (2013)
https://deepmind.com/research/dqn/
-
Mastering the game of Go with deep neural networks and tree search (2016)
https://deepmind.com/research/alphago/
-
ViZDoom: A Doom-based AI Research Platform for Visual Reinforcement Learning (2016)
http://vizdoom.cs.put.edu.pl/
-
+
-
+
-
+ Why Reinforcement Learning?
-
Machine Learning
-
Machine Learning
Supervised Learning?
Unsupervised Learning?
Reinforcement Learning?
-
Supervised Learning
ex) Classification
-
Unsupervised Learning
ex) Clustering
-
Reinforcement Learning
-
Supervised Learning
-
Unsupervised Learning
-
Reinforcement Learning
-
Agent
-
Environment
Agent
-
Environment
Agent
State s_t
-
Environment
Agent
State s_t
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 1 1 3 3 0 0 0
0 1 1 3 3 0 0 0
0 1 1 0 0 0 0 0
-1 -1 -1 -1 -1 -1 -1 -1
-
Environment
Agent
State s_t, Action a_t = 2
-
Environment
Agent
Action a_t = 2
State s_t
-
Environment
Agent
Action a_t = 2, State s_t, Reward r_t = 1
-
Environment
Agent
State s_t
-
Environment
Agent
Action a_t = 0, State s_t
-
Environment
Agent
Action a_t = 0, State s_t, Reward r_t = 1
-
At each step,
the Agent chooses an action,
the action changes the state,
and the environment returns a reward.
Goal: learn the actions that maximize the reward
-
Environment
Agent
Action a_t, State s_t, Reward r_t
Reinforcement Learning
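The agent-environment loop above (state, action, reward) can be sketched as a tiny Python toy. `ToyEnv` and `run_episode` are hypothetical names for illustration, not code from the talk:

```python
class ToyEnv:
    """A made-up stand-in for the game environment."""
    def __init__(self, length=10):
        self.length = length
        self.pos = 0

    def reset(self):
        self.pos = 0
        return self.pos                       # initial state s_t

    def step(self, action):
        # action 1 = run forward, 0 = do nothing
        if action == 1:
            self.pos += 1
        reward = 1 if action == 1 else 0      # e.g. a jelly collected
        done = self.pos >= self.length - 1    # end of the run
        return self.pos, reward, done         # s_{t+1}, r_t, episode over?

def run_episode(env, policy):
    state = env.reset()
    total_reward, done = 0, False
    while not done:
        action = policy(state)                  # agent: state -> action
        state, reward, done = env.step(action)  # environment: action -> s', r
        total_reward += reward                  # RL goal: maximize this sum
    return total_reward

score = run_episode(ToyEnv(), policy=lambda s: 1)  # always run forward
```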
-
-
DL + RL = ?
-
AlphaRun
-
+
-
?
-
?
-
30 ⋯ (8 ⋯)
9 ⋯ (2 ⋯)
7 ⋯
4 ⋯
⋯ = 5,040
-
4 ⋯
1 ⋯ 6
⋯ = 14 ⋯
-
AlphaRun
-
A.I.
-
8 techniques used for the A.I.
1. Deep Q-Network (2013)
2. Double Q-Learning (2015)
3. Dueling Network (2015)
4. Prioritized Experience Replay (2015)
5. Model-free Episodic Control (2016)
6. Asynchronous Advantage Actor-Critic (2016)
7. Human Checkpoint Replay (2016)
8. Gradient Descent with Restart (2016)
-
1. Deep Q-Network
-
State s_t
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 1 1 3 3 0 -1 -1
0 1 1 3 3 -1 -1 -1
0 1 1 0 0 -1 -1 -1
-1 -1 -1 -1 -1 -1 -1 -1
-
Action a_t = ?
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 1 1 3 3 0 -1 -1
0 1 1 3 3 -1 -1 -1
0 1 1 0 0 -1 -1 -1
-1 -1 -1 -1 -1 -1 -1 -1
Action a_t
None = 0
? ?
-
Action a_t = ?
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 1 1 3 3 0 -1 -1
0 1 1 3 3 -1 -1 -1
0 1 1 0 0 -1 -1 -1
-1 -1 -1 -1 -1 -1 -1 -1
-
Action a_t = ?
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 1 1 3 3 0 -1 -1
0 1 1 3 3 -1 -1 -1
0 1 1 0 0 -1 -1 -1
-1 -1 -1 -1 -1 -1 -1 -1
Q
-
Q-function: given a State s and an Action a,
Q(s, a) = the expected sum of future rewards
-
In state s, compare Q(s, a₁), Q(s, a₂),
and Q(s, a₃)
-
In state s: Q(s, a₁) = 5, Q(s, a₂) = 0, Q(s, a₃) = 10
→ choose the action with the largest Q
-
Q: the sum of the rewards the agent will receive in the future
-
-
+1 +1 +5 + ⋯
+1 +1 + ⋯
+0 +1 + ⋯
Q
-
-1 -1 -1 + ⋯
+1 +1 + ⋯
-1 +1 -1 + ⋯
Q
-
Q(s, a) = r + γ max_{a′} Q(s′, a′)
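A minimal tabular sketch of learning with this Bellman target. The γ, α values and the 4-state, 3-action table are illustrative, not from the talk:

```python
gamma, alpha = 0.9, 0.5                       # discount factor, learning rate
Q = {s: [0.0, 0.0, 0.0] for s in range(4)}    # tiny Q-table: Q[state][action]

def td_update(Q, s, a, r, s_next):
    # Bellman target: y = r + gamma * max_a' Q(s', a')
    target = r + gamma * max(Q[s_next])
    # move Q(s, a) a step of size alpha toward the target
    Q[s][a] += alpha * (target - Q[s][a])
    return Q[s][a]

# one observed transition: in state 0, action 2 earned reward 1, led to state 1
q_after = td_update(Q, s=0, a=2, r=1, s_next=1)
# act greedily on the updated table
greedy = max(range(3), key=lambda a: Q[0][a])
```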
-
?
-
Problem #1 ⋯
Q-values get overestimated
-
2. Double Q-Learning
-
overestimated or underestimated?
-
Q(s, a) = r + γ max_{a′} Q(s′, a′)
-
Q(s, a) = r + γ max_{a′} Q(s′, a′)
the max is taken over noisy estimates, so it keeps picking values that are too large
-
even when the true values are all 0,
max_{a′} Q(s′, a′) > 0 on average
Q(s, a) = r + γ max_{a′} Q(s′, a′)
-
and bootstrapping propagates the overestimation
Q(s, a) = r + γ max_{a′} Q(s′, a′)
-
DQN:
y = r + γ max_{a′} Q(s′, a′; θ⁻)
  = r + γ Q(s′, argmax_{a′} Q(s′, a′; θ⁻); θ⁻)
-
Double DQN:
y = r + γ Q(s′, argmax_{a′} Q(s′, a′; θ); θ⁻)
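The difference between the two targets, checked with made-up numbers for the two networks' estimates of Q(s′, ·):

```python
r, gamma = 1.0, 0.9

# two separate estimates of Q(s', .) over 3 actions (illustrative numbers)
q_online = [1.0, 5.0, 2.0]   # online network: used to SELECT the action
q_target = [1.5, 3.0, 6.0]   # target network: used to EVALUATE it

# DQN: one estimate both selects and evaluates -> a plain max, biased upward
y_dqn = r + gamma * max(q_target)

# Double DQN: select with the online net, evaluate with the target net
a_star = max(range(3), key=lambda a: q_online[a])
y_double = r + gamma * q_target[a_star]
```

Because the action the online net prefers is not the one the target net overrates, the Double DQN target comes out lower.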
-
Double Q-learning keeps Q-values from being overestimated
left: Deep Q-learning / right: Double Q-learning
-
left: Deep Q-learning / right: Double Q-learning
Double Q lowers the overestimated Q-values!
score: 60⋯!
-
?
-
-
Problem #2 ⋯
land 1 ⋯ land 3 ⋯
-
3. Dueling Network
-
?
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 1 1 3 3 0 0 0
0 1 1 3 3 0 0 0
0 1 1 0 0 0 0 0
-1 -1 -1 -1 -1 -1 -1 -1
-
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 1 1 3 3 0 0 0
0 1 1 3 3 0 0 0
0 1 1 0 0 0 0 0
-1 -1 -1 -1 -1 -1 -1 -1
+1 +1 +1
0 0 0 0 0 0
0 0 0 0 0 0
0 0 0 0 0 0
3 3 3 3 0 0
3 3 3 3 0 0
0 0 0 0 0 0
-1 -1 -1 -1 -1 -1
-
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 1 1 3 3 0 0 0
0 1 1 3 3 0 0 0
0 1 1 0 0 0 0 0
-1 -1 -1 -1 -1 -1 -1 -1
+1 +1 +1 ⋯ +0 -1 -1 + ⋯
0 0 0 0 0 0
0 0 0 -1 -1 0
0 0 0 -1 -1 0
-1 -1 0 -1 -1 0
-1 -1 0 -1 -1 0
-1 -1 0 -1 -1 0
-1 -1 -1 -1 -1 -1
-
⋯: 10? 8?
⋯: -2? 1?
⋯: 5? 12?
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 1 1 3 3 0 0 0
0 1 1 3 3 0 0 0
0 1 1 0 0 0 0 0
-1 -1 -1 -1 -1 -1 -1 -1
?
-
!
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 1 1 3 3 0 0 0
0 1 1 3 3 0 0 0
0 1 1 0 0 0 0 0
-1 -1 -1 -1 -1 -1 -1 -1
-
?
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 1 1 3 3 0 0 0
0 1 1 3 3 0 0 0
0 1 1 0 0 0 0 0
-1 -1 -1 -1 -1 -1 -1 -1
-
⋯: (?) ⋯: +1? +3? ⋯: +1? +2?
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 1 1 3 3 0 0 0
0 1 1 3 3 0 0 0
0 1 1 0 0 0 0 0
-1 -1 -1 -1 -1 -1 -1 -1
-
0 (⋯)
+1? +3?
-1? -2?
10? 20?
0? 1?
Q = 14? 32?
?
-
0 (⋯)
+1? +3?
-1? -2?
Q
-
Q(s,a)=V(s)+A(s,a)
-
Q(s, a) = V(s) + A(s, a)
V(s): Value
-
Q(s, a) = V(s) + A(s, a): decompose Q
into Advantage and Value
-
Q(s, a₁)
Q(s, a₂)
Q(s, a₃)
Deep Q-Network
-
V(s)
A(s, a₁)
A(s, a₂)
A(s, a₃)
Dueling Network
-
estimate V and A separately, then combine them into Q
Dueling Network
-
Q
-
Sum:     Q(s, a; θ, α, β) = V(s; θ, β) + A(s, a; θ, α)
Max:     Q(s, a; θ, α, β) = V(s; θ, β) + (A(s, a; θ, α) − max_{a′} A(s, a′; θ, α))
Average: Q(s, a; θ, α, β) = V(s; θ, β) + (A(s, a; θ, α) − (1/|A|) Σ_{a′} A(s, a′; θ, α))
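The three aggregation rules, checked on made-up V and A values (three actions, numbers purely illustrative):

```python
V = 10.0               # state value V(s)
A = [2.0, -1.0, -4.0]  # advantages A(s, a) for three actions

# Sum: V and A can shift against each other (identifiability problem)
q_sum = [V + a for a in A]

# Max: subtract the best advantage, so the best action's Q equals V
q_max = [V + (a - max(A)) for a in A]

# Average: subtract the mean advantage instead
mean_A = sum(A) / len(A)
q_avg = [V + (a - mean_A) for a in A]
```

All three variants rank the actions the same way; they differ only in how the split between V and A is pinned down.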
-
3. Dueling network
DQN / Sum / Max
60
-
3. Dueling network
DQN / Sum / Max
60 → 100!!!
-
8 techniques used for the A.I.
1. Deep Q-Network (2013)
2. Double Q-Learning (2015)
3. Dueling Network (2015)
4. Prioritized Experience Replay (2015)
5. Model-free Episodic Control (2016)
6. Asynchronous Advantage Actor-Critic (2016)
7. Human Checkpoint Replay (2016)
8. Gradient Descent with Restart (2016)
-
Is that it? Are the 8 techniques enough..?
-
1. Hyperparameter tuning
2. Debugging
3. Pretrained model
4. Ensemble method
-
1. Hyperparameter tuning
-
Optimizer reward
-
70+ hyperparameters
-
Network
Training method
Experience memory
Environment
-
1. activation_fn: activation function (relu, tanh, elu, leaky)
2. initializer: weight initializer (xavier)
3. regularizer: weight regularizer (None, l1, l2)
4. apply_reg: layers where regularizer is applied
5. regularizer_scale: scale of regularizer
6. hidden_dims: dimension of network
7. kernel_size: kernel size of CNN network
8. stride_size: stride size of CNN network
9. dueling: whether to use dueling network
10. double_q: whether to use double Q-learning
11. use_batch_norm: whether to use batch normalization
1. optimizer: type of optimizer (adam, adagrad, rmsprop)
2. batch_norm: whether to use batch normalization
3. max_step: maximum step to train
4. target_q_update_step: # of steps to update target network
5. learn_start_step: # of steps to begin a training
6. learning_rate_start: the maximum value of learning rate
7. learning_rate_end: the minimum value of learning rate
8. clip_grad: value for a max gradient
9. clip_delta: value for a max delta
10. clip_reward: value for a max reward
11. learning_rate_restart: whether to use learning rate restart
1. history_length: the length of history
2. memory_size: size of experience memory
3. priority: whether to use prioritized experience memory
4. preload_memory: whether to preload a saved memory
5. preload_batch_size: batch size of preloaded memory
6. preload_prob: probability of sampling from pre-mem
7. alpha: alpha of prioritized experience memory
8. beta_start: beta of prioritized experience memory
9. beta_end_t: beta of prioritized experience memory
1. screen_height: # of rows for a grid state
2. screen_height_to_use: actual # of rows for a state
3. screen_width: # of columns for a grid state
4. screen_width_to_use: actual # of columns for a state
5. screen_expr: expression that builds the state
   c: cookie, p: platform, s: score of cells, p: plain jelly
   i: plain item, h: heal, m: magic, hi: height
   sp: speed, rj: remained jump
   ex) (c+2*p)-s, i+h+m, [rj,sp], ((c+2*p)-s)/2., i+h+m, [rj,sp,hi]
6. action_repeat: # of repeats of an action (frame skip)
-
-
!
-
for land in range(3, 7):
    for cookie in cookies:
        options = []
        option = {
            'hidden_dims': ["[800, 400, 200]", "[1000, 800, 200]"],
            'double_q': [True, False],
            'start_land': land,
            'cookie': cookie,
        }
        options.append(option.copy())
        array_run(options, 'double_q_test')
-
-
2. Debugging
-
...
-
History
-
Q-value
-
DuelingNetwork
V(s)
A(s,a)
-
the Q-values of every action were 0
because the state became all 0:
TensorFlow's fully_connected applies an activation function
whose default is nn.relu
→ the output layer needs activation_fn=None!
-
tensorflow/contrib/layers/python/layers/layers.py
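The bug can be reproduced in miniature with plain Python (not the talk's TensorFlow code): a relu on the output layer clips every negative Q to 0, so states with negative values all look identical.

```python
def relu(x):
    # the default activation of tf.contrib.layers.fully_connected
    return max(0.0, x)

raw_q = [-2.0, -0.5, -3.0]         # true outputs of the last layer

bad_q = [relu(q) for q in raw_q]   # default activation_fn=nn.relu: all clipped to 0
good_q = raw_q                     # activation_fn=None: values pass through unchanged
```

With `bad_q` every action ties at 0, so the greedy policy is effectively random; with `good_q` the ordering of actions survives.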
-
check the State and the Q-values
ex) ⋯
is exploration working?
is the reward correct?
-
3. Pretrained model
-
-
initialize with the weights of an already-trained model,
instead of training the weights from scratch
-
Max: 3,586,503 vs Max: 3,199,324
-
4. Ensemble methods
-
Experiment #1
Experiment #2
Experiment #3
Experiment #4
-
combine the weights of several trained models?
how?
-
Ex) load each trained model's weights
and combine their predictions
-
Session
-
a Session executes a Graph
-
by default, every network is built into the same Graph
-
loading several models into one Graph
makes their nodes collide
-
so each model gets its own Graph
-
and each Graph gets its own Session
-
-
with tf.Session() as sess:
    network = Network()
    sess.run(network.output)
-
sessions = []
g = tf.Graph()
with g.as_default():
    network = Network()
    sess = tf.Session(graph=g)
    sessions.append(sess)
sessions[0].run(network.output)
one Session per Graph
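Once every model lives in its own Graph/Session, one simple way to ensemble them (an illustrative sketch, not necessarily the talk's exact method) is to average their Q predictions for the current state and act greedily on the average:

```python
# Q predictions for the same state from three separately trained models
# (hypothetical numbers; in practice each row comes from sessions[i].run(...))
model_qs = [
    [1.0, 3.0, 2.0],
    [0.5, 2.5, 3.5],
    [1.5, 3.5, 1.0],
]

n_actions = len(model_qs[0])
# average each action's Q across the models, then pick the argmax
avg_q = [sum(q[a] for q in model_qs) / len(model_qs) for a in range(n_actions)]
best_action = max(range(n_actions), key=lambda a: avg_q[a])
```

Averaging smooths out each individual model's estimation errors, which is the point of running several experiments and combining them.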
-
!
-
A.I. 360
-
A.I.
-