Implementing a CookieRun AI That Plays Better Than Me with Deep Learning and Reinforcement Learning (DEVIEW 2016)

Taehoon Kim


TRANSCRIPT

  • AI

  • EMNLP, DataCom, IJBDI

    http://carpedm20.github.io

  • Deep learning + reinforcement learning

  • Playing Atari with Deep Reinforcement Learning (2013)

    https://deepmind.com/research/dqn/

  • Mastering the game of Go with deep neural networks and tree search (2016)

    https://deepmind.com/research/alphago/

  • ViZDoom (2016)

    http://vizdoom.cs.put.edu.pl/

  • Deep learning + reinforcement learning

  • What is reinforcement learning? (Reinforcement Learning)

  • Machine Learning

  • Supervised learning: Classification, …

  • Unsupervised learning: Clustering, …

  • Reinforcement learning (Reinforcement Learning): learning from trial and error with rewards

  • Agent

  • Environment

    Agent

  • Environment

    Agent

    State s_t

  • Environment

    Agent

    State s_t

    0 0 0 0 0 0 0 0

    0 0 0 0 0 0 0 0

    0 0 0 0 0 0 0 0

    0 1 1 3 3 0 0 0

    0 1 1 3 3 0 0 0

    0 1 1 0 0 0 0 0

    -1 -1 -1 -1 -1 -1 -1 -1

  • Environment

    Agent

    State s_t, Action a_t = 2

  • Environment

    Agent

    Action a_t = 2

    State s_t

  • Environment

    Agent

    Action a_t = 2, State s_t, Reward r_t = 1


  • Environment

    Agent

    State s_t

  • Environment

    Agent

    Action a_t = 0, State s_t

  • Environment

    Agent

    Action a_t = 0, State s_t, Reward r_t = 1

  • The Agent chooses an action

    the action changes the state

    and the environment returns a reward

    Goal: choose the actions that maximize the total reward

  • Environment

    Agent

    Action a_t, State s_t, Reward r_t

    Reinforcement Learning (a minimal sketch of this loop follows below)
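  • A minimal sketch of the loop above, in Python. The environment object and its reset()/step(action) API returning (state, reward, done) are assumed stand-ins, not the talk's actual code:

    import random

    class RandomAgent:
        def __init__(self, num_actions):
            self.num_actions = num_actions

        def act(self, state):
            # a trained agent would pick the action with the highest Q(state, action)
            return random.randrange(self.num_actions)

    def run_episode(env, agent):
        state = env.reset()                          # initial state s_0 (e.g. the grid above)
        total_reward, done = 0.0, False
        while not done:
            action = agent.act(state)                # agent chooses a_t
            state, reward, done = env.step(action)   # environment returns the next state and r_t
            total_reward += reward
        return total_reward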


  • DL+RL ?

  • AlphaRun

    AlphaRun


  • [Slide: counting the configurations to test: 30 … (8 …), 9 … (2 …), 7 …, 4 … = 5,040 …]

  • [Slide: 4 …, 1 … 6 … = 14 …]

  • AlphaRun

    AlphaRun

  • A.I.

  • The 8 techniques used for the AI

    1. Deep Q-Network (2013)

    2. Double Q-Learning (2015)

    3. Dueling Network (2015)

    4. Prioritized Experience Replay (2015)

    5. Model-free Episodic Control (2016)

    6. Asynchronous Advantage Actor-Critic method (2016)

    7. Human Checkpoint Replay (2016)

    8. Gradient Descent with Restart (2016)


  • 1. Deep Q-Network

  • State s_t

    0 0 0 0 0 0 0 0

    0 0 0 0 0 0 0 0

    0 0 0 0 0 0 0 0

    0 1 1 3 3 0 -1 -1

    0 1 1 3 3 -1 -1 -1

    0 1 1 0 0 -1 -1 -1

    -1 -1 -1 -1 -1 -1 -1 -1

  • Action a_t = ?

    0 0 0 0 0 0 0 0

    0 0 0 0 0 0 0 0

    0 0 0 0 0 0 0 0

    0 1 1 3 3 0 -1 -1

    0 1 1 3 3 -1 -1 -1

    0 1 1 0 0 -1 -1 -1

    -1 -1 -1 -1 -1 -1 -1 -1

    Action a_t: do nothing (None = 0), jump, or slide?

  • Action a_t = ?

    0 0 0 0 0 0 0 0

    0 0 0 0 0 0 0 0

    0 0 0 0 0 0 0 0

    0 1 1 3 3 0 -1 -1

    0 1 1 3 3 -1 -1 -1

    0 1 1 0 0 -1 -1 -1

    -1 -1 -1 -1 -1 -1 -1 -1


  • Action a_t = ?

    0 0 0 0 0 0 0 0

    0 0 0 0 0 0 0 0

    0 0 0 0 0 0 0 0

    0 1 1 3 3 0 -1 -1

    0 1 1 3 3 -1 -1 -1

    0 1 1 0 0 -1 -1 -1

    -1 -1 -1 -1 -1 -1 -1 -1

    → learn a Q function to score each action

  • Q function: Q(s, a) is the value of taking Action a in State s

  • In state s, compare Q(s, nothing), Q(s, jump), Q(s, slide)

  • In state s: Q(s, ·) = 5, Q(s, ·) = 0, Q(s, ·) = 10

    → take the action with the highest Q value (10)

  • Q: the total reward the agent can expect to collect from here on

  • s = the current game screen, encoded as the grid

  • How do we learn Q?

  • +1 +1 +5 + …

    +1 +1 + …

    +0 +1 + …

    Q = the sum of the rewards that follow

  • -1 -1 -1 + …

    +1 +1 + …

    -1 +1 -1 + …

    Q = the sum of the rewards that follow

  • Q(s, a) ← r + γ max_a' Q(s', a')   (a numeric sketch of this update follows below)
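  • A toy illustration of that update, assuming a tabular Q and made-up sizes and numbers (not values from the talk):

    import numpy as np

    num_states, num_actions = 5, 3
    Q = np.zeros((num_states, num_actions))
    gamma = 0.99   # discount factor
    alpha = 0.1    # learning rate

    def q_learning_update(s, a, r, s_next):
        target = r + gamma * np.max(Q[s_next])   # r + gamma * max_a' Q(s', a')
        Q[s, a] += alpha * (target - Q[s, a])    # move Q(s, a) toward the target

    q_learning_update(s=0, a=2, r=1.0, s_next=1)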

  • Any problems?

  • Problem #1: the Q values come out overestimated …

  • 2. Double Q-Learning

  • Are the learned Q values too high, or too low?

  • Q(s, a) = r + γ max_a' Q(s', a')

  • Suppose the true value of every Q(s', a') is 0.

  • The estimates are noisy: some come out a little above 0 and some a little below, so max_a' Q(s', a') is almost always above 0.

  • The target r + γ max_a' Q(s', a') is therefore biased upward, and the Q values get overestimated.

  • DQN target:

    y = r + γ max_a' Q(s', a'; θ) = r + γ Q(s', argmax_a' Q(s', a'; θ); θ)

    Double DQN target:

    y = r + γ Q(s', argmax_a' Q(s', a'; θ); θ⁻)

    The online network θ selects the action and the target network θ⁻ evaluates it (a numeric sketch follows below).
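  • The difference between the two targets, sketched with plain numpy arrays standing in for the online network θ and the target network θ⁻ (toy numbers, not values from the talk):

    import numpy as np

    gamma, r = 0.99, 1.0
    q_online_next = np.array([1.2, 3.4, 0.5])   # Q(s', ., θ)  from the online network
    q_target_next = np.array([1.0, 2.8, 0.7])   # Q(s', ., θ⁻) from the target network

    # DQN: one network both selects and evaluates the best next action
    dqn_target = r + gamma * q_online_next.max()

    # Double DQN: the online network selects the action, the target network evaluates it
    best_action = q_online_next.argmax()
    double_dqn_target = r + gamma * q_target_next[best_action]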

  • With Double Q the Q values come out smaller and closer to the truth

    (left: Deep Q-learning, right: Double Q-learning)

  • (left: Deep Q-learning, right: Double Q-learning)

    Double Q gives more accurate Q values!

    The score reached 60 …!

  • Can we do better?

  • Problem #2: … land1 … land3 …

  • 3. Dueling Network

  • How good is this state?

    0 0 0 0 0 0 0 0

    0 0 0 0 0 0 0 0

    0 0 0 0 0 0 0 0

    0 1 1 3 3 0 0 0

    0 1 1 3 3 0 0 0

    0 1 1 0 0 0 0 0

    -1 -1 -1 -1 -1 -1 -1 -1

  • 0 0 0 0 0 0 0 0

    0 0 0 0 0 0 0 0

    0 0 0 0 0 0 0 0

    0 1 1 3 3 0 0 0

    0 1 1 3 3 0 0 0

    0 1 1 0 0 0 0 0

    -1 -1 -1 -1 -1 -1 -1 -1

    Jellies ahead → +1 +1 +1

    0 0 0 0 0 0

    0 0 0 0 0 0

    0 0 0 0 0 0

    3 3 3 3 0 0

    3 3 3 3 0 0

    0 0 0 0 0 0

    -1 -1 -1 -1 -1 -1

  • 0 0 0 0 0 0 0 0

    0 0 0 0 0 0 0 0

    0 0 0 0 0 0 0 0

    0 1 1 3 3 0 0 0

    0 1 1 3 3 0 0 0

    0 1 1 0 0 0 0 0

    -1 -1 -1 -1 -1 -1 -1 -1

    Jellies then obstacles ahead → +1 +1 +1 +0 -1 -1 …

    0 0 0 0 0 0

    0 0 0 -1 -1 0

    0 0 0 -1 -1 0

    -1 -1 0 -1 -1 0

    -1 -1 0 -1 -1 0

    -1 -1 0 -1 -1 0

    -1 -1 -1 -1 -1 -1

  • Q for one action: 10? 8?

    Q for another: -2? 1?

    Q for another: 5? 12?

    0 0 0 0 0 0 0 0

    0 0 0 0 0 0 0 0

    0 0 0 0 0 0 0 0

    0 1 1 3 3 0 0 0

    0 1 1 3 3 0 0 0

    0 1 1 0 0 0 0 0

    -1 -1 -1 -1 -1 -1 -1 -1

    ?

  • Hard to pin down exact numbers!

    0 0 0 0 0 0 0 0

    0 0 0 0 0 0 0 0

    0 0 0 0 0 0 0 0

    0 1 1 3 3 0 0 0

    0 1 1 3 3 0 0 0

    0 1 1 0 0 0 0 0

    -1 -1 -1 -1 -1 -1 -1 -1

  • What about the differences between the actions instead?

    0 0 0 0 0 0 0 0

    0 0 0 0 0 0 0 0

    0 0 0 0 0 0 0 0

    0 1 1 3 3 0 0 0

    0 1 1 3 3 0 0 0

    0 1 1 0 0 0 0 0

    -1 -1 -1 -1 -1 -1 -1 -1

  • one action: 0 (baseline)   another: +1? +3?   another: +1? +2?

    0 0 0 0 0 0 0 0

    0 0 0 0 0 0 0 0

    0 0 0 0 0 0 0 0

    0 1 1 3 3 0 0 0

    0 1 1 3 3 0 0 0

    0 1 1 0 0 0 0 0

    -1 -1 -1 -1 -1 -1 -1 -1

  • Relative to doing nothing, 0 (baseline):

    +1? +3?

    -1? -2?

    versus absolute Q values: 10? 20?  0? 1?  14? 32?

    Which is easier to estimate?

  • The relative differences (0, +1? +3?, -1? -2?) are much easier to estimate than the absolute Q values.

  • Q(s, a) = V(s) + A(s, a)

  • Q(s, a) = V(s) + A(s, a): V(s) is the Value of the state

  • Q(s, a) = V(s) + A(s, a): Q splits into a Value term and an Advantage term

  • Q(s, a)

    Q(s, b)

    Q(s, c)

    Deep Q-Network: one Q output per action

  • V(s)

    A(s, a)

    A(s, b)

    A(s, c)

    Dueling Network: a Value stream plus an Advantage stream

  • How do we put V and A back together into Q?

    Dueling Network

  • Sum:     Q(s, a; θ, α, β) = V(s; θ, β) + A(s, a; θ, α)

    Max:     Q(s, a; θ, α, β) = V(s; θ, β) + (A(s, a; θ, α) - max_a' A(s, a'; θ, α))

    Average: Q(s, a; θ, α, β) = V(s; θ, β) + (A(s, a; θ, α) - (1/|A|) Σ_a' A(s, a'; θ, α))

    (a small numeric sketch of the three rules follows below)
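  • The three aggregation rules written out for a single state, with a made-up scalar V and advantage vector A (toy numbers, not from the talk):

    import numpy as np

    V = 4.0
    A = np.array([1.0, -0.5, 2.5])    # A(s, a) for each action

    q_sum     = V + A                 # Sum
    q_max     = V + (A - A.max())     # Max:     subtract the largest advantage
    q_average = V + (A - A.mean())    # Average: subtract the mean advantage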

  • 3. Dueling network

    (plot legend: DQN, Sum, Max)

    60 …

  • 3. Dueling network

    (plot legend: DQN, Sum, Max)

    the score went from 60 … to 100 …!!!

  • The 8 techniques used for the AI

    1. Deep Q-Network (2013)

    2. Double Q-Learning (2015)

    3. Dueling Network (2015)

    4. Prioritized Experience Replay (2015)

    5. Model-free Episodic Control (2016)

    6. Asynchronous Advantage Actor-Critic method (2016)

    7. Human Checkpoint Replay (2016)

    8. Gradient Descent with Restart (2016)


  • Is that it? Were the 8 techniques enough..?

  • 1. Hyperparameter tuning

    2. Debugging

    3. Pretrained model

    4. Ensemble method

  • 1. Hyperparameter tuning

  • From the optimizer to the reward, there is a lot to tune

  • 70+ hyperparameters

  • Network

    Training method

    Experience memory

    Environment

    Network

    1. activation_fn : activation function (relu, tanh, elu, leaky)
    2. initializer : weight initializer (xavier)
    3. regularizer : weight regularizer (None, l1, l2)
    4. apply_reg : layers where regularizer is applied
    5. regularizer_scale : scale of regularizer
    6. hidden_dims : dimension of network
    7. kernel_size : kernel size of CNN network
    8. stride_size : stride size of CNN network
    9. dueling : whether to use dueling network
    10. double_q : whether to use double Q-learning
    11. use_batch_norm : whether to use batch normalization

    Training method

    1. optimizer : type of optimizer (adam, adagrad, rmsprop)
    2. batch_norm : whether to use batch normalization
    3. max_step : maximum step to train
    4. target_q_update_step : # of steps to update target network
    5. learn_start_step : # of steps before training begins
    6. learning_rate_start : the maximum value of learning rate
    7. learning_rate_end : the minimum value of learning rate
    8. clip_grad : value for a max gradient
    9. clip_delta : value for a max delta
    10. clip_reward : value for a max reward
    11. learning_rate_restart : whether to use learning rate restart

    Experience memory

    1. history_length : the length of history
    2. memory_size : size of experience memory
    3. priority : whether to use prioritized experience memory
    4. preload_memory : whether to preload a saved memory
    5. preload_batch_size : batch size of preloaded memory
    6. preload_prob : probability of sampling from pre-mem
    7. alpha : alpha of prioritized experience memory
    8. beta_start : beta of prioritized experience memory
    9. beta_end_t : beta of prioritized experience memory

    Environment

    1. screen_height : # of rows for a grid state
    2. screen_height_to_use : actual # of rows for a state
    3. screen_width : # of columns for a grid state
    4. screen_width_to_use : actual # of columns for a state
    5. screen_expr : expression used to build a state
       c : cookie   p : platform   s : score of cells   p : plain jelly
       ex) (c+2*p)-s, i+h+m, [rj,sp], ((c+2*p)-s)/2., i+h+m, [rj,sp,hi]
    6. action_repeat : # of repeats of an action (frame skip)
       i : plain item   h : heal   m : magic   hi : height
       sp : speed   rj : remained jump

    (a sketch of exposing a few of these as command-line flags follows below)
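  • Not the talk's actual code: just a sketch of how a few of the flags above could be exposed as command-line arguments so experiments can be scripted (argparse is one option; the defaults here are made up):

    import argparse

    parser = argparse.ArgumentParser()
    parser.add_argument('--activation_fn', default='relu', choices=['relu', 'tanh', 'elu', 'leaky'])
    parser.add_argument('--optimizer', default='adam', choices=['adam', 'adagrad', 'rmsprop'])
    parser.add_argument('--double_q', action='store_true', help='use double Q-learning')
    parser.add_argument('--dueling', action='store_true', help='use a dueling network')
    parser.add_argument('--memory_size', type=int, default=1000000)
    parser.add_argument('--history_length', type=int, default=4)
    config = parser.parse_args()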

  • Too many combinations to try by hand: automate the experiments!

  • for land in range(3, 7):
        for cookie in cookies:
            options = []
            option = {
                'hidden_dims': ["[800, 400, 200]", "[1000, 800, 200]"],
                'double_q': [True, False],
                'start_land': land,
                'cookie': cookie,
            }
            options.append(option.copy())
            # `cookies` and `array_run` come from the author's own helper code
            array_run(options, 'double_q_test')


  • 2. Debugging

  • ...

  • History

  • Q-value

  • Dueling Network

    V(s)

    A(s, a)

  • Bug: Q values for some actions and states kept coming out as exactly 0

    TensorFlow's fully_connected applies an activation function by default,

    and the default is nn.relu (which clips negative outputs to 0)

    For the output layer, the activation function must be set to None!
    (see the sketch after the file path below)

  • tensorflow/contrib/layers/python/layers/layers.py
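  • The fix above, sketched against the TF 1.x contrib API that the file path refers to: fully_connected() defaults activation_fn to tf.nn.relu, so the final Q layer needs activation_fn=None. The layer size and the q_head name here are made up:

    import tensorflow as tf

    def q_head(hidden, num_actions):
        # hidden layer: relu is fine here (it is also the default)
        h = tf.contrib.layers.fully_connected(hidden, 256, activation_fn=tf.nn.relu)
        # output layer: must stay linear, otherwise negative Q values are clipped to 0
        q = tf.contrib.layers.fully_connected(h, num_actions, activation_fn=None)
        return q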

  • Inspect the State and the Q values for concrete cases

    ex)

    Is Exploration actually happening?

    Is the reward being given where you expect?

  • 3. Pretrained model

  • Training every experiment from scratch takes a long time.

  • Instead of starting from random weights, reuse the weights of an already-trained model (a restore sketch follows below).

  • Max: 3,586,503   vs   Max: 3,199,324
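  • A minimal sketch of starting from saved weights with tf.train.Saver. The Network class is the same placeholder used in the snippets below, and the checkpoint path is hypothetical:

    import tensorflow as tf

    def load_pretrained(checkpoint_path):
        g = tf.Graph()
        with g.as_default():
            network = Network()        # same architecture as the pretrained run
            saver = tf.train.Saver()   # handles all variables in this graph
        sess = tf.Session(graph=g)
        saver.restore(sess, checkpoint_path)   # load weights instead of random init
        return sess, network

    # e.g. sess, network = load_pretrained('checkpoints/pretrained_model.ckpt')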

  • 4. Ensemble methods

  • Experiment #1

    Experiment #2

    Experiment #3

    Experiment #4

  • Every experiment ends with different weights. Which ones should we use?

  • Ex) models trained to different weights make different mistakes, so run several of them together and combine their decisions.

  • A TensorFlow Session executes a Graph

  • By default every network is built into the same default Graph, so building several models puts all their nodes into one Graph and the node names collide

  • To load several trained models at once, build each network in its own Graph and attach a separate Session to each Graph

  • with tf.Session() as sess:
        network = Network()
        sess.run(network.output)

  • sessions = []

    g = tf.Graph()
    with g.as_default():
        network = Network()
        sess = tf.Session(graph=g)
        sessions.append(sess)

    sessions[0].run(network.output)

    one Session per Graph (an ensemble sketch built on this follows below)
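  • The slides stop at loading several models into separate Graphs and Sessions. One plausible way to combine them, which is an assumption rather than the talk's stated method, is to average the Q values the models predict and act greedily on the mean. Network, its inputs placeholder and its q output tensor are assumed names:

    import numpy as np
    import tensorflow as tf

    def build_ensemble(checkpoint_paths):
        models = []
        for path in checkpoint_paths:
            g = tf.Graph()
            with g.as_default():
                network = Network()
                saver = tf.train.Saver()
                sess = tf.Session(graph=g)
                saver.restore(sess, path)    # each model gets its own Graph + Session
            models.append((sess, network))
        return models

    def ensemble_act(models, state):
        # average the per-model Q predictions and pick the best action
        q_values = [sess.run(net.q, {net.inputs: [state]})[0] for sess, net in models]
        return int(np.argmax(np.mean(q_values, axis=0)))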


  • A.I. 360

  • A.I.
