
Page 1: How AlphaGo Works

Presenter: Shane (Seungwhan) Moon

PhD student, Language Technologies Institute, School of Computer Science

Carnegie Mellon University

3/2/2016

Page 2: How AlphaGo Works

AlphaGo vs. European Champion (Fan Hui, 2-dan*)

October 5 – 9, 2015

<Official match>
- Time limit: 1 hour
- AlphaGo wins 5:0

* dan: rank

Page 3: How AlphaGo Works

AlphaGo vs. World Champion (Lee Sedol, 9-dan)

March 9 – 15, 2016

<Official match>
- Time limit: 2 hours

Venue: Four Seasons Hotel, Seoul

Image source: Josun Times, Jan 28th, 2015

Page 4: How AlphaGo Works

Lee Sedol

Photo sources: Maeil Economics, 2013/04; wiki

Page 5: How AlphaGo Works

Computer Go AI?

Page 6: How AlphaGo Works

Computer Go AI – Definition

s (state)

d = 1

s = [board represented as a matrix: all 0s, with a single 1 at the position of the newly placed stone]

(e.g., we can represent the board in a matrix-like form)

* The actual model uses other features in addition to the raw board positions
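To make this representation concrete, here is a minimal Python/NumPy sketch of a board encoded as a matrix (0 = empty, 1 = black, -1 = white). The helper names (empty_board, play) are purely illustrative, not AlphaGo's code, and legality rules such as captures and ko are omitted:

    import numpy as np

    EMPTY, BLACK, WHITE = 0, 1, -1

    def empty_board(size=19):
        """A Go board as a size x size matrix: 0 = empty, 1 = black, -1 = white."""
        return np.zeros((size, size), dtype=np.int8)

    def play(board, row, col, color):
        """Return the next state s' with a stone placed at (row, col).
        Captures, ko, and other legality checks are omitted in this sketch."""
        assert board[row, col] == EMPTY, "illegal move: point already occupied"
        nxt = board.copy()
        nxt[row, col] = color
        return nxt

    s0 = empty_board()            # the empty board
    s1 = play(s0, 3, 3, BLACK)    # d = 1: a single 1 appears in the matrix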

Page 7: How AlphaGo Works

Computer Go AI – Definition

s (state)

d = 1    d = 2

a (action)

Given s, pick the best a

Computer Go Artificial Intelligence:  s → a → s′

Page 8: How AlphaGo Works

Computer Go AI – An Implementation Idea?

d = 1    d = 2

How about simulating all possible board positions?

Page 9: How AlphaGo Works

Computer Go AI – An Implementation Idea?

d = 1    d = 2

d = 3

Page 10: How AlphaGo Works

Computer Go AI – An Implementation Idea?

d = 1    d = 2

d = 3

…    d = maxD

Process the simulation until the game ends, then report win / loss results

Page 11: How AlphaGo Works

Computer Go AI – An Implementation Idea?

d = 1    d = 2

d = 3

…    d = maxD

Process the simulation until the game ends, then report win / loss results

e.g., it wins 13 times if the next stone gets placed here

37,839 times

431,320 times

Choose the "next action / stone" that has the most win counts in the full-scale simulation
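As a rough sketch of this brute-force idea (not how AlphaGo actually searches), the loop below runs random playouts and keeps the first move with the most wins; legal_moves, play, and random_playout are assumed helper functions:

    import random
    from collections import defaultdict

    def most_winning_move(state, legal_moves, play, random_playout, n_playouts=100000):
        """For each simulated game, remember which first move was tried and whether
        the playout ended in a win; pick the move with the most win counts."""
        wins = defaultdict(int)
        for _ in range(n_playouts):
            a = random.choice(legal_moves(state))      # candidate next stone
            if random_playout(play(state, a)) > 0:     # +1 if we win the simulated game
                wins[a] += 1
        return max(wins, key=wins.get) if wins else random.choice(legal_moves(state))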

Page 12: How AlphaGo Works

This is NOT possible; the number of possible board configurations is said to exceed the number of atoms in the universe.

Page 13: How AlphaGo Works

Key: Reduce the Search Space

Page 14: How AlphaGo Works

Reducing Search Space

1. Reducing "action candidates" (Breadth Reduction)

d = 1    d = 2

d = 3

…    d = maxD

Win? Loss?

IF there is a model that can tell you that these moves are not common / probable (e.g., according to experts) …

Page 15: How AlphaGo Works

Reducing Search Space

1. Reducing "action candidates" (Breadth Reduction)

d = 1    d = 2

d = 3

…    d = maxD

Win? Loss?

… remove these from the search candidates in advance (breadth reduction)
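A minimal sketch of this breadth reduction, assuming we already have a model that assigns each legal move a probability (the p_a_given_s mapping and legal_moves list below are hypothetical helpers):

    def prune_candidates(p_a_given_s, legal_moves, top_k=10):
        """Breadth reduction: keep only the top_k moves the model considers most
        probable and drop the rest from the search tree in advance."""
        ranked = sorted(legal_moves, key=lambda a: p_a_given_s[a], reverse=True)
        return ranked[:top_k]

In AlphaGo itself the policy probabilities guide the tree search rather than hard-pruning moves, but the effect on the branching factor is the same idea.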

Page 16: How AlphaGo Works

Reducing Search Space

2. Position evaluation ahead of time (Depth Reduction)

d = 1    d = 2

d = 3

…    d = maxD

Win? Loss?

Instead of simulating until the maximum depth …

Page 17: How AlphaGo Works

Reducing Search Space

2. Position evaluation ahead of time (Depth Reduction)

d = 1    d = 2

d = 3

V = 1

V = 2

V = 10

IF there is a function that can measure V(s): "board evaluation of state s"
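A sketch of the corresponding depth reduction: a plain depth-limited negamax that calls V(s) at the cutoff instead of simulating to the end of the game. V, legal_moves, and play are assumed helpers, and V is assumed to score the position from the point of view of the player to move:

    def depth_limited_value(state, depth, V, legal_moves, play):
        """Depth reduction: stop at a fixed depth and let V(s) evaluate the board
        instead of playing out to d = maxD."""
        moves = legal_moves(state)
        if depth == 0 or not moves:
            return V(state)                       # "board evaluation of state s"
        # Best value achievable from s, negated because the opponent moves next
        return max(-depth_limited_value(play(state, a), depth - 1, V, legal_moves, play)
                   for a in moves)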

Page 18: How AlphaGo Works

Reducing Search Space

1. Reducing "action candidates" (Breadth Reduction)

2. Position evaluation ahead of time (Depth Reduction)

Page 19: How AlphaGo Works

1. Reducing "action candidates"

Learning: P(next action | current state) = P(a|s)

Page 20: How AlphaGo Works

1. Reducing "action candidates"

(1) Imitating expert moves (supervised learning)

Current State → Prediction Model → Next State

s1 → s2

s2 → s3

s3 → s4

Data: online Go experts (5~9 dan), 160K games, 30M board positions
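A small sketch of how such (current state, next action) training pairs could be built from recorded games, assuming each game is stored as a list of consecutive board matrices. This illustrates the data-preparation idea only, not DeepMind's actual pipeline:

    import numpy as np

    def to_training_pairs(game_states):
        """Turn consecutive board matrices [s1, s2, s3, ...] into (state, action)
        pairs; the played move is the point that went from empty to a stone."""
        pairs = []
        for s, s_next in zip(game_states, game_states[1:]):
            placed = np.argwhere((s == 0) & (s_next != 0))
            if len(placed) == 0:                  # pass moves are skipped in this sketch
                continue
            row, col = placed[0]
            pairs.append((s, int(row) * 19 + int(col)))   # action as one of 361 points
        return pairs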

Page 21: How AlphaGo Works

1. Reducing "action candidates"

(1) Imitating expert moves (supervised learning)

Prediction Model

Current Board → Next Board

Page 22: How AlphaGo Works

1. Reducing "action candidates"

(1) Imitating expert moves (supervised learning)

Prediction Model

Current Board → Next Action

There are 19 × 19 = 361 possible actions (with different probabilities)

Page 23: How AlphaGo Works

1. Reducing "action candidates"

(1) Imitating expert moves (supervised learning)

Prediction Model

[current board as a matrix: 1 = black stone, -1 = white stone, 0 = empty]

[next action as a matrix: a single 1 at the chosen point, 0 elsewhere]

f: s → a

Current Board → Next Action

Page 24: How AlphaGo Works

1. Reducing "action candidates"

(1) Imitating expert moves (supervised learning)

Prediction Model

[current board as a matrix: 1 = black stone, -1 = white stone, 0 = empty]

[next action as a matrix: a single 1 at the chosen point, 0 elsewhere]

g: s → p(a|s)

[p(a|s) as a matrix of move probabilities, e.g., 0.4, 0.2, 0.1 at the candidate points]

a = argmax_a p(a|s)

Current Board → Next Action
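In code, the slide's g: s → p(a|s) followed by an argmax looks roughly like this; policy_net is a hypothetical trained model that returns one probability per board point:

    import numpy as np

    def predict_next_action(policy_net, board):
        """g: s -> p(a|s), then a = argmax_a p(a|s)."""
        p = policy_net(board.reshape(1, 19, 19))   # hypothetical model, output shape (1, 361)
        a = int(np.argmax(p))                      # most probable point
        return divmod(a, 19)                       # (row, col) of the predicted next move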

Page 25: How AlphaGo Works

1. Reducing "action candidates"

(1) Imitating expert moves (supervised learning)

Prediction Model

[current board as a matrix: 1 = black stone, -1 = white stone, 0 = empty]

[next action as a matrix: a single 1 at the chosen point, 0 elsewhere]

g: s → p(a|s),  a = argmax_a p(a|s)

Current Board → Next Action

Page 26: How AlphaGo Works

1. Reducing "action candidates"

(1) Imitating expert moves (supervised learning)

Deep Learning (13-layer CNN)

[current board as a matrix: 1 = black stone, -1 = white stone, 0 = empty]

[next action as a matrix: a single 1 at the chosen point, 0 elsewhere]

g: s → p(a|s)

[p(a|s) as a matrix of move probabilities]

a = argmax_a p(a|s)

Current Board → Next Action

Page 27: How AlphaGo Works

Convolutional Neural Network (CNN)

A CNN is a powerful model for image-recognition tasks; it abstracts the input image through successive convolution layers.


Page 28: How AlphaGo Works

Convolutional Neural Network (CNN)

AlphaGo uses a CNN of a similar architecture to read the board position; the model learns "some" spatial invariance.
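The published policy network is a deep CNN (roughly 13 layers) over the 19×19 board. Below is a deliberately tiny PyTorch stand-in with the same input/output shape, just to make the idea concrete; the layer counts and channel sizes are illustrative assumptions, not the published architecture:

    import torch
    import torch.nn as nn

    class TinyPolicyNet(nn.Module):
        """Tiny stand-in for the policy CNN: convolutions over the board,
        then a softmax over the 361 possible moves."""
        def __init__(self, in_planes=1, channels=32):
            super().__init__()
            self.conv = nn.Sequential(
                nn.Conv2d(in_planes, channels, kernel_size=5, padding=2), nn.ReLU(),
                nn.Conv2d(channels, channels, kernel_size=3, padding=1), nn.ReLU(),
                nn.Conv2d(channels, 1, kernel_size=1),   # one score per intersection
            )

        def forward(self, boards):                       # boards: (N, in_planes, 19, 19)
            scores = self.conv(boards).flatten(1)        # (N, 361)
            return torch.softmax(scores, dim=1)          # p(a|s)

Because every layer is convolutional, the same filters are applied at every point on the board, which is one way the "spatial invariance" mentioned above shows up.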

Page 29: How AlphaGo Works

Go: abstraction is the key to winning

CNN: abstraction is its forte

Page 30: How AlphaGo Works

1. Reducing "action candidates"

(1) Imitating expert moves (supervised learning)

Expert Moves Imitator Model (w/ CNN)

Current Board → Next Action

Training:

Page 31: How AlphaGo Works

1. Reducing "action candidates"

(2) Improving through self-play (reinforcement learning)

Expert Moves Imitator Model (w/ CNN)

VS

Expert Moves Imitator Model (w/ CNN)

Improving by playing against itself

Page 32: How AlphaGo Works

1. Reducing "action candidates"

(2) Improving through self-play (reinforcement learning)

Expert Moves Imitator Model (w/ CNN)

VS

Expert Moves Imitator Model (w/ CNN)

Return: board positions, win/loss info

Page 33: How AlphaGo Works

1. Reducing "action candidates"

(2) Improving through self-play (reinforcement learning)

Expert Moves Imitator Model (w/ CNN)

Training:

Board position → win/loss

Loss: z = -1

Page 34: How AlphaGo Works

1. Reducing "action candidates"

(2) Improving through self-play (reinforcement learning)

Expert Moves Imitator Model (w/ CNN)

Training:

Board position → win/loss

Win: z = +1
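A minimal sketch of the reinforcement-learning update implied by these two slides, in the style of REINFORCE: moves from a won game (z = +1) are made more probable, moves from a lost game (z = -1) less probable. This is a simplified illustration, not the exact published update rule, and it assumes a policy_net like the sketch above:

    import torch

    def reinforce_update(policy_net, optimizer, boards, actions, z):
        """One policy-gradient step from a finished self-play game.
        boards:  (T, 1, 19, 19) float tensor of positions where the model moved
        actions: (T,) long tensor of the moves it chose (0..360)
        z:       +1.0 if the model won the game, -1.0 if it lost"""
        probs = policy_net(boards)                                   # (T, 361)
        log_p = torch.log(probs.gather(1, actions.unsqueeze(1)) + 1e-8)
        loss = -(z * log_p).mean()    # z = +1 reinforces chosen moves, z = -1 discourages them
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()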

Page 35: How AlphaGo Works

1. Reducing "action candidates"

(2) Improving through self-play (reinforcement learning)

Updated Model ver 1.1

VS

Updated Model ver 1.3

Return: board positions, win/loss info

The updated models use the same topology as the expert-moves-imitator model, just with updated parameters

Older models vs. newer models
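A sketch of the self-play schedule described on this and the following slides: the current model plays against randomly sampled older versions and is periodically snapshotted into the opponent pool. play_game and the update function are assumed helpers, and the snapshot interval is an arbitrary illustration:

    import copy
    import random

    def self_play_training(model, play_game, update_fn, n_games=10000, snapshot_every=500):
        """Improve the model by playing against earlier versions of itself."""
        pool = [copy.deepcopy(model)]                         # starts from the imitation model
        for i in range(1, n_games + 1):
            opponent = random.choice(pool)                    # older vs. newer models
            boards, actions, z = play_game(model, opponent)   # model's positions, moves, outcome
            update_fn(model, boards, actions, z)              # e.g., the REINFORCE step above
            if i % snapshot_every == 0:
                pool.append(copy.deepcopy(model))             # "ver 1.1", "ver 1.3", ...
        return model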

Page 36: How AlphaGo Works

1. Reducing "action candidates"

(2) Improving through self-play (reinforcement learning)

Updated Model ver 1.3

VS

Updated Model ver 1.7

Return: board positions, win/loss info

Page 37: How AlphaGo Works

1. Reducing "action candidates"

(2) Improving through self-play (reinforcement learning)

Updated Model ver 1.5

VS

Updated Model ver 2.0

Return: board positions, win/loss info

Page 38: How AlphaGo Works

1. Reducing "action candidates"

(2) Improving through self-play (reinforcement learning)

Updated Model ver 3204.1

VS

Updated Model ver 46235.2

Return: board positions, win/loss info

Page 39: How AlphaGo Works

1. Reducing "action candidates"

(2) Improving through self-play (reinforcement learning)

Expert Moves Imitator Model

VS

Updated Model ver 1,000,000

The final model wins 80% of the time when playing against the first model

Page 40: How AlphaGo Works

2. Board Evaluation

Page 41: How AlphaGo Works

2. Board Evaluation

Updated Model ver 1,000,000

Training:

Board Position → Win / Loss

Win (0~1)

Value Prediction Model (Regression)

Adds a regression layer to the model
Predicts values between 0 and 1
Close to 1: a good board position
Close to 0: a bad board position
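A tiny PyTorch sketch of such a value model: the same kind of convolutional trunk, ending in a single sigmoid output in (0, 1) that is regressed toward the game outcome. The sizes and the exact loss are illustrative assumptions, not the published network:

    import torch
    import torch.nn as nn

    class TinyValueNet(nn.Module):
        """Value network sketch: close to 1 = good board position, close to 0 = bad."""
        def __init__(self, in_planes=1, channels=32):
            super().__init__()
            self.conv = nn.Sequential(
                nn.Conv2d(in_planes, channels, kernel_size=3, padding=1), nn.ReLU(),
                nn.Conv2d(channels, channels, kernel_size=3, padding=1), nn.ReLU(),
            )
            self.head = nn.Sequential(
                nn.Flatten(),
                nn.Linear(channels * 19 * 19, 1),        # regression layer
                nn.Sigmoid(),                            # squashes the value into (0, 1)
            )

        def forward(self, boards):                       # boards: (N, in_planes, 19, 19)
            return self.head(self.conv(boards))          # (N, 1) predicted win value

    # Training sketch: targets are 1 for positions from won games, 0 for lost games
    # loss = nn.MSELoss()(value_net(boards), outcomes)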

Page 42: How AlphaGo Works

Reducing Search Space

1. Reducing "action candidates" (Breadth Reduction) → Policy Network

2. Board Evaluation (Depth Reduction) → Value Network

Page 43: How AlphaGo Works

Looking ahead (w/ Monte Carlo Tree Search)

Action Candidates Reduction (Policy Network)

Board Evaluation (Value Network)

(Rollout): a faster way of estimating p(a|s) → uses a shallow network (3 ms → 2 µs)
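A simplified sketch of how the pieces combine during the tree search, in the spirit of the selection and evaluation rules described in the paper: each tree edge keeps a visit count N, a mean value Q, and the policy network's prior P, and leaf positions are scored by mixing the value network's output with a fast rollout result. The node structure and the c_puct / mixing constants here are illustrative:

    import math

    def select_move(children, c_puct=5.0):
        """Pick the child maximizing Q(s,a) + c_puct * P(s,a) * sqrt(sum_b N(s,b)) / (1 + N(s,a)).
        `children` maps each move to a node carrying N (visits), Q (mean value), P (prior)."""
        total_visits = sum(child.N for child in children.values())

        def score(a):
            child = children[a]
            u = c_puct * child.P * math.sqrt(total_visits) / (1 + child.N)
            return child.Q + u                    # exploitation + prior-guided exploration

        return max(children, key=score)

    def evaluate_leaf(value_net_output, rollout_result, mixing=0.5):
        """Leaf evaluation: mix the value network's estimate with a fast rollout outcome."""
        return (1 - mixing) * value_net_output + mixing * rollout_result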

Page 44: How AlphaGo Works

Results (Elo rating system)

Performance with different combinations of AlphaGo's components

Page 45: How AlphaGo Works

Takeaways

Networks trained for one task (with different loss objectives) can be reused for several other tasks

Page 46: How AlphaGo Works

Lee Sedol 9-dan vs. AlphaGo

Page 47: How AlphaGo Works

Lee Sedol 9-dan vs. AlphaGo: Energy Consumption

Lee Sedol:
- Recommended calories for a man per day: ~2,500 kCal
- Assumption: Lee consumes an entire day's calories during this one game
- 2,500 kCal × 4,184 J/kCal ≈ 10 MJ

AlphaGo:
- Assumption: CPU ~100 W, GPU ~300 W
- 1,202 CPUs, 176 GPUs
- ~170,000 J/sec × 5 hr × 3,600 sec/hr ≈ 3,000 MJ

A very, very rough calculation ;)
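The slide's back-of-the-envelope numbers, reproduced as plain arithmetic (same rough assumptions as above):

    # Lee Sedol: one day's worth of calories, spent on the game
    lee_joules = 2_500 * 4_184                 # 2,500 kCal * 4,184 J/kCal ≈ 10 MJ

    # AlphaGo: 1,202 CPUs at ~100 W and 176 GPUs at ~300 W, over a 5-hour game
    power_watts = 1_202 * 100 + 176 * 300      # ≈ 170,000 J/sec
    alphago_joules = power_watts * 5 * 3_600   # ≈ 3,000 MJ

    print(f"Lee:     ~{lee_joules / 1e6:.0f} MJ")
    print(f"AlphaGo: ~{alphago_joules / 1e6:,.0f} MJ")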

Page 48: How AlphaGo Works

AlphaGo is estimated to be around ~5-dan (distributed version = multiple machines; cf. the European champion's level)

Page 49: How AlphaGo Works

Taking CPU / GPU resources to virtually infinity?

But Google has promised not to use more CPUs/GPUs for the game with Lee than they used against Fan Hui

No one knows how it will converge

Page 50: How AlphaGo Works

AlphaGo learns millions of Go games every day

AlphaGo will presumably converge to some point eventually.

However, the Nature paper does not report how AlphaGo's performance improves as a function of the number of self-play games.

Page 51: How AlphaGo Works

What if AlphaGo learns Lee's game strategy?

Google said they won't use Lee's game plays as AlphaGo's training data.

Even if they did, it would not be easy to shift a model trained over millions of data points with just a few games against Lee (prone to over-fitting, etc.).

Page 52: How AlphaGo Works

AlphaGo’s Weakness?

Page 53: How AlphaGo Works

AlphaGo – How It Works

Presenter: Shane (Seungwhan) Moon

PhD student, Language Technologies Institute, School of Computer Science

Carnegie Mellon University

[email protected]

3/2/2016

Page 54: How AlphaGo Works

Reference

• Silver, David, et al. "Mastering the game of Go with deep neural networks and tree search." Nature 529.7587 (2016): 484-489.