nyc ml meetup talk july2015 finalfiles.meetup.com/1516886/nyc ml meetup talk - ufora.pdf · braxton...

77
NYC Machine Learning MeetUp “A peek under the hood at Ufora” Braxton McKee, CEO [email protected]

Upload: others

Post on 31-Jul-2020

2 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: NYC ML Meetup Talk July2015 Finalfiles.meetup.com/1516886/NYC ML Meetup Talk - Ufora.pdf · Braxton McKee Founder, CEO Keith Kegley President EyalGoldwerger Exec. Chairman Ben Beinecke

NYC  Machine  Learning  MeetUp

“A  peek  under  the  hood  at  Ufora”

Braxton  McKee,  [email protected]

Page 2: NYC ML Meetup Talk July2015 Finalfiles.meetup.com/1516886/NYC ML Meetup Talk - Ufora.pdf · Braxton McKee Founder, CEO Keith Kegley President EyalGoldwerger Exec. Chairman Ben Beinecke

Braxton McKee Founder, CEO

Keith KegleyPresident

Eyal GoldwergerExec. Chairman

Ben BeineckeHead of Strategy

Ronen HilewiczVP Engineering

Tom PetersEngineer

Steven BenarioHead of Product

Jay MoolenaarHead of Sales

Alex TsannesEngineer

Ross GoodwinEngineer

Amichai LevyEngineer

Tony JebaraScientific Advisor & Head of ML Lab at

Columbia

Page 3: NYC ML Meetup Talk July2015 Finalfiles.meetup.com/1516886/NYC ML Meetup Talk - Ufora.pdf · Braxton McKee Founder, CEO Keith Kegley President EyalGoldwerger Exec. Chairman Ben Beinecke

Why  should  I  have  to  write  a  different  program  if  I  have  1000  rows  or  1  billion?

Page 4: NYC ML Meetup Talk July2015 Finalfiles.meetup.com/1516886/NYC ML Meetup Talk - Ufora.pdf · Braxton McKee Founder, CEO Keith Kegley President EyalGoldwerger Exec. Chairman Ben Beinecke

Advantage

Shortest  time-­‐to-­‐codeWrite in  single-­‐threaded  python,  R,  or  equivalent.

Disadvantage

Slow,  can’t  handle  scale.

Page 5: NYC ML Meetup Talk July2015 Finalfiles.meetup.com/1516886/NYC ML Meetup Talk - Ufora.pdf · Braxton McKee Founder, CEO Keith Kegley President EyalGoldwerger Exec. Chairman Ben Beinecke

Advantage

Shortest  time-­‐to-­‐code

Maximal  performance

Write in  single-­‐threaded  python,  R,  or  equivalent.

Code  parallel  version  by  hand

Disadvantage

Slow,  can’t  handle  scale.

Lots of  code,  hard  to  get  right  (race  conditions).

Page 6: NYC ML Meetup Talk July2015 Finalfiles.meetup.com/1516886/NYC ML Meetup Talk - Ufora.pdf · Braxton McKee Founder, CEO Keith Kegley President EyalGoldwerger Exec. Chairman Ben Beinecke

Advantage

Shortest  time-­‐to-­‐code

Maximal  performance

Only  have  to  implement  a  few  patterns.

Write in  single-­‐threaded  python,  R,  or  equivalent.

Code  parallel  version  by  hand

Code against  APIs  that  implement  particular  patterns.  (MapReduce)

Disadvantage

Slow,  can’t  handle  scale.

Lots of  code,  hard  to  get  right  (race  conditions).

Can be  hard  to  fit  a  problem   into  the  given  pattern,  or  it’s  really  inefficient.

Page 7: NYC ML Meetup Talk July2015 Finalfiles.meetup.com/1516886/NYC ML Meetup Talk - Ufora.pdf · Braxton McKee Founder, CEO Keith Kegley President EyalGoldwerger Exec. Chairman Ben Beinecke

Advantage

Shortest  time-­‐to-­‐code

Maximal  performance

Only  have  to  implement  a  few  patterns.

Users  can  implement   lots  of  algorithms very  quickly  without  a  lot  of  code.

Write in  single-­‐threaded  python,  R,  or  equivalent.

Code  parallel  version  by  hand

Code against  APIs  that  implement  particular  patterns.  (MapReduce)

Figure  out  how  to  automatically  scale  up  programs  written  in  python,   R,  etc.

Disadvantage

Slow,  can’t  handle  scale.

Lots of  code,  hard  to  get  right  (race  conditions).

Can be  hard  to  fit  a  problem   into  the  given  pattern,  or  it’s  really  inefficient.

Tough   to  implement.    Can  be hard  to  make  optimal  decisions  without  knowing  what  the  computation  will  do.

Page 8: NYC ML Meetup Talk July2015 Finalfiles.meetup.com/1516886/NYC ML Meetup Talk - Ufora.pdf · Braxton McKee Founder, CEO Keith Kegley President EyalGoldwerger Exec. Chairman Ben Beinecke

Advantage

Shortest  time-­‐to-­‐code

Maximal  performance

Only  have  to  implement  a  few  patterns.

Users  can  implement   lots  of  algorithms very  quickly  without  a  lot  of  code.

Write in  single-­‐threaded  python,  R,  or  equivalent.

Code  parallel  version  by  hand

Code against  APIs  that  implement  particular  patterns.  (MapReduce)

Figure  out  how  to  automatically  scale  up  programs  written  in  python,   R,  etc.

Disadvantage

Slow,  can’t  handle  scale.

Lots of  code,  hard  to  get  right  (race  conditions).

Can be  hard  to  fit  a  problem   into  the  given  pattern,  or  it’s  really  inefficient.

Tough   to  implement.    Can  be hard  to  make  optimal  decisions  without  knowing  what  the  computation  will  do.

Page 9: NYC ML Meetup Talk July2015 Finalfiles.meetup.com/1516886/NYC ML Meetup Talk - Ufora.pdf · Braxton McKee Founder, CEO Keith Kegley President EyalGoldwerger Exec. Chairman Ben Beinecke

Business  Vision:

Empower  data  scientists  to  run  any  computation  on  any  data  quickly  and  easily  

by  automating the  engineering.

Page 10: NYC ML Meetup Talk July2015 Finalfiles.meetup.com/1516886/NYC ML Meetup Talk - Ufora.pdf · Braxton McKee Founder, CEO Keith Kegley President EyalGoldwerger Exec. Chairman Ben Beinecke

Technical  Vision:

Create  a  VM  that  can  automatically  scale  any  algorithm  across  a  large  number  of  machines  

without  direct  supervision  by  the  user.

Page 11: NYC ML Meetup Talk July2015 Finalfiles.meetup.com/1516886/NYC ML Meetup Talk - Ufora.pdf · Braxton McKee Founder, CEO Keith Kegley President EyalGoldwerger Exec. Chairman Ben Beinecke

What  is  Ufora?

A  data-­‐processing  platform  that  automatically  parallelizes  user  programs  and  executes  them  across  a  cluster  of  machines.

Page 12: NYC ML Meetup Talk July2015 Finalfiles.meetup.com/1516886/NYC ML Meetup Talk - Ufora.pdf · Braxton McKee Founder, CEO Keith Kegley President EyalGoldwerger Exec. Chairman Ben Beinecke

Use  Cases:

•Data  processing  and  cleaning•Large-­‐scale  machine  learning•Modeling  and  simulation

Page 13: NYC ML Meetup Talk July2015 Finalfiles.meetup.com/1516886/NYC ML Meetup Talk - Ufora.pdf · Braxton McKee Founder, CEO Keith Kegley President EyalGoldwerger Exec. Chairman Ben Beinecke

Design  Goal:

Completely  separate  “what”  from  “how”

Page 14: NYC ML Meetup Talk July2015 Finalfiles.meetup.com/1516886/NYC ML Meetup Talk - Ufora.pdf · Braxton McKee Founder, CEO Keith Kegley President EyalGoldwerger Exec. Chairman Ben Beinecke

Key  Components

• Implicit  parallelism  at  language  level• JIT  compilation•Fault  Tolerance•Automatic  co-­‐location  of  data  and  compute

Page 15: NYC ML Meetup Talk July2015 Finalfiles.meetup.com/1516886/NYC ML Meetup Talk - Ufora.pdf · Braxton McKee Founder, CEO Keith Kegley President EyalGoldwerger Exec. Chairman Ben Beinecke

What  do  we  give  up?

•Restrict  mutability  of  data-­‐structures•Restrict  side-­‐effects•Emphasize  “functional”  programming  style•Some  features  of  host  languages  won’t  work

Page 16: NYC ML Meetup Talk July2015 Finalfiles.meetup.com/1516886/NYC ML Meetup Talk - Ufora.pdf · Braxton McKee Founder, CEO Keith Kegley President EyalGoldwerger Exec. Chairman Ben Beinecke

def isPrime(p):x  =  2while  x*x  <=  p:

if  p%x ==  0:return  0

x  =  x  +  1return  1

def filter(v,f):if  len(v)  ==  0:  

return  []if  len(v)  ==  1:

return  v  if  f(v[0])  else  []

mid  =  len(v)/2return  filter(v[:mid],f)  +  \

filter(v[mid:],f)

print  filter(range(100000000),isPrime)

Naturally parallel(divide and conquer)

A  Nice  Example

Naturally Sequential (because of the loop)

Page 17: NYC ML Meetup Talk July2015 Finalfiles.meetup.com/1516886/NYC ML Meetup Talk - Ufora.pdf · Braxton McKee Founder, CEO Keith Kegley President EyalGoldwerger Exec. Chairman Ben Beinecke

CORE #1 CORE #2 CORE #3 CORE #4

0 – 25M 25M – 50M 50M – 75M 75M – 100M

100M Integers

0 – 50M 50M – 100M

filter(v,  isPrime)

Splitting

Adaptive Parallelism

Page 18: NYC ML Meetup Talk July2015 Finalfiles.meetup.com/1516886/NYC ML Meetup Talk - Ufora.pdf · Braxton McKee Founder, CEO Keith Kegley President EyalGoldwerger Exec. Chairman Ben Beinecke

Key  Components

• Implicit  parallelism• JIT  compilation•Fault  Tolerance•Automatic  co-­‐location  of  data  and  compute

Page 19: NYC ML Meetup Talk July2015 Finalfiles.meetup.com/1516886/NYC ML Meetup Talk - Ufora.pdf · Braxton McKee Founder, CEO Keith Kegley President EyalGoldwerger Exec. Chairman Ben Beinecke

Our  solution  – react  dynamically  as  the  program  runs

Watch  running   threads  to  see  what  blocks  of  data  they’re  accessing.

Page 20: NYC ML Meetup Talk July2015 Finalfiles.meetup.com/1516886/NYC ML Meetup Talk - Ufora.pdf · Braxton McKee Founder, CEO Keith Kegley President EyalGoldwerger Exec. Chairman Ben Beinecke

Our  solution  – react  dynamically  as  the  program  runs

Watch  running   threads  to  see  what  blocks  of  data  they’re  accessing.

Detect  when  blocks  of  data  absolutely  have  to  be  on  the  same  machine.

Page 21: NYC ML Meetup Talk July2015 Finalfiles.meetup.com/1516886/NYC ML Meetup Talk - Ufora.pdf · Braxton McKee Founder, CEO Keith Kegley President EyalGoldwerger Exec. Chairman Ben Beinecke

Our  solution  – react  dynamically  as  the  program  runs

Watch  running   threads  to  see  what  blocks  of  data  they’re  accessing.

Detect  when  blocks  of  data  absolutely  have  to  be  on  the  same  machine.

Build  a  statistical  models  of  correlations  between  block  accesses.

Page 22: NYC ML Meetup Talk July2015 Finalfiles.meetup.com/1516886/NYC ML Meetup Talk - Ufora.pdf · Braxton McKee Founder, CEO Keith Kegley President EyalGoldwerger Exec. Chairman Ben Beinecke

Our  solution  – react  dynamically  as  the  program  runs

Watch  running   threads  to  see  what  blocks  of  data  they’re  accessing.

Detect  when  blocks  of  data  absolutely  have  to  be  on  the  same  machine.

Build  a  statistical  models  of  correlations  between  block  accesses.

Place  data  to  minimize  expected  future  number  of  machine  boundary  crossings.

Page 23: NYC ML Meetup Talk July2015 Finalfiles.meetup.com/1516886/NYC ML Meetup Talk - Ufora.pdf · Braxton McKee Founder, CEO Keith Kegley President EyalGoldwerger Exec. Chairman Ben Beinecke

A  Simple  Example

v  =  range(0,  2*10**9)

[0.0,1.0,2.0,3.0,4.0,…,1999999999.0]

v  =  range(0,  2*10**9)

User  writes

To  the  user,  ‘v’  is  now  a  big  contiguous  array:

[0.0,1.0,2.0,3.0,4.0,…,1999999999.0]

Page 24: NYC ML Meetup Talk July2015 Finalfiles.meetup.com/1516886/NYC ML Meetup Talk - Ufora.pdf · Braxton McKee Founder, CEO Keith Kegley President EyalGoldwerger Exec. Chairman Ben Beinecke

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16

What’s  going  on  under  the  hood?

In this example, each block is 1 GB. This block contains the first

125,000,000 numbers.

v

“v” is actually a pointer to a bunch of blocks of data

v  =  range(0,  2*10**9)

Page 25: NYC ML Meetup Talk July2015 Finalfiles.meetup.com/1516886/NYC ML Meetup Talk - Ufora.pdf · Braxton McKee Founder, CEO Keith Kegley President EyalGoldwerger Exec. Chairman Ben Beinecke

Machine  1

Machine  2

1 2 3 4

5 6 7 8

9 10 11 12

13 14 15 16

Machine  3

Machine  4

What  if  the  dataset  is  bigger  than  one  machine  can  hold?

Machines w/ 8 GB RAM16 GB of data

Page 26: NYC ML Meetup Talk July2015 Finalfiles.meetup.com/1516886/NYC ML Meetup Talk - Ufora.pdf · Braxton McKee Founder, CEO Keith Kegley President EyalGoldwerger Exec. Chairman Ben Beinecke

Machine  1

Machine  2

1 2 3 4

5 6 7 8

9 10 11 12

13 14 15 16

Machine  3

Machine  4

Simple  case  – just  stripe  data  linearly

Place first 4 GB on machine 1

Page 27: NYC ML Meetup Talk July2015 Finalfiles.meetup.com/1516886/NYC ML Meetup Talk - Ufora.pdf · Braxton McKee Founder, CEO Keith Kegley President EyalGoldwerger Exec. Chairman Ben Beinecke

What  happens  when  we  start  using  the  data?

User  writes

Now  the  computation  wants  to  scan  sequentially  over  the  dataset

for  x  in  v:    # some  complicated  state  machine

Page 28: NYC ML Meetup Talk July2015 Finalfiles.meetup.com/1516886/NYC ML Meetup Talk - Ufora.pdf · Braxton McKee Founder, CEO Keith Kegley President EyalGoldwerger Exec. Chairman Ben Beinecke

Machine  1

Machine  2

1 2 3 4

5 6 7 8

9 10 11 12

13 14 15 16

Machine  3

Machine  4

Computation  starts  on  Machine  1

When  the  computation  exhausts  the  data  on  one  machine,  the  runtime  moves  it  to  the  next

for  x  in  v:    …  

Page 29: NYC ML Meetup Talk July2015 Finalfiles.meetup.com/1516886/NYC ML Meetup Talk - Ufora.pdf · Braxton McKee Founder, CEO Keith Kegley President EyalGoldwerger Exec. Chairman Ben Beinecke

But  real  access  patterns  are  more  complex!

User  writes

Now  the  computation  is  looking  at  all  pairs  v[i]  and  v[i+10]  

res  =  0def f(x,y):

#  some  functionfor  i in  xrange(0,  len(v)-­‐10):  

res  =  res  +  f(v[i],  v[i+10])

Page 30: NYC ML Meetup Talk July2015 Finalfiles.meetup.com/1516886/NYC ML Meetup Talk - Ufora.pdf · Braxton McKee Founder, CEO Keith Kegley President EyalGoldwerger Exec. Chairman Ben Beinecke

Machine  1

Machine  2

1 2 3 4

5 6 7 8

9 10 11 12

13 14 15 16

Machine  3

Machine  4

But  when  the  computation  reaches  the  end  of  block  4,  v[i]  and  v[i+10]  aren’t  on  the  same  machine!

At  first,  everything  is  OK,  since  v[ix]  and  v[ix+10]  are  close  to  each  other  in  the  data

Page 31: NYC ML Meetup Talk July2015 Finalfiles.meetup.com/1516886/NYC ML Meetup Talk - Ufora.pdf · Braxton McKee Founder, CEO Keith Kegley President EyalGoldwerger Exec. Chairman Ben Beinecke

When  v[i+10]  needs  data  on  machine  2,  the  runtime  has  to  move  the  

computation  to  the  other  machine

Block  4  on  Machine  1

Block  5  on  Machine  2v[i]

v[i+10]

Page 32: NYC ML Meetup Talk July2015 Finalfiles.meetup.com/1516886/NYC ML Meetup Talk - Ufora.pdf · Braxton McKee Founder, CEO Keith Kegley President EyalGoldwerger Exec. Chairman Ben Beinecke

But  then  we  increment  ‘i’  and  have  to  go  back  to  the  first  machine.

Block  4  on  Machine  1

Block  5  on  Machine  2v[ix]

v[ix+10]

Page 33: NYC ML Meetup Talk July2015 Finalfiles.meetup.com/1516886/NYC ML Meetup Talk - Ufora.pdf · Braxton McKee Founder, CEO Keith Kegley President EyalGoldwerger Exec. Chairman Ben Beinecke

Now  the  computation  is  alternating  between  accessing  data  on  Machine  1  

and  Machine  2

Block  4  on  Machine  1

Block  5  on  Machine  2v[ix]

v[ix+10]

Page 34: NYC ML Meetup Talk July2015 Finalfiles.meetup.com/1516886/NYC ML Meetup Talk - Ufora.pdf · Braxton McKee Founder, CEO Keith Kegley President EyalGoldwerger Exec. Chairman Ben Beinecke

Every  time  we  have  to  move  the  computation,  we’re  hitting  the  network.

Block  4  on  Machine  1

Block  5  on  Machine  2v[ix]

v[ix+10]

This  is  really  slow!

Page 35: NYC ML Meetup Talk July2015 Finalfiles.meetup.com/1516886/NYC ML Meetup Talk - Ufora.pdf · Braxton McKee Founder, CEO Keith Kegley President EyalGoldwerger Exec. Chairman Ben Beinecke

Machine  1

Machine  2

1 2 3 4

5 6 7 8

9 10 11 12

13 14 15 16

Machine  3

Machine  4

Solution:  Replicate  blocks  so  that  they  overlap

5

9

13

Data  can  live  on  two  different  machines  at  the  same  time  because  its  immutable!

Page 36: NYC ML Meetup Talk July2015 Finalfiles.meetup.com/1516886/NYC ML Meetup Talk - Ufora.pdf · Braxton McKee Founder, CEO Keith Kegley President EyalGoldwerger Exec. Chairman Ben Beinecke

Machine  1

Machine  2

1 2 3 4

5 6 7 8

9 10 11 12

13 14 15 16

Machine  3

Machine  4

Solution:  Replicate  blocks  so  that  they  overlap

5

Now  v[i]  and  v[i+10]  are  always  available  together  on  some  machine

9

13

Page 37: NYC ML Meetup Talk July2015 Finalfiles.meetup.com/1516886/NYC ML Meetup Talk - Ufora.pdf · Braxton McKee Founder, CEO Keith Kegley President EyalGoldwerger Exec. Chairman Ben Beinecke

What  about  a  different  access  pattern?

Imagine  the  user  writes

Now  the  computation  will  jump  back  and  forth  between  the  beginning  and  the  end  of  the  vector.

res  =  0def f(x,y):

#  some  functionfor  i in  xrange(0,  len(v)):  

res  =  res  +  f(v[i],  v[-­‐i])

Page 38: NYC ML Meetup Talk July2015 Finalfiles.meetup.com/1516886/NYC ML Meetup Talk - Ufora.pdf · Braxton McKee Founder, CEO Keith Kegley President EyalGoldwerger Exec. Chairman Ben Beinecke

Machine  1

Machine  2

1 2 3 4

5 6 7 8

9 10 11 12

13 14 15 16

Machine  3

Machine  4

5

9

13

The  computation  will  jump  back  and  forth  between  the  beginning  and  the  end  of  the  vector.

f(v[i],  v[-­‐i])

V[1]V[2]V[3]

V[-­‐1]V[-­‐2]V[-­‐3]

Page 39: NYC ML Meetup Talk July2015 Finalfiles.meetup.com/1516886/NYC ML Meetup Talk - Ufora.pdf · Braxton McKee Founder, CEO Keith Kegley President EyalGoldwerger Exec. Chairman Ben Beinecke

Machine  1

Machine  2

1

16

Machine  3

Machine  4

2 3 4

5 6 7 8

9 10 11 12

13 14 15

5

9

13

But  these  blocks  are  on  separate  machines,  which  we  know  is  slow

Page 40: NYC ML Meetup Talk July2015 Finalfiles.meetup.com/1516886/NYC ML Meetup Talk - Ufora.pdf · Braxton McKee Founder, CEO Keith Kegley President EyalGoldwerger Exec. Chairman Ben Beinecke

Machine  1

Machine  2

1 2 3 4

5 6 7 8

9 10 11 12

13 14 15 16

Machine  3

Machine  4

5

9

13

Under  this  layout,  the  computation  will  have  to  move  between  machines  2  billion  times!

Page 41: NYC ML Meetup Talk July2015 Finalfiles.meetup.com/1516886/NYC ML Meetup Talk - Ufora.pdf · Braxton McKee Founder, CEO Keith Kegley President EyalGoldwerger Exec. Chairman Ben Beinecke

Machine  1

Machine  2

1 2

3 4

5 6

7 8 9 10

11 12

14

15 16

Machine  3

Machine  4

13

But  if  we  lay  out  the  data  out  like  this,we’ll  only  have  to  move  a  few  times

Page 42: NYC ML Meetup Talk July2015 Finalfiles.meetup.com/1516886/NYC ML Meetup Talk - Ufora.pdf · Braxton McKee Founder, CEO Keith Kegley President EyalGoldwerger Exec. Chairman Ben Beinecke

There  are  lots  of  different  access  patterns.

Each  one  has  a  different  optimal  layout.

Let’s  look  at  a  couple  that  show  up  in  ML  applications

Page 43: NYC ML Meetup Talk July2015 Finalfiles.meetup.com/1516886/NYC ML Meetup Talk - Ufora.pdf · Braxton McKee Founder, CEO Keith Kegley President EyalGoldwerger Exec. Chairman Ben Beinecke

def dot(v1,  v2):return  sum(0,  len(v1),  

lambda  i:  v1[i]  *  v2[i])

Example:  Simple  Dot  Product

Page 44: NYC ML Meetup Talk July2015 Finalfiles.meetup.com/1516886/NYC ML Meetup Talk - Ufora.pdf · Braxton McKee Founder, CEO Keith Kegley President EyalGoldwerger Exec. Chairman Ben Beinecke

Example:  Simple  Dot  Product

def dot(v1,  v2):return  sum(0,  len(v1),  

lambda  i:  v1[i]  *  v2[i])

M1

M2

M3

M4

Page 45: NYC ML Meetup Talk July2015 Finalfiles.meetup.com/1516886/NYC ML Meetup Talk - Ufora.pdf · Braxton McKee Founder, CEO Keith Kegley President EyalGoldwerger Exec. Chairman Ben Beinecke

tExample:  Linear  Regressiondef range(a,b,f):

if  a  >=  b:return  []

if  a+1  >=  b:return  [f(a)]

mid  =  (a+b)/2return  range(a,mid,f)  +  range(mid,b,f)

def computeXtX(columns):return  range(0,  len(columns),  lambda  c1:

range(0,  len(columns),  lambda  c2:dot(columns[c1],  columns[c2])))

Basically computing a covariance matrix

Page 46: NYC ML Meetup Talk July2015 Finalfiles.meetup.com/1516886/NYC ML Meetup Talk - Ufora.pdf · Braxton McKee Founder, CEO Keith Kegley President EyalGoldwerger Exec. Chairman Ben Beinecke

t

100

Each  column  is  1  GB

Example:  Linear  Regression

Page 47: NYC ML Meetup Talk July2015 Finalfiles.meetup.com/1516886/NYC ML Meetup Talk - Ufora.pdf · Braxton McKee Founder, CEO Keith Kegley President EyalGoldwerger Exec. Chairman Ben Beinecke

t

100

Example:  Linear  Regression

Page 48: NYC ML Meetup Talk July2015 Finalfiles.meetup.com/1516886/NYC ML Meetup Talk - Ufora.pdf · Braxton McKee Founder, CEO Keith Kegley President EyalGoldwerger Exec. Chairman Ben Beinecke

t

100

4950  Pairs

Example:  Linear  Regression

Page 49: NYC ML Meetup Talk July2015 Finalfiles.meetup.com/1516886/NYC ML Meetup Talk - Ufora.pdf · Braxton McKee Founder, CEO Keith Kegley President EyalGoldwerger Exec. Chairman Ben Beinecke

t

100

4950  Pairs

Example:  Linear  Regression

M1

Page 50: NYC ML Meetup Talk July2015 Finalfiles.meetup.com/1516886/NYC ML Meetup Talk - Ufora.pdf · Braxton McKee Founder, CEO Keith Kegley President EyalGoldwerger Exec. Chairman Ben Beinecke

t

100

4950  Pairs

Example:  Linear  Regression

M1

M2

M3

M4

Page 51: NYC ML Meetup Talk July2015 Finalfiles.meetup.com/1516886/NYC ML Meetup Talk - Ufora.pdf · Braxton McKee Founder, CEO Keith Kegley President EyalGoldwerger Exec. Chairman Ben Beinecke

tExample:  DecisionTree

• Tries  to  learn  a  function  of  Y  given  a  set  of  X’s  in  N  dimensions.

• Common  algorithm  for  both  GBM and  random  forest

• Complex  parallelism  pattern• Has  both  static  data  (that  sits  still)  and  

rapidly  moving  data

Page 52: NYC ML Meetup Talk July2015 Finalfiles.meetup.com/1516886/NYC ML Meetup Talk - Ufora.pdf · Braxton McKee Founder, CEO Keith Kegley President EyalGoldwerger Exec. Chairman Ben Beinecke

tExample:  DecisionTree

X0  >  10

X9  >  1.2 X33  >  -­‐5.0

X65  >  .111 Y=20 Y=30

Y=0Y=10

Y=1

Page 53: NYC ML Meetup Talk July2015 Finalfiles.meetup.com/1516886/NYC ML Meetup Talk - Ufora.pdf · Braxton McKee Founder, CEO Keith Kegley President EyalGoldwerger Exec. Chairman Ben Beinecke

tExample:  DecisionTree

How  do  we  pick  split  points?

Page 54: NYC ML Meetup Talk July2015 Finalfiles.meetup.com/1516886/NYC ML Meetup Talk - Ufora.pdf · Braxton McKee Founder, CEO Keith Kegley President EyalGoldwerger Exec. Chairman Ben Beinecke

tExample:  DecisionTree

X0

Y

Scan  over  the  data  for  each  column

Page 55: NYC ML Meetup Talk July2015 Finalfiles.meetup.com/1516886/NYC ML Meetup Talk - Ufora.pdf · Braxton McKee Founder, CEO Keith Kegley President EyalGoldwerger Exec. Chairman Ben Beinecke

tExample:  DecisionTree

X0

Y

Build  a  histogram

Page 56: NYC ML Meetup Talk July2015 Finalfiles.meetup.com/1516886/NYC ML Meetup Talk - Ufora.pdf · Braxton McKee Founder, CEO Keith Kegley President EyalGoldwerger Exec. Chairman Ben Beinecke

tExample:  DecisionTree

X0

Y

Pick  the  “X”  point  that  maximizes  separation  of  Y

Page 57: NYC ML Meetup Talk July2015 Finalfiles.meetup.com/1516886/NYC ML Meetup Talk - Ufora.pdf · Braxton McKee Founder, CEO Keith Kegley President EyalGoldwerger Exec. Chairman Ben Beinecke

tExample:  DecisionTree Pick  the  best  rule  over  all  the  columns,  

and  divide  the  dataset  in  half  

according  to  that  rule

X9  >  1.2

Page 58: NYC ML Meetup Talk July2015 Finalfiles.meetup.com/1516886/NYC ML Meetup Talk - Ufora.pdf · Braxton McKee Founder, CEO Keith Kegley President EyalGoldwerger Exec. Chairman Ben Beinecke

tExample:  DecisionTreeAnd  recurse!

X9  >  1.2

X0  >  0.5

X33  >  7.1

Page 59: NYC ML Meetup Talk July2015 Finalfiles.meetup.com/1516886/NYC ML Meetup Talk - Ufora.pdf · Braxton McKee Founder, CEO Keith Kegley President EyalGoldwerger Exec. Chairman Ben Beinecke

tExample:  DecisionTreeAnd  compute  

averages  over  Y  over  all  the  leaf  datasets.

X9  >  1.2

X0  >  0.5

X33  >  7.1

Y  =  20 Y  =  30

Page 60: NYC ML Meetup Talk July2015 Finalfiles.meetup.com/1516886/NYC ML Meetup Talk - Ufora.pdf · Braxton McKee Founder, CEO Keith Kegley President EyalGoldwerger Exec. Chairman Ben Beinecke

tExample:  DecisionTreeAnd  compute  

averages  over  Y  over  all  the  leaf  datasets.

X9  >  1.2

X0  >  0.5

X33  >  7.1

Y  =  20 Y  =  30

Page 61: NYC ML Meetup Talk July2015 Finalfiles.meetup.com/1516886/NYC ML Meetup Talk - Ufora.pdf · Braxton McKee Founder, CEO Keith Kegley President EyalGoldwerger Exec. Chairman Ben Beinecke

tExample:  DecisionTree

We  don’t  actually  make  a  full  copy  of  the  data  at  each  split.

Instead,  we  can  track  “Active  Indices,”  which  contain  the  set  of  row  indices  in  a  given  subset.

This  is  much  smaller  than  the  whole  set.

Page 62: NYC ML Meetup Talk July2015 Finalfiles.meetup.com/1516886/NYC ML Meetup Talk - Ufora.pdf · Braxton McKee Founder, CEO Keith Kegley President EyalGoldwerger Exec. Chairman Ben Beinecke

YT`

100

Example:  DecisionTree

M1

M2

M3

M4

Active  Indices

Page 63: NYC ML Meetup Talk July2015 Finalfiles.meetup.com/1516886/NYC ML Meetup Talk - Ufora.pdf · Braxton McKee Founder, CEO Keith Kegley President EyalGoldwerger Exec. Chairman Ben Beinecke

t

100

Example:  DecisionTree

M1

M2

M3

M4

Active  Indices

Y

Page 64: NYC ML Meetup Talk July2015 Finalfiles.meetup.com/1516886/NYC ML Meetup Talk - Ufora.pdf · Braxton McKee Founder, CEO Keith Kegley President EyalGoldwerger Exec. Chairman Ben Beinecke

t

100

Example:  DecisionTree

M1

M2

M3

M4

Active  Indices

Y

Page 65: NYC ML Meetup Talk July2015 Finalfiles.meetup.com/1516886/NYC ML Meetup Talk - Ufora.pdf · Braxton McKee Founder, CEO Keith Kegley President EyalGoldwerger Exec. Chairman Ben Beinecke

t

100

Example:  DecisionTree

M1

M2

M3

M4

Active  Indices

Y

Page 66: NYC ML Meetup Talk July2015 Finalfiles.meetup.com/1516886/NYC ML Meetup Talk - Ufora.pdf · Braxton McKee Founder, CEO Keith Kegley President EyalGoldwerger Exec. Chairman Ben Beinecke

How  is  this  implemented?

Backend  in  C++Codegen using  LLVM

Language  bindings  in  Python

Page 67: NYC ML Meetup Talk July2015 Finalfiles.meetup.com/1516886/NYC ML Meetup Talk - Ufora.pdf · Braxton McKee Founder, CEO Keith Kegley President EyalGoldwerger Exec. Chairman Ben Beinecke

Python Code

Ufora IR

FORA code

Threads

Data

Page 68: NYC ML Meetup Talk July2015 Finalfiles.meetup.com/1516886/NYC ML Meetup Talk - Ufora.pdf · Braxton McKee Founder, CEO Keith Kegley President EyalGoldwerger Exec. Chairman Ben Beinecke

Ufora Worker

Thread  Scheduler

Page 69: NYC ML Meetup Talk July2015 Finalfiles.meetup.com/1516886/NYC ML Meetup Talk - Ufora.pdf · Braxton McKee Founder, CEO Keith Kegley President EyalGoldwerger Exec. Chairman Ben Beinecke

Ufora Worker

JIT  compiler  (LLVM)

Thread  Scheduler

Page 70: NYC ML Meetup Talk July2015 Finalfiles.meetup.com/1516886/NYC ML Meetup Talk - Ufora.pdf · Braxton McKee Founder, CEO Keith Kegley President EyalGoldwerger Exec. Chairman Ben Beinecke

Ufora Worker

JIT  compiler  (LLVM)

Thread  Scheduler

Page 71: NYC ML Meetup Talk July2015 Finalfiles.meetup.com/1516886/NYC ML Meetup Talk - Ufora.pdf · Braxton McKee Founder, CEO Keith Kegley President EyalGoldwerger Exec. Chairman Ben Beinecke

Ufora Worker

JIT  compiler  (LLVM)

Thread  Scheduler

Data  Scheduler

Page 72: NYC ML Meetup Talk July2015 Finalfiles.meetup.com/1516886/NYC ML Meetup Talk - Ufora.pdf · Braxton McKee Founder, CEO Keith Kegley President EyalGoldwerger Exec. Chairman Ben Beinecke

Ufora Worker

JIT  compiler  (LLVM)

Thread  Scheduler

Data  Scheduler

Page 73: NYC ML Meetup Talk July2015 Finalfiles.meetup.com/1516886/NYC ML Meetup Talk - Ufora.pdf · Braxton McKee Founder, CEO Keith Kegley President EyalGoldwerger Exec. Chairman Ben Beinecke

Ufora Worker

JIT  compiler  (LLVM)

Thread  Scheduler

Data  Scheduler Global  Scheduler

Page 74: NYC ML Meetup Talk July2015 Finalfiles.meetup.com/1516886/NYC ML Meetup Talk - Ufora.pdf · Braxton McKee Founder, CEO Keith Kegley President EyalGoldwerger Exec. Chairman Ben Beinecke

Data  backplane  to  other  Machines

Ufora Worker

JIT  compiler  (LLVM)

Thread  Scheduler

Data  Scheduler Global  Scheduler

Local  Disk

Page 75: NYC ML Meetup Talk July2015 Finalfiles.meetup.com/1516886/NYC ML Meetup Talk - Ufora.pdf · Braxton McKee Founder, CEO Keith Kegley President EyalGoldwerger Exec. Chairman Ben Beinecke
Page 76: NYC ML Meetup Talk July2015 Finalfiles.meetup.com/1516886/NYC ML Meetup Talk - Ufora.pdf · Braxton McKee Founder, CEO Keith Kegley President EyalGoldwerger Exec. Chairman Ben Beinecke
Page 77: NYC ML Meetup Talk July2015 Finalfiles.meetup.com/1516886/NYC ML Meetup Talk - Ufora.pdf · Braxton McKee Founder, CEO Keith Kegley President EyalGoldwerger Exec. Chairman Ben Beinecke

The  End•We’re  Hiring

•We  give  out  trial  licenses  for  compelling  projects

•Contact  me:    [email protected]