fault tolerant clustering (ieee services 2012)

30
Fault Tolerant Clustering in Scien2fic Workflows Weiwei Chen, Ewa Deelman Informa2on Sciences Ins2tute University of Southern California 1

Upload: weiwei-chen

Post on 21-Jun-2015

106 views

Category:

Technology


1 download

DESCRIPTION

fault tolerant clustering for workflows

TRANSCRIPT

Page 1: Fault Tolerant Clustering (IEEE Services 2012)

Fault  Tolerant  Clustering  in  Scien2fic  Workflows  

Weiwei  Chen,  Ewa  Deelman  Informa2on  Sciences  Ins2tute  

University  of  Southern  California  

1  

Page 2: Fault Tolerant Clustering (IEEE Services 2012)

Outline  

•  Introduc2on  •  Workflow  and  Failure  Model  •  Fault  Tolerant  Clustering  •  Experiments  •  Task  Specific  Failures  •  Loca2on  Specific  Failures  

2  

Page 3: Fault Tolerant Clustering (IEEE Services 2012)

Introduc2on    •  Task  based  Scien2fic  Workflows  –  Task  –  Job  

•  Task  Clustering    – Merges  mul2ple  small  tasks  into  a  job    –  Reduce  scheduling  and  submit  overhead  

•  Fault  Tolerance  in  Task  Clustering  –  Exis2ng  techniques  underes2mate  or  ignore  the  influences  of  failures  

3  

Page 4: Fault Tolerant Clustering (IEEE Services 2012)

Task  Clustering    •  Task  Clustering    – Horizontal  Clustering  – Ver2cal  Clustering  – Arbitrary  Clustering  

Clustering  Factor  (k):  number  of  tasks  in  a  job  4  

Page 5: Fault Tolerant Clustering (IEEE Services 2012)

System  Overview    

5  

Timeline  

with  clustering  

without  clustering  

Improvement  

scheduling  and  submit  delay  

Page 6: Fault Tolerant Clustering (IEEE Services 2012)

Task  Failures  and  Job  Failures  •  We  only  focus  on  Transient  Failure  and  Job  Retry  •  We  don’t  differen2ate  the  causes  of  failures  but  we  concern  about  the  average  failure  rate.    

•  Assump2on:  a  failure  is  a  random  event  independent  of  workflow  characteris2cs  or  execu2on  environment    

•  Two  Categories  o  Task  Failure:  a  task  fails,  other  

tasks  in  the  same  job  may  not  fail  §  E.g.  Applica2on  

  o  Job  Failure:  a  job  fails,  all  of  its  tasks  fail  §  E.g.  Scheduling  System  

 

6  

Page 7: Fault Tolerant Clustering (IEEE Services 2012)

Influence  of  Failures  on  Clustering  ttotal   Es2mated  Overall  Run2me  

n   Number  of  tasks  to  run  

t   Run2me  of  a  single  task  

r   Number  of  available  resources  

d   Time  delay  between  jobs  

N   Expected  retry  2mes  for  a  single  task  

k   Number  of  tasks  in  a  job  

β   Job  failure  rate  

α   Task  failure  rate  Target  Func2on:  Min  (ttotal)  

given  n  tasks  to  run  on  r  resources  task  failure  rate  (α)  is  measurable  (Task  Failure  Model)  or  job  failure  rate  (β)  is  measurable  (Job  Failure  Model)  

 Assump2on:  n  >>  r,  but  n/k  >>  r    7  

Page 8: Fault Tolerant Clustering (IEEE Services 2012)

Job  Failure  Model  

ttotal =

Nn(kt + d)rk

=n(kt + d)rk(1−β)

, if nk≥ r

N(kt + d) = (kt + d)1−β

, if nk< r

#

$

%%

&

%%

8  

ttotal   Es2mated  Overall  Run2me  

n   Number  of  tasks  to  run  

t   Run2me  of  a  single  task  

r   Number  of  available  resources  

d   Time  delay  between  jobs  

N   Expected  retry  2mes  for  a  single  task  

k   Number  of  tasks  in  a  job  

β   Job  failure  rate  

α   Task  failure  rate  

N job =1

(1−β)

Ntotal =

N jobnrk

if nk≥ r

N job, if nk< r

"

#

$$

%

$$

t job = kt + d

ttotal = t jobNtotal

Run2me  for  a  single  job  

Avg  retry  2me  for  a  single  job  

Retry  2me  for  all  jobs  

Overall  run2me  

Page 9: Fault Tolerant Clustering (IEEE Services 2012)

Job  Failure  Model  

ttotal =

Nn(kt + d)rk

=n(kt + d)rk(1−β)

, if nk≥ r

N(kt + d) = (kt + d)1−β

, if nk< r

#

$

%%

&

%%

k* = nr

ttotal* =(kt + d)1−β

It’s  not  necessary  to  adjust  k.  Just  set  it  to  be  

9  

n=1000,  t=5  sec,  d=5  sec,  r=20  

k*  is  independent  of  β    

Page 10: Fault Tolerant Clustering (IEEE Services 2012)

Task  Failure  Model  

10  

ttotal   Es2mated  Overall  Run2me  

n   Number  of  tasks  to  run  

t   Run2me  of  a  single  task  

r   Number  of  available  resources  

d   Time  delay  between  jobs  

N   Expected  retry  2mes  for  a  single  task  

k   Number  of  tasks  in  a  job  

β   Job  failure  rate  

α   Task  failure  rate  

N job =1

(1−α)k

Ntotal =

N jobnrk

if nk≥ r

N job, if nk< r

"

#

$$

%

$$

t job = kt + d

ttotal = t jobNtotal

Run2me  for  a  single  job  

Avg  retry  2me  for  a  single  job  

Retry  2me  for  all  jobs  

Overall  run2me  

ttotal =

Nn(kt + d)rk

=n(kt + d)rk(1−α)k

, if nk≥ r

N(kt + d) = (kt + d)(1−α)k

, if nk< r

#

$

%%

&

%%

Page 11: Fault Tolerant Clustering (IEEE Services 2012)

Task  Failure  Model  

ttotal =

Nn(kt + d)rk

=n(kt + d)rk(1−α)k

, if nk≥ r

N(kt + d) = (kt + d)(1−α)k

, if nk< r

#

$

%%

&

%%

k* =−d + d 2 − 4d

ln(1−α)2t

, if n >> r

ttotal* =

n(k*t + d)rk(1−α)k

*

k*  is  dependent  of  α    

It’s  necessary  to  adjust  k  according  to  α  

11  

Page 12: Fault Tolerant Clustering (IEEE Services 2012)

Comparing  TFM  and  JFM  

k* =−d + d 2 − 4d

ln(1−α)2t

, if n >> r

ttotal* =

n(k*t + d)rk(1−α)k

*

ttotal* =(kt + d)1−β

k* = nr

1.  Linear  increase  vs  exponen2al  increase  2.  Op2mal  clustering  factor  

12  

Page 13: Fault Tolerant Clustering (IEEE Services 2012)

Fault  Tolerant  Clustering  •  Job  Failure  Model:  k=n/r  •  Selec2ve  Reclustering  (SR)  – select  the  failed  tasks  in  a  clustered  job  and  cluster  them  into  a  new  clustered  job    

–  It  requires  the  iden2fica2on  of  failed  tasks.  

13  

Page 14: Fault Tolerant Clustering (IEEE Services 2012)

Fault  Tolerant  Clustering  

•  Dynamic  Clustering  (DC)  – adjust  the  clustering  factor  according  to  the  task  failure  rates  dynamically  

ttotal,DC* =

n(k*t + d)rk*(1−α)k

*

k* =−d + d 2 − 4d

ln(1−α)2t

, if n >> r

14  

Page 15: Fault Tolerant Clustering (IEEE Services 2012)

Fault  Tolerant  Clustering  

•  Dynamic  Reclustering  (DR)  – A  combina2on  of  SR  and  DC  

15  

Page 16: Fault Tolerant Clustering (IEEE Services 2012)

Evalua2on  

•  Run  simula2ons  based  on  the  real  traces  that  were  run  by  the  Pegasus  group.    

•  Each  workflow  was  simulated  100  2mes  so  that  the  standard  devia2on  is  less  than  10%  

•  Two  workflows  were  used.    •  20  worker  nodes  were  used  in  each  experiment.    

16  

Page 17: Fault Tolerant Clustering (IEEE Services 2012)

Workflows  Used  •  Montage  – An  astronomy  applica2on  used  to  construct   large  image  mosaics  of  the  sky.    

– Montage   has   complex   data   dependencies  between  tasks    

– 10,422  tasks,  57GB  data.    

17  Image  from  hhp://montage.ipac.caltech.edu/  

Page 18: Fault Tolerant Clustering (IEEE Services 2012)

Workflows  Used  •  Periodogram  –  Iden2fy   periodic   signals   from   light   curves   that  arise  from  transi2ng  planets.    

– 216,600  tasks,  19GB  input  data.    – Periodogram  has  only  one  level  

18  Image  from  hhp://pegasus.isi.edu/presenta2ons/2011/sci709-­‐voeckler-­‐talk.ppt/  

Page 19: Fault Tolerant Clustering (IEEE Services 2012)

Simulator  

•  Extension  to  CloudSim  – Workflow  Engine  – Clustering  Engine  – Scheduler  – Failure  Generator  – Failure  Monitor  

19  

Page 20: Fault Tolerant Clustering (IEEE Services 2012)

Performance  •  NOOP:  no  op2miza2on,  (k=n/r)  •  DC  (Dynamic  Clustering)    •  SR  (Selec2ve  Reclustering)  •  DR  (  Dynamic  Reclustering)  •  Overall  Run2me  in  seconds  

20  

Page 21: Fault Tolerant Clustering (IEEE Services 2012)

Performance  

•  Periodogram  

21  

Page 22: Fault Tolerant Clustering (IEEE Services 2012)

Performance  

•  Montage  

22  

Page 23: Fault Tolerant Clustering (IEEE Services 2012)

Task  Specific  Failure  Detec2on  (TSFD)  •  Task  Failures  are  related  to  the  type  of  tasks  •  Failure  Monitor  classifies  failures  based  on  the  type    •  Clustering   Engine   merges   tasks   based   on   different   task  

failure  rates  •  In   this   experiment   of  Montage,   we   set   the   task   failure  

rate   of   mProjectPP   and   mDiffFit   to   be   0.001   while  mBackground  ranges  from  0.2  to  0.8.    

α1 Optimization Methods

DR DR+TSFD DC DC+TSFD

0.2 10415 10412 13804 13820

0.4 11830 11839 22946 22923

0.6 14704 14688 60429 60414

0.8 23238 23229 436638 435297

1 The task failure rate of mBackground only!

23  

Page 24: Fault Tolerant Clustering (IEEE Services 2012)

Task  Failure  Model  

ttotal =

Nn(kt + d)rk

=n(kt + d)rk(1−α)k

, if nk≥ r

N(kt + d) = (kt + d)(1−α)k

, if nk< r

#

$

%%

&

%%

k* =−d + d 2 − 4d

ln(1−α)2t

, if n >> r

ttotal* =

n(k*t + d)rk(1−α)k

*

ttotal  is  not  sensi2ve  to  α    

24  Simplifica2on  of  failures  is  acceptable    

Page 25: Fault Tolerant Clustering (IEEE Services 2012)

Loca2on  Specific  Failure  Detec2on  (LSFD)  •  Task  Failures  are  related  to  the  loca2on  of  execu2on  •  Failure  Monitor  classifies   failures  based  on  resource  id  

•  Scheduler  orders  resources  based  on  their  reliability.  •  Two  out  of   twenty  nodes  have  a  higher   task   failure  rates   (from  0.2   to  0.8)  while  others   s2ll  have  a   task  failure  rate  of  0.001.    

25  

DC  generates  many  small  tasks  if  task  failure  rate  is  high  

Page 26: Fault Tolerant Clustering (IEEE Services 2012)

Conclusion  

•  We  present  three  basic  methods  to  improve  fault  tolerance  in  task  clustering  

•  If  the  system  supports  iden2fica2on  of  failed  tasks,  dynamic  reclustering  performs  best  

•  Otherwise,  use  dynamic  clustering  •  Improvement  is  significant  even  for  very  basic  method  

26  

Page 27: Fault Tolerant Clustering (IEEE Services 2012)

Future  Work  

•  Ver2cal  Clustering  and  Arbitrary  Clustering  •  Intelligent  Scheduler  •  More  Workflow  Examples  •  Distribu2on  of  Failures  

27  

Page 28: Fault Tolerant Clustering (IEEE Services 2012)

Ques2ons?  

•  Thank  you  for  coming!  •  For  further  info,  please  visit:  pegasus.isi.edu  or  email  [email protected]  

28  

Page 29: Fault Tolerant Clustering (IEEE Services 2012)

Refinements  •  When  n>>r  does  not  hold  in  the  end  of  execu2on  

•  Default:    •  Replica2ve:                                                    replicate  jobs  by  •  Even:    

kactual = k* njobs =

ntaskk

< r

kactual = k* njobs = r

rntask / k

kactual =ntaskr

njobs = r

29  

Page 30: Fault Tolerant Clustering (IEEE Services 2012)

Dynamic  Performance  

•  TFM  and  DC  

30