Paper presentation: Taverna, reloaded

SSDBM, Heidelberg, Germany, June 30 - July 2, 2010. Taverna, reloaded. Paolo Missier, Stian Soiland-Reyes, Stuart Owen, Alex Nenadic, Ian Dunlop, Alan Williams, Carole Goble (School of Computer Science, University of Manchester, UK); Wei Tan (Argonne National Laboratory, Argonne, IL, USA); Tom Oinn (EBI, UK)

Upload: paolo-missier

Post on 11-May-2015


DESCRIPTION

citation: Missier, P., Soiland-Reyes, S., Owen, S., Tan, W., Nenadic, A., Dunlop, I., et al. (2010). Taverna, reloaded. In M. Gertz, T. Hey, & B. Ludaescher (Eds.), Procs. SSDBM 2010. Heidelberg, Germany. Paper available at: http://www.ssdbm2010.org/.

TRANSCRIPT

Page 1: Paper presentation: Taverna, reloaded

SSDBM, Heidelberg, Germany, June 30 - July 2, 2010

Taverna, reloaded

Paolo Missier, Stian Soiland-Reyes, Stuart Owen, Alex Nenadic, Ian Dunlop, Alan Williams, Carole Goble

School of Computer Science, University of Manchester, UK

Wei Tan, Argonne National Laboratory, Argonne, IL, USA

Tom Oinn, EBI, UK

Page 2: Paper presentation: Taverna, reloaded

Taverna: scientific workflow management

http://www.myexperiment.org/workflows/746

Goal: perform cancer diagnosis using microarray analysis
- learn a model for lymphoma type prediction based on samples from different lymphoma types

Workflow steps:
- lymphoma samples ➔ hybridization data
- process microarray data as training dataset
- learn predictive model

source: caGrid

Page 7: Paper presentation: Taverna, reloaded

In this (short) talk...

• Taverna for data-intensive science
  – architectural goals... and how they are being achieved

• Performance figures on benchmark workflows

Page 8: Paper presentation: Taverna, reloaded

Target -ilities

• Scalability
  – size of input data / size of input collections

• Configurability
  – each processor in the workflow individually tunable to adjust to the underlying platform

• Extensibility
  – plugin architecture to accommodate specific service groups
    • caGrid, BioMoby
  – extensible set of operators that define the semantics of the model

Page 9: Paper presentation: Taverna, reloaded

Opportunities for parallel processing

• intra-processor: implicit iteration over list data
• inter-processor: pipelining

[Figure: the lists [ id1, id2, id3, ...] and [ DS1, DS2, DS3, ...] with one pipeline per list element, labelled SFH1 ... getDS1, SFH2 ... getDS2, SFH3 ... getDS3]
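The two forms of parallelism can be illustrated with a minimal Python sketch. It is not Taverna code: `fetch_dataset` and `summarise` are hypothetical stand-ins for two processors, and the thread pools are only an assumption about how concurrent threads might be allocated to list elements.

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_dataset(item_id):
    # Hypothetical upstream processor: pretend to fetch a dataset by id.
    return f"DS({item_id})"

def summarise(dataset):
    # Hypothetical downstream processor: pretend to process one dataset.
    return dataset.lower()

def run_pipeline(ids, threads_per_processor=3):
    """Apply fetch_dataset, then summarise, to every element of ids."""
    with ThreadPoolExecutor(threads_per_processor) as fetch_pool, \
         ThreadPoolExecutor(threads_per_processor) as summarise_pool:
        # Intra-processor parallelism: one task per list element.
        fetched = [fetch_pool.submit(fetch_dataset, i) for i in ids]
        # Inter-processor pipelining: a downstream task needs only its own
        # element's upstream result, not the whole upstream list.
        summarised = [
            summarise_pool.submit(lambda f=f: summarise(f.result()))
            for f in fetched
        ]
        # Collect results in the original list order.
        return [s.result() for s in summarised]

print(run_pipeline(["id1", "id2", "id3"]))
```

The output list keeps the order of the input list, which matches the idea of iterating implicitly over list data.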

Page 12: Paper presentation: Taverna, reloaded

Configurable processor behaviour

Default dispatch stack (drawn with Request and Response arrows):

• Parallelise: allocates concurrent threads to list elements
• Error bounce
• Failover: persistent server error: tries each available service implementation in turn
• Stop/pause
• Retry: transient errors: retries one service invocation
• Provenance: captures service invocation events
• Invoke: interacts with underlying activity
  – service operation
  – shell script execution
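The layering idea can be sketched in Python as nested wrappers around an invoke function. This is a rough illustration of the Retry and Failover behaviour described above, not the Taverna API: every name here (`invoke`, `with_retry`, `with_failover`, the two example services) is made up for the example.

```python
def invoke(service, job):
    """Invoke layer: call the underlying activity (here, a plain function)."""
    return service(job)

def with_retry(next_layer, max_attempts=3):
    """Retry layer: repeat one service invocation to ride out transient errors."""
    def layer(service, job):
        for attempt in range(1, max_attempts + 1):
            try:
                return next_layer(service, job)
            except Exception:
                if attempt == max_attempts:
                    raise  # give up and let the layer above decide
    return layer

def with_failover(next_layer, services):
    """Failover layer: try each available service implementation in turn."""
    def layer(job):
        errors = []
        for service in services:
            try:
                return next_layer(service, job)
            except Exception as err:
                errors.append(err)
        raise RuntimeError(f"all {len(services)} implementations failed: {errors}")
    return layer

# Usage with two made-up services: one permanently down, one flaky.
attempts = {"flaky": 0}

def dead_service(job):
    raise IOError("persistent server error")

def flaky_service(job):
    attempts["flaky"] += 1
    if attempts["flaky"] < 3:
        raise IOError("transient error")
    return job.upper()

stack = with_failover(with_retry(invoke, max_attempts=3),
                      [dead_service, flaky_service])
result = stack("abc")
print(result)
```

Because each layer only knows about the layer below it, a layer can be added, removed or reconfigured per processor without touching the others.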

Page 13: Paper presentation: Taverna, reloaded

Extensible processor semantics

Default dispatch stack: Parallelise, Error bounce, Failover, Stop/pause, Retry, Provenance, Invoke (with Request and Response arrows)

[Figure: (a) Default dispatch stack with layers Parallelise, Error bounce, Failover, Retry, Invoke; (b) Extended dispatch stack with the same layers plus Loop/Branching. Each stack is drawn with Request and Response arrows.]

Enhancing dataflow model with limited form of loops to model busy waiting (asynchronous service interaction)

[Figure: Busy waiting using an explicit workflow pattern]

[Figure: Busy waiting using a loop processor]
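As a rough illustration of the busy-waiting idea, the Python sketch below re-invokes an activity until a condition on its output holds. It is a generic polling loop under assumed names (`run_until`, `is_done`, the status strings), not the Taverna loop layer itself.

```python
import time

def run_until(activity, is_done, delay_seconds=0.0, max_iterations=100):
    """Re-invoke activity until is_done(output) holds, then return the output.

    The iteration cap is an assumption added so the sketch cannot spin forever.
    """
    for _ in range(max_iterations):
        output = activity()
        if is_done(output):
            return output
        time.sleep(delay_seconds)
    raise TimeoutError("condition not met within max_iterations polls")

# Usage: a made-up asynchronous job that reports DONE on the third status check.
statuses = iter(["RUNNING", "RUNNING", "DONE"])
final_status = run_until(lambda: next(statuses), lambda s: s == "DONE")
print(final_status)
```

Wrapping the loop around a single activity keeps the polling logic out of the workflow graph, which is the contrast the two figures draw.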

Page 17: Paper presentation: Taverna, reloaded

Performance experimental setup

• previous version of Taverna engine used as baseline
• objective: to measure incremental improvement

[Figure: benchmark workflow in which a list generator feeds multiple parallel pipelines]

Parameters:
- byte size of list elements (strings)
- size of input list
- length of linear chain

main insight: when the workflow is designed for pipelining, parallelism is exploited effectively
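To make the three parameters concrete, here is a small Python sketch of a synthetic workload driven by them. It is only an illustration, not the benchmark used in the talk: the class and function names are invented, the example values are assumptions, and the chain runs stage by stage without any pipelining.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class BenchmarkConfig:
    element_chars: int  # size of each list element (string length)
    list_size: int      # number of elements in the input list
    chain_length: int   # number of processors in the linear chain

def generate_list(cfg):
    """List generator: produce list_size strings of element_chars characters."""
    return ["x" * cfg.element_chars for _ in range(cfg.list_size)]

def run_chain(cfg, data, step):
    """Pass the list through chain_length copies of step, one stage at a time."""
    invocations = 0
    for _ in range(cfg.chain_length):
        data = [step(element) for element in data]
        invocations += len(data)
    return data, invocations

# Example values are assumptions chosen for illustration only.
cfg = BenchmarkConfig(element_chars=10_000, list_size=1_000, chain_length=5)
data = generate_list(cfg)
output, invocations = run_chain(cfg, data, step=lambda s: s)
print(len(output), invocations)
```

The number of service invocations grows as list size times chain length, which is why both parameters matter when measuring the engine.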

Page 18: Paper presentation: Taverna, reloaded

I - Memory usage

[Figure: memory usage plot with three series: T2 main memory data management; T2 main embedded Derby back-end; T1 baseline]

list size: 1,000 strings of 10K chars each
no intra-processor parallelism (1 thread/processor)

shorter execution time due to pipelining

Page 20: Paper presentation: Taverna, reloaded

Available processors pool

pipelining in T2 makes up for smaller pools of threads/processor

Page 21: Paper presentation: Taverna, reloaded

Bounded main memory usage

Separation of data and process spaces ensures scalable data management

varying data element size: 10K, 25K, 100K chars
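One way to picture the separation of data and process spaces is a store that holds the values while the workflow passes around only short references. The Python sketch below is an assumption-laden illustration: `DataStore` is a made-up class, and SQLite stands in for the embedded Derby back-end purely because it ships with Python.

```python
import sqlite3
import uuid

class DataStore:
    """Hypothetical data space: values live here, processors see references."""

    def __init__(self, path=":memory:"):
        # ":memory:" keeps the sketch self-contained; a file path would put
        # the values on disk instead of in main memory.
        self._db = sqlite3.connect(path)
        self._db.execute(
            "CREATE TABLE IF NOT EXISTS data (ref TEXT PRIMARY KEY, value TEXT)"
        )

    def put(self, value):
        """Store a value and return a short reference to it."""
        ref = uuid.uuid4().hex
        self._db.execute("INSERT INTO data VALUES (?, ?)", (ref, value))
        self._db.commit()
        return ref

    def get(self, ref):
        """Resolve a reference back to its value."""
        row = self._db.execute(
            "SELECT value FROM data WHERE ref = ?", (ref,)
        ).fetchone()
        if row is None:
            raise KeyError(ref)
        return row[0]

store = DataStore()
refs = [store.put("x" * size) for size in (10_000, 25_000, 100_000)]
# Every reference has the same small size, whatever the element size.
print([len(r) for r in refs])
```

In this picture the memory held by the process space depends on the number of references in flight, not on the size of the data elements.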

Page 22: Paper presentation: Taverna, reloaded

Summary

• Taverna 2 re-engineered for scalability on data-intensive pipelined dataflows
• separation of data and process spaces
• generic stack pattern for principled extensions
  – limited while loop, if-then-else
  – provenance capture
  – ... open plugin architecture accommodates further extensions

For further details and a nice chat: please visit our poster!!