Paper presentation: Taverna, reloaded
DESCRIPTION
Citation: Missier, P., Soiland-Reyes, S., Owen, S., Tan, W., Nenadic, A., Dunlop, I., et al. (2010). Taverna, reloaded. In M. Gertz, T. Hey, & B. Ludaescher (Eds.), Procs. SSDBM 2010. Heidelberg, Germany. Paper available at: http://www.ssdbm2010.org/.
TRANSCRIPT
SSDBM, Heidelberg, Germany, June 30 - July 2, 2010
Taverna, reloaded
Paolo Missier, Stian Soiland-Reyes, Stuart Owen, Alex Nenadic, Ian Dunlop, Alan Williams, Carole Goble
School of Computer Science, University of Manchester, UK
Wei Tan
Argonne National Laboratory, Argonne, IL, USA
Tom Oinn
EBI, UK
Taverna: scientific workflow management
2
http://www.myexperiment.org/workflows/746
Goal: perform cancer diagnosis using microarray analysis - learn a model for lymphoma type prediction based on samples from different lymphoma types (source: caGrid)
– lymphoma samples ➔ hybridization data
– process microarray data as training dataset
– learn predictive model
In this (short) talk...
3
• Taverna for data-intensive science
  – architectural goals... and how they are being achieved
• Performance figures on benchmark workflows
Target -ilities
4
• Scalability
  – size of input data / size of input collections
• Configurability
  – each processor in the workflow individually tunable to adjust to the underlying platform
• Extensibility
  – plugin architecture to accommodate specific service groups
    • caGrid, BioMoby
  – extensible set of operators that define the semantics of the model
Opportunities for parallel processing
5
• intra-processor: implicit iteration over list data
• inter-processor: pipelining
[figure: an input list [ id1, id2, id3, ... ] fans out over parallel invocations getDS1, getDS2, getDS3 and SFH1, SFH2, SFH3, producing the output list [ DS1, DS2, DS3, ... ]]
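As a rough sketch of the two kinds of parallelism named above (in Python with invented names; Taverna itself is Java, and `getDS`/`SFH` here are just stand-ins for the processors in the figure): each processor iterates over its list input with a thread pool (intra-processor), and forwards each result through a queue as soon as it is ready, so the downstream processor starts before the upstream one finishes (inter-processor pipelining).

```python
# Sketch only: hypothetical names, not Taverna's actual engine code.
import queue
import threading
from concurrent.futures import ThreadPoolExecutor

DONE = object()  # end-of-stream marker

def processor(fn, inbox, outbox, threads=3):
    """Intra-processor: implicitly iterate over a list input using a
    thread pool.  Inter-processor: forward each result downstream as
    soon as it is ready, instead of waiting for the whole list."""
    def run():
        with ThreadPoolExecutor(max_workers=threads) as pool:
            while (batch := inbox.get()) is not DONE:
                for result in pool.map(fn, batch):  # ordered results
                    outbox.put([result])            # stream singletons on
        outbox.put(DONE)
    threading.Thread(target=run, daemon=True).start()

# two chained processors, wired with queues
ids_q, ds_q, out_q = queue.Queue(), queue.Queue(), queue.Queue()
processor(lambda i: f"DS{i}", ids_q, ds_q)  # stand-in for getDS
processor(str.lower, ds_q, out_q)           # stand-in for SFH

ids_q.put([1, 2, 3])
ids_q.put(DONE)

results = []
while (r := out_q.get()) is not DONE:
    results.extend(r)
print(results)  # ['ds1', 'ds2', 'ds3']
```

The point of the singleton forwarding is exactly the pipelining opportunity on the slide: the second processor begins work on `DS1` while the first is still producing `DS2` and `DS3`.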
Configurable processor behaviour
Default dispatch stack (Request flows down the stack, Response flows back up):
• Parallelise - allocates concurrent threads to list elements
• Error bounce
• Failover - persistent server error: tries each available service implementation in turn
• Stop/pause
• Retry - transient errors: retries one service invocation
• Invoke - interacts with the underlying activity (service operation, shell script execution)
• Provenance - captures service invocation events
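A minimal sketch of the layered idea behind the dispatch stack (hypothetical Python names, not Taverna's Java API): each layer wraps the one below it, so Retry handles transient errors around a single invocation, while Failover falls back across alternative service implementations on persistent failure.

```python
# Sketch only: invented class names illustrating the layering, not
# Taverna's actual dispatch-stack implementation.
import time

class TransientError(Exception):
    """Stand-in for a recoverable service error."""

class Invoke:
    """Bottom layer: calls the underlying activity (a plain function here)."""
    def __init__(self, activity):
        self.activity = activity
    def dispatch(self, item):
        return self.activity(item)

class Retry:
    """Transient errors: retries one service invocation."""
    def __init__(self, below, attempts=3, delay=0.01):
        self.below, self.attempts, self.delay = below, attempts, delay
    def dispatch(self, item):
        for _ in range(self.attempts - 1):
            try:
                return self.below.dispatch(item)
            except TransientError:
                time.sleep(self.delay)
        return self.below.dispatch(item)  # last attempt propagates errors

class Failover:
    """Persistent error: tries each available service implementation in turn."""
    def __init__(self, stacks):
        self.stacks = stacks  # one sub-stack per service implementation
    def dispatch(self, item):
        last = None
        for stack in self.stacks:
            try:
                return stack.dispatch(item)
            except Exception as e:
                last = e
        raise last

# usage: a service that fails twice, then succeeds
flaky_calls = {"n": 0}
def flaky(x):
    flaky_calls["n"] += 1
    if flaky_calls["n"] < 3:
        raise TransientError("busy")
    return x * 2

stack = Failover([Retry(Invoke(flaky))])
print(stack.dispatch(21))  # 42, after two transient failures absorbed by Retry
```

Because every layer exposes the same `dispatch` interface, layers can be reordered, removed, or (as the next slide shows) extended with new ones such as Loop/Branching.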
Extensible processor semantics
7
(a) Default dispatch stack: Parallelise, Error bounce, Failover, Stop/pause, Retry, Invoke, with Provenance capturing the Request/Response flow
(b) Extended dispatch stack: Parallelise, Error bounce, Failover, Retry, Loop/Branching, Invoke
Enhancing the dataflow model with a limited form of loops to model busy waiting (asynchronous service interaction)
[figure: busy waiting using an explicit workflow pattern - a sub-workflow that repeatedly invokes a checkStatus operation and tests the returned status until it reports success]
[figure: busy waiting using a loop processor]
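The loop-processor variant above can be sketched as one extra dispatch-stack layer (hypothetical Python names; the `checkStatus`/`success` labels follow the figure, everything else is invented): the layer re-invokes an asynchronous service's status operation until a condition holds, so no explicit cycle is needed in the dataflow graph.

```python
# Sketch only: a "Loop" layer modelling busy waiting, not Taverna's
# actual Loop/Branching implementation.
import time

class Loop:
    def __init__(self, invoke, condition, interval=0.01, max_polls=100):
        self.invoke = invoke        # the layer below, e.g. checkStatus
        self.condition = condition  # when to stop looping
        self.interval = interval
        self.max_polls = max_polls
    def dispatch(self, job_id):
        for _ in range(self.max_polls):
            status = self.invoke(job_id)
            if self.condition(status):
                return status
            time.sleep(self.interval)
        raise TimeoutError(f"job {job_id} did not finish")

# toy asynchronous service: reports success after three polls
state = {"polls": 0}
def check_status(job_id):
    state["polls"] += 1
    return "success" if state["polls"] >= 3 else "running"

loop = Loop(check_status, lambda s: s == "success")
print(loop.dispatch("job-1"))  # success, after three status polls
```

Compared with the explicit workflow pattern in the other figure, the loop stays local to one processor's configuration, keeping the surrounding dataflow graph acyclic.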
Performance experimental setup
8
• previous version of the Taverna engine used as baseline
• objective: to measure incremental improvement
[figure: benchmark workflows with multiple parallel pipelines driven by a list generator. Parameters: byte size of list elements (strings), size of input list, length of linear chain]
Main insight: when the workflow is designed for pipelining, parallelism is exploited effectively
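A back-of-the-envelope model (my illustration, not from the paper) of why the benchmark parameters above matter: with a linear chain of k processors, each taking t time units per item over an n-item list, staged execution costs about n*k*t while a pipeline costs about (n + k - 1)*t once the stages overlap.

```python
# Toy latency model of staged vs. pipelined execution of a k-stage chain.
def staged_time(n, k, t):
    """Each stage waits for the full list before the next one starts."""
    return n * k * t

def pipelined_time(n, k, t):
    """Stages overlap after a fill of k - 1 items."""
    return (n + k - 1) * t

print(staged_time(1000, 4, 1))     # 4000 time units
print(pipelined_time(1000, 4, 1))  # 1003 time units
```

For long lists the pipeline approaches a k-fold speedup, which matches the slide's insight that workflows designed for pipelining exploit parallelism effectively.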
I - Memory usage
9
• T2, main-memory data management
• T2, embedded Derby back-end
• T1 baseline
List size: 1,000 strings of 10K chars each; no intra-processor parallelism (1 thread/processor)
Shorter execution time due to pipelining
Available processors pool
10
Pipelining in T2 makes up for smaller pools of threads per processor
Bounded main memory usage
11
Separation of data and process spaces ensures scalable data management
Varying data element size: 10K, 25K, 100K chars
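The "separation of data and process spaces" can be sketched as reference passing (hypothetical Python names; Taverna 2's actual store can be an embedded Derby back-end, whereas a dict stands in here): the engine only ever queues small references, and a processor resolves the large value just-in-time.

```python
# Sketch only: a toy data store illustrating reference passing, not
# Taverna's actual data-management layer.
import uuid

class DataStore:
    def __init__(self):
        self._values = {}           # could be a database table instead
    def register(self, value):
        ref = str(uuid.uuid4())     # small, fixed-size reference
        self._values[ref] = value
        return ref
    def resolve(self, ref):
        return self._values[ref]

store = DataStore()
# the benchmark's list: 1,000 strings of 10K chars each
refs = [store.register("x" * 10_000) for _ in range(1000)]
# the workflow engine queues only the ~36-byte refs between processors;
# a processor resolves a value only when it actually needs the bytes
first = store.resolve(refs[0])
print(len(first))  # 10000
```

Because the queues hold references rather than values, the engine's main-memory footprint stays bounded regardless of data element size, which is what the 10K/25K/100K experiment above measures.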
Summary
12
• Taverna 2 re-engineered for scalability on data-intensive pipelined dataflows
• separation of data and process spaces
• generic stack pattern for principled extensions
  – limited while loop, if-then-else
  – provenance capture
  – ... open plugin architecture accommodates further extensions
For further details and a nice chat: please visit our poster!!