Paper presentation: Taverna, reloaded
DESCRIPTION
Citation: Missier, P., Soiland-Reyes, S., Owen, S., Tan, W., Nenadic, A., Dunlop, I., et al. (2010). Taverna, reloaded. In M. Gertz, T. Hey, & B. Ludaescher (Eds.), Procs. SSDBM 2010. Heidelberg, Germany. Paper available at: http://www.ssdbm2010.org/.
TRANSCRIPT
SSDBM, Heidelberg, Germany, June 30 - July 2, 2010
Taverna, reloaded
Paolo Missier, Stian Soiland-Reyes, Stuart Owen, Alex Nenadic, Ian Dunlop, Alan Williams, Carole Goble
School of Computer Science, University of Manchester, UK
Wei Tan
Argonne National Laboratory, Argonne, IL, USA
Tom Oinn
EBI, UK
Taverna: scientific workflow management
2
http://www.myexperiment.org/workflows/746
Goal: perform cancer diagnosis using microarray analysis - learn a model for lymphoma type prediction based on samples from different lymphoma types (source: caGrid)
– lymphoma samples ➔ hybridization data
– process microarray data as training dataset
– learn predictive model
In this (short) talk...
3
• Taverna for data-intensive science
  – architectural goals... and how they are being achieved
• Performance figures on benchmark workflows
Target -ilities
4
• Scalability
  – size of input data / size of input collections
• Configurability
  – each processor in the workflow individually tunable to adjust to the underlying platform
• Extensibility
  – plugin architecture to accommodate specific service groups
    • caGrid, BioMoby
  – extensible set of operators that define the semantics of the model
Opportunities for parallel processing
5
• intra-processor: implicit iteration over list data
• inter-processor: pipelining
[figure: an input list [ id1, id2, id3, ... ] fans out over parallel invocations getDS1, getDS2, getDS3 and SFH1, SFH2, SFH3, producing the output list [ DS1, DS2, DS3, ... ]]
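As a rough sketch of the two kinds of parallelism named above (in Python with invented names; Taverna itself is Java, and `getDS`/`SFH` here are just stand-ins for the processors in the figure): each processor iterates over its list input with a thread pool (intra-processor), and forwards each result through a queue as soon as it is ready, so the downstream processor starts before the upstream one finishes (inter-processor pipelining).

```python
# Sketch only: hypothetical names, not Taverna's actual engine code.
import queue
import threading
from concurrent.futures import ThreadPoolExecutor

DONE = object()  # end-of-stream marker

def processor(fn, inbox, outbox, threads=3):
    """Intra-processor: implicitly iterate over a list input using a
    thread pool.  Inter-processor: forward each result downstream as
    soon as it is ready, instead of waiting for the whole list."""
    def run():
        with ThreadPoolExecutor(max_workers=threads) as pool:
            while (batch := inbox.get()) is not DONE:
                for result in pool.map(fn, batch):  # ordered results
                    outbox.put([result])            # stream singletons on
        outbox.put(DONE)
    threading.Thread(target=run, daemon=True).start()

# two chained processors, wired with queues
ids_q, ds_q, out_q = queue.Queue(), queue.Queue(), queue.Queue()
processor(lambda i: f"DS{i}", ids_q, ds_q)  # stand-in for getDS
processor(str.lower, ds_q, out_q)           # stand-in for SFH

ids_q.put([1, 2, 3])
ids_q.put(DONE)

results = []
while (r := out_q.get()) is not DONE:
    results.extend(r)
print(results)  # ['ds1', 'ds2', 'ds3']
```

The point of the singleton forwarding is exactly the pipelining opportunity on the slide: the second processor begins work on `DS1` while the first is still producing `DS2` and `DS3`.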
Configurable processor behaviour
Default dispatch stack (Request flows down the stack, Response flows back up):
• Parallelise - allocates concurrent threads to list elements
• Error bounce
• Failover - persistent server error: tries each available service implementation in turn
• Stop/pause
• Retry - transient errors: retries one service invocation
• Invoke - interacts with the underlying activity (service operation, shell script execution)
• Provenance - captures service invocation events
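A minimal sketch of the layered idea behind the dispatch stack (hypothetical Python names, not Taverna's Java API): each layer wraps the one below it, so Retry handles transient errors around a single invocation, while Failover falls back across alternative service implementations on persistent failure.

```python
# Sketch only: invented class names illustrating the layering, not
# Taverna's actual dispatch-stack implementation.
import time

class TransientError(Exception):
    """Stand-in for a recoverable service error."""

class Invoke:
    """Bottom layer: calls the underlying activity (a plain function here)."""
    def __init__(self, activity):
        self.activity = activity
    def dispatch(self, item):
        return self.activity(item)

class Retry:
    """Transient errors: retries one service invocation."""
    def __init__(self, below, attempts=3, delay=0.01):
        self.below, self.attempts, self.delay = below, attempts, delay
    def dispatch(self, item):
        for _ in range(self.attempts - 1):
            try:
                return self.below.dispatch(item)
            except TransientError:
                time.sleep(self.delay)
        return self.below.dispatch(item)  # last attempt propagates errors

class Failover:
    """Persistent error: tries each available service implementation in turn."""
    def __init__(self, stacks):
        self.stacks = stacks  # one sub-stack per service implementation
    def dispatch(self, item):
        last = None
        for stack in self.stacks:
            try:
                return stack.dispatch(item)
            except Exception as e:
                last = e
        raise last

# usage: a service that fails twice, then succeeds
flaky_calls = {"n": 0}
def flaky(x):
    flaky_calls["n"] += 1
    if flaky_calls["n"] < 3:
        raise TransientError("busy")
    return x * 2

stack = Failover([Retry(Invoke(flaky))])
print(stack.dispatch(21))  # 42, after two transient failures absorbed by Retry
```

Because every layer exposes the same `dispatch` interface, layers can be reordered, removed, or (as the next slide shows) extended with new ones such as Loop/Branching.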
Extensible processor semantics
7
(a) Default dispatch stack: Parallelise, Error bounce, Failover, Stop/pause, Retry, Invoke, with Provenance capturing the Request/Response flow
(b) Extended dispatch stack: Parallelise, Error bounce, Failover, Retry, Loop/Branching, Invoke
Enhancing the dataflow model with a limited form of loops to model busy waiting (asynchronous service interaction)
[figure: busy waiting using an explicit workflow pattern - a sub-workflow that repeatedly invokes a checkStatus operation and tests the returned status until it reports success]
[figure: busy waiting using a loop processor]
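The loop-processor variant above can be sketched as one extra dispatch-stack layer (hypothetical Python names; the `checkStatus`/`success` labels follow the figure, everything else is invented): the layer re-invokes an asynchronous service's status operation until a condition holds, so no explicit cycle is needed in the dataflow graph.

```python
# Sketch only: a "Loop" layer modelling busy waiting, not Taverna's
# actual Loop/Branching implementation.
import time

class Loop:
    def __init__(self, invoke, condition, interval=0.01, max_polls=100):
        self.invoke = invoke        # the layer below, e.g. checkStatus
        self.condition = condition  # when to stop looping
        self.interval = interval
        self.max_polls = max_polls
    def dispatch(self, job_id):
        for _ in range(self.max_polls):
            status = self.invoke(job_id)
            if self.condition(status):
                return status
            time.sleep(self.interval)
        raise TimeoutError(f"job {job_id} did not finish")

# toy asynchronous service: reports success after three polls
state = {"polls": 0}
def check_status(job_id):
    state["polls"] += 1
    return "success" if state["polls"] >= 3 else "running"

loop = Loop(check_status, lambda s: s == "success")
print(loop.dispatch("job-1"))  # success, after three status polls
```

Compared with the explicit workflow pattern in the other figure, the loop stays local to one processor's configuration, keeping the surrounding dataflow graph acyclic.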
Performance experimental setup
8
• previous version of the Taverna engine used as baseline
• objective: to measure incremental improvement
[figure: benchmark workflows with multiple parallel pipelines driven by a list generator. Parameters: byte size of list elements (strings), size of input list, length of linear chain]
Main insight: when the workflow is designed for pipelining, parallelism is exploited effectively
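A back-of-the-envelope model (my illustration, not from the paper) of why the benchmark parameters above matter: with a linear chain of k processors, each taking t time units per item over an n-item list, staged execution costs about n*k*t while a pipeline costs about (n + k - 1)*t once the stages overlap.

```python
# Toy latency model of staged vs. pipelined execution of a k-stage chain.
def staged_time(n, k, t):
    """Each stage waits for the full list before the next one starts."""
    return n * k * t

def pipelined_time(n, k, t):
    """Stages overlap after a fill of k - 1 items."""
    return (n + k - 1) * t

print(staged_time(1000, 4, 1))     # 4000 time units
print(pipelined_time(1000, 4, 1))  # 1003 time units
```

For long lists the pipeline approaches a k-fold speedup, which matches the slide's insight that workflows designed for pipelining exploit parallelism effectively.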
I - Memory usage
9
• T2, main-memory data management
• T2, embedded Derby back-end
• T1 baseline
List size: 1,000 strings of 10K chars each; no intra-processor parallelism (1 thread/processor)
Shorter execution time due to pipelining
Available processors pool
10
Pipelining in T2 makes up for smaller pools of threads per processor
Bounded main memory usage
11
Separation of data and process spaces ensures scalable data management
Varying data element size: 10K, 25K, 100K chars
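The "separation of data and process spaces" can be sketched as reference passing (hypothetical Python names; Taverna 2's actual store can be an embedded Derby back-end, whereas a dict stands in here): the engine only ever queues small references, and a processor resolves the large value just-in-time.

```python
# Sketch only: a toy data store illustrating reference passing, not
# Taverna's actual data-management layer.
import uuid

class DataStore:
    def __init__(self):
        self._values = {}           # could be a database table instead
    def register(self, value):
        ref = str(uuid.uuid4())     # small, fixed-size reference
        self._values[ref] = value
        return ref
    def resolve(self, ref):
        return self._values[ref]

store = DataStore()
# the benchmark's list: 1,000 strings of 10K chars each
refs = [store.register("x" * 10_000) for _ in range(1000)]
# the workflow engine queues only the ~36-byte refs between processors;
# a processor resolves a value only when it actually needs the bytes
first = store.resolve(refs[0])
print(len(first))  # 10000
```

Because the queues hold references rather than values, the engine's main-memory footprint stays bounded regardless of data element size, which is what the 10K/25K/100K experiment above measures.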
Summary
12
• Taverna 2 re-engineered for scalability on data-intensive pipelined dataflows
• separation of data and process spaces
• generic stack pattern for principled extensions
  – limited while loop, if-then-else
  – provenance capture
  – ... open plugin architecture accommodates further extensions
For further details and a nice chat: please visit our poster!!