
A Local-Optimization based Strategy for Cost-Effective Datasets Storage of Scientific Applications in the Cloud
(Many slides from the authors' presentation at CLOUD 2011)

Presenter: Guangdong Liu
Mar 13th, 2012
(Original presentation: Dec 8th, 2011)

Outline
- Introduction
- A Motivating Example
- Problem Analysis
- Important Concepts and Cost Model of Datasets Storage in the Cloud
- A Local-Optimization based Strategy for Cost-Effective Datasets Storage of Scientific Applications in the Cloud
- Evaluation and Simulation

Introduction
- Scientific applications
  - Computation and data intensive
  - Generated datasets: terabytes or even petabytes in size
  - Huge computation, e.g. scientific workflows
- Intermediate data are important!
  - Reused or reanalyzed
  - Shared between institutions
  - Regeneration vs. storing

Introduction
- Cloud computing
  - A new way to deploy scientific applications
  - Pay-as-you-go model
- Storing strategy
  - Which generated datasets should be stored?
  - Trade-off between cost and user preference
  - A cost-effective strategy is needed

A Motivating Example
- Parkes radio telescope and pulsar survey
- Pulsar searching workflow

A Motivating Example
- Current storage strategy: delete all intermediate data, due to storage limitations
- Some intermediate data should be stored; some need not be

Problem Analysis
- Which datasets should be stored?
  - Data challenge: data volumes double every year over the next decade and beyond [Szalay et al., Nature, 2006]
  - Different strategies correspond to different costs
  - Scientific workflows are very complex, and there are dependencies among datasets
  - Furthermore, a single scientist can no longer decide the storage status of a dataset alone
  - Data accessing delay matters
- Datasets should be stored based on the trade-off between computation cost and storage cost
- A cost-effective datasets storage strategy is needed

Important Concepts
- Data Dependency Graph (DDG)
  - A classification of application data: original data and generated data
  - Data provenance: a kind of metadata that records how data are generated
  - The DDG is built from data provenance

Important Concepts: Attributes of a Dataset in DDG
A dataset d_i in the DDG has the following attributes:
- x_i ($): the generation cost of dataset d_i from its direct predecessors
- y_i ($/t): the cost of storing dataset d_i in the system per time unit
- f_i (Boolean): a flag denoting whether dataset d_i is stored or deleted in the system
- v_i (Hz): the usage frequency, indicating how often d_i is used
- provSet_i: the set of stored provenance datasets that are needed when regenerating dataset d_i
- CostR_i ($/t): d_i's cost rate, i.e. the average cost per time unit of d_i in the system
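The interplay of these attributes can be sketched in a few lines of code. This is an illustrative sketch (the names and numbers are mine, not from the paper's implementation): a dataset's cost rate is its storage rate if it is stored, and otherwise its regeneration cost times its usage frequency.

```python
from dataclasses import dataclass

@dataclass
class Dataset:
    """One dataset d_i in a Data Dependency Graph (DDG)."""
    gen_cost: float    # x_i ($): generation cost from direct predecessors
    store_cost: float  # y_i ($/t): storage cost per time unit
    stored: bool       # f_i: True = stored, False = deleted
    freq: float        # v_i (Hz): how often d_i is used per time unit

def cost_rate(d: Dataset, regen_total: float) -> float:
    """CostR_i ($/t): average cost per time unit of d_i in the system.

    regen_total is the full cost of regenerating d_i, i.e. x_i plus the
    generation costs of any deleted predecessors (provSet_i determines
    which stored datasets regeneration can start from)."""
    return d.store_cost if d.stored else regen_total * d.freq

# A dataset used once per 100 time units that costs $100 to regenerate:
d = Dataset(gen_cost=100.0, store_cost=0.5, stored=False, freq=0.01)
print(cost_rate(d, regen_total=d.gen_cost))  # deleted: pays regeneration
d.stored = True
print(cost_rate(d, regen_total=d.gen_cost))  # stored: pays storage only
```

Here deleting costs 100 · 0.01 = 1.0 $/t while storing costs 0.5 $/t, so this particular dataset is cheaper to store; flipping `freq` or `store_cost` flips the decision.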

Total cost: Cost = C + S, where
- C: total cost of computation resources
- S: total cost of storage resources

Cost Model of Datasets Storage in the Cloud
- Total cost rate of a DDG under storage strategy S: Total_Cost_Rate = Σ_{d_i ∈ DDG} CostR_i
- For a DDG with n datasets, there are 2^n different storage strategies
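To make the 2^n blow-up concrete, here is a small brute-force sketch (illustrative numbers, assuming a linear DDG in which regenerating a deleted dataset also requires regenerating its deleted predecessors back to the nearest stored one):

```python
from itertools import product

# Hypothetical linear DDG d1 -> d2 -> d3 with attributes per dataset:
x = [10.0, 20.0, 30.0]   # x_i: generation costs ($)
y = [1.0, 0.2, 2.0]      # y_i: storage cost rates ($/t)
v = [0.05, 0.01, 0.02]   # v_i: usage frequencies (1/t)
n = len(x)

def total_cost_rate(strategy):
    """Sum of CostR_i under a storage strategy f = (f_1, ..., f_n)."""
    total = 0.0
    for i, stored in enumerate(strategy):
        if stored:
            total += y[i]
        else:
            # Regeneration must redo every deleted predecessor back to
            # the nearest stored dataset (or the original input data).
            regen, j = 0.0, i
            while j >= 0 and not strategy[j]:
                regen += x[j]
                j -= 1
            total += regen * v[i]
    return total

# 2^n strategies: exhaustive search is only feasible for tiny DDGs.
best = min(product([False, True], repeat=n), key=total_cost_rate)
print(best, total_cost_rate(best))
```

For these numbers the minimum-cost strategy stores only d2; with n in the hundreds, enumerating all 2^n strategies is hopeless, which is why the CTT-SP algorithm below matters.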

CTT-SP Algorithm
Goal: find the minimum cost storage strategy for a DDG.
Philosophy of the algorithm:
- Construct a Cost Transitive Tournament (CTT) based on the DDG
- In the CTT, the paths from the start to the end dataset map one-to-one to the storage strategies of the DDG
- The length of each path equals the total cost rate of the corresponding storage strategy
- The Shortest Path (SP) therefore represents the minimum cost storage strategy

CTT-SP Algorithm: Example
The weight of each cost edge <d_i, d_j> is d_j's storage cost rate plus the regeneration cost rates of the deleted datasets in between, e.g. w(d_s, d_1) = y_1, w(d_1, d_3) = x_2·v_2 + y_3, w(d_3, d_e) = 0.
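The idea can be sketched compactly for a linear DDG (the numbers are hypothetical; indices 0 and n+1 stand for the virtual start d_s and end d_e). Each edge <d_i, d_j> assumes d_j is stored and everything strictly between d_i and d_j is deleted; a shortest path from d_s to d_e then selects the minimum-cost storage strategy:

```python
x = [10.0, 20.0, 30.0]   # x_i: generation costs ($)
y = [1.0, 0.2, 2.0]      # y_i: storage cost rates ($/t)
v = [0.05, 0.01, 0.02]   # v_i: usage frequencies (1/t)
n = len(x)

def edge_weight(i, j):
    """Weight of cost edge <d_i, d_j> in the CTT (datasets are 1-based;
    i = 0 is d_s, j = n + 1 is d_e): d_j's storage rate plus the
    regeneration cost rates of the deleted datasets d_{i+1}..d_{j-1}."""
    w = y[j - 1] if j <= n else 0.0              # d_e stores nothing
    for k in range(i + 1, j if j <= n else n + 1):
        w += sum(x[i:k]) * v[k - 1]              # genCost(d_k) * v_k
    return w

# Shortest path d_s -> d_e; the CTT is acyclic with forward edges only,
# so a single left-to-right pass of dynamic programming suffices.
INF = float("inf")
dist = [0.0] + [INF] * (n + 1)
for j in range(1, n + 2):
    dist[j] = min(dist[i] + edge_weight(i, j) for i in range(j))
print(dist[n + 1])  # minimum total cost rate of the DDG
```

On these values the shortest path is d_s → d_2 → d_e, i.e. store only d2, matching what brute force over all 2^n strategies would find, while only O(n^2) edges are examined.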

A Local-Optimization based Datasets Storage Strategy
Requirements of the storage strategy:
- Efficiency and scalability
  - The strategy is used at runtime in the cloud, and the DDG may be large
  - The strategy itself consumes computation resources
- Reflect users' preferences and data accessing delay
  - Users may want to store some datasets
  - Users have a certain tolerance of data accessing delay

A Local-Optimization based Datasets Storage Strategy
Two new attributes of the datasets in the DDG represent users' accessing-delay tolerance:
- T_i: a duration of time that denotes users' tolerance of dataset d_i's accessing delay
- λ_i: a parameter between 0 and 1 that denotes users' cost-related tolerance of dataset d_i's accessing delay


A Local-Optimization based Datasets Storage Strategy
Efficiency and scalability:
- A general DDG is very complex; the computation complexity of the CTT-SP algorithm on it is O(n^9), which is neither efficient nor scalable enough for large DDGs
- Partition the large DDG into small linear segments
- Apply the CTT-SP algorithm to the linear DDG segments, which guarantees a localized optimum
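The local-optimization step can be sketched as follows (the data are hypothetical, and the partitioning rule itself, which cuts a general DDG at branching datasets, is not shown): run the linear CTT-SP solver on each segment independently and sum the per-segment optima.

```python
def min_cost_rate(x, y, v):
    """Linear CTT-SP as a shortest-path DP over one linear segment
    d_1..d_n; returns the segment's minimum total cost rate."""
    n = len(x)

    def edge_weight(i, j):
        # Edge <d_i, d_j>: d_j stored, d_{i+1}..d_{j-1} deleted;
        # i = 0 is the virtual start, j = n + 1 the virtual end.
        w = y[j - 1] if j <= n else 0.0
        for k in range(i + 1, j if j <= n else n + 1):
            w += sum(x[i:k]) * v[k - 1]
        return w

    dist = [0.0] + [float("inf")] * (n + 1)
    for j in range(1, n + 2):
        dist[j] = min(dist[i] + edge_weight(i, j) for i in range(j))
    return dist[n + 1]

# Two linear segments of a partitioned DDG (illustrative numbers):
segments = [
    ([10.0, 20.0, 30.0], [1.0, 0.2, 2.0], [0.05, 0.01, 0.02]),
    ([5.0], [0.3], [0.1]),
]
# Localized optimum: the sum of each segment's own minimum cost rate.
total = sum(min_cost_rate(*seg) for seg in segments)
print(total)
```

Because each segment is optimized in isolation, the result is a localized rather than global optimum, which is exactly the trade-off the slide describes: a small loss of optimality in exchange for runtime efficiency on large DDGs.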

Evaluation
Simulation on randomly generated DDGs:
- Size: randomly distributed from 100 GB to 1 TB
- Generation time: randomly distributed from 1 hour to 10 hours
- Usage frequency: time between usages randomly distributed from 1 day to 10 days
- Users' delay tolerance (T_i): randomly distributed from 10 hours to one day
- Cost parameter (λ_i): randomly distributed from 0.7 to 1 for every dataset in the DDG
Amazon cloud services price model (EC2 + S3):
- $0.15 per GB per month for storage resources
- $0.10 per CPU hour for computation resources
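These parameters translate into dollar rates roughly as follows (a sketch: the variable names are mine, and converting the monthly S3 price to an hourly rate via a 30-day month is my assumption, not stated on the slide):

```python
import random

random.seed(42)                 # reproducible draws
S3_GB_MONTH = 0.15              # $ per GB per month (storage)
EC2_CPU_HOUR = 0.10             # $ per CPU hour (computation)
HOURS_PER_MONTH = 30 * 24

def random_dataset():
    """One dataset drawn from the distributions used in the simulation."""
    size_gb = random.uniform(100, 1000)     # 100 GB .. 1 TB
    gen_hours = random.uniform(1, 10)       # generation time (h)
    usage_gap_days = random.uniform(1, 10)  # time between usages (days)
    return {
        "x": gen_hours * EC2_CPU_HOUR,                 # $ to regenerate
        "y": size_gb * S3_GB_MONTH / HOURS_PER_MONTH,  # $/h to store
        "v": 1.0 / (usage_gap_days * 24),              # usages per hour
        "T": random.uniform(10, 24),                   # delay tolerance (h)
        "lam": random.uniform(0.7, 1.0),               # cost tolerance λ_i
    }

ddg = [random_dataset() for _ in range(200)]

# Rough sanity check: storing everything vs. a per-dataset greedy choice
# (store d_i only if its storage rate beats its regeneration rate).
store_all = sum(d["y"] for d in ddg)
greedy = sum(min(d["y"], d["x"] * d["v"]) for d in ddg)
print(round(store_all, 2), round(greedy, 2))
```

Even this naive per-dataset comparison already beats store-everything; the paper's strategy additionally accounts for dependencies between datasets and the T_i / λ_i tolerances.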

Evaluation
Compare the proposed strategy with different storage strategies:
- Usage based strategy
- Generation cost based strategy
- Cost rate based strategy
[Evaluation result charts are not included in the transcript.]

© 2007 The Board of Regents of the University of Nebraska. All rights reserved.

Thanks

[Figure: the pulsar searching workflow and its datasets d1–d8, with sizes and generation times. Labels include: raw beam data (20 GB), extracted & compressed beam (90 GB), de-dispersion files (90 GB), accelerated de-dispersion files, seek results files (16 MB), candidate list (1 KB), XML files (25 KB); the steps beam, de-disperse, accelerate, and candidates; and generation times of 1, 27, 80, 245, 300, and 790 mins.]

[Figure: a linear DDG d1 → d2 → d3 with attributes (x_1, y_1, v_1), (x_2, y_2, v_2), (x_3, y_3, v_3); example storage strategies S1: f1=1, f2=0, f3=0 and S2: f1=0, f2=0, f3=1; and the corresponding CTT with virtual start d_s and end d_e. Cost-edge weights: w(d_s,d_1)=y_1, w(d_1,d_2)=y_2, w(d_2,d_3)=y_3, w(d_s,d_2)=x_1·v_1+y_2, w(d_1,d_3)=x_2·v_2+y_3, w(d_s,d_3)=x_1·v_1+(x_1+x_2)·v_2+y_3, w(d_1,d_e)=x_2·v_2+(x_2+x_3)·v_3, w(d_2,d_e)=x_3·v_3, w(d_3,d_e)=0, w(d_s,d_e)=x_1·v_1+(x_1+x_2)·v_2+(x_1+x_2+x_3)·v_3.]

[Figure: a general DDG partitioned at partitioning point datasets into four linear segments, Linear DDG1–Linear DDG4.]