
A Self-adaptive Auto-scaling Method for Scientific Applications on HPC Environments and Clouds

Kiran Mantripragada
IBM Research - Brazil
São Paulo, Brazil
[email protected]

Alécio P. D. Binotto
IBM Research - Brazil
São Paulo, Brazil
[email protected]

Leonardo P. Tizzei
IBM Research - Brazil
São Paulo, Brazil
[email protected]

ABSTRACT
Highly intensive computational applications can take days to months to finish an execution. During this time, it is common to have variations in the available resources, since such hardware is usually shared among several researchers and departments within an organization. On the other hand, High Performance Computing clusters can take advantage of Cloud bursting techniques to execute applications together with on-premise resources. In order to meet deadlines, compute-intensive applications can use the Cloud to boost their performance when they are data and task parallel. This article presents ongoing work towards the use of extended resources of an HPC execution platform together with the Cloud. We propose a unified view of such heterogeneous environments and a method that monitors and predicts the application execution time, and dynamically shifts part of the domain, previously running on local HPC hardware, to be computed in the Cloud, thus meeting a specific deadline. The method is exemplified with a seismic application that, at runtime, adapts itself to move part of the processing to the Cloud (a movement called bursting) and also auto-scales the moved part over cloud nodes. Our preliminary results show that there is an expected overhead for performing this movement and for synchronizing results, but the outcomes demonstrate that it is an important feature for meeting deadlines when an on-premise cluster is overloaded or cannot provide the capacity needed for a particular project.

Categories and Subject Descriptors
Computer systems organization [Architectures]: Distributed architectures—Cloud computing; Theory of computation [Design and analysis of algorithms]: Parallel algorithms—Self-organization.

Keywords
Cloud computing, Self-adaptation, Load-balancing, Cloud bursting, Dynamic federation to cloud.

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author. Copyright is held by the owner/author(s).

ADAPT'15, Jan 21, 2015, Amsterdam, Netherlands. Part of the ADAPT Workshop proceedings, 2015 (arXiv:1412.2347). ADAPT/2015/01.


1. INTRODUCTION
High Performance Computing (HPC) platforms are designed to be exploited by compute-intensive applications to obtain performance, such as seismic analysis (for oil and gas exploration and natural disasters), weather and flooding forecasts, computational fluid dynamics, DNA sequencing, and simulations of electromagnetic equations, among others. They are dedicated to providing their maximum throughput, and both applications and resources are usually fine-tuned to perform as optimally as possible for their specific challenges. Companies, government laboratories, and research/educational institutions acquire such HPC resources built on clusters of supercomputers, which are typically expensive to acquire and maintain. However, there is a considerable amount of time in which such environments can stay idle, while at other times they can be overloaded or even have capacity limitations that, together, prevent a particular project or simulation deadline from being met.

Cloud computing, on the other hand, proposes a resource-shared model in which users allocate resources on demand from providers only when necessary, in a "pay-as-you-run" model, spending budget only for the utilized processing time. Cloud was initially created to serve mostly Web applications deployed on Virtual Machines (VMs) that share the same physical hardware with other VMs at the same time. As it matured, this model was coined Public Cloud, and another model emerged to deal with private data and also HPC requirements: the Private Cloud. It focuses on hosting compute-intensive industrial and scientific workloads that deal with sensitive and large amounts of data, on demand, but in a private and specialized space. This latter model thus has specific requirements that are challenging for Cloud providers to meet, such as the use of specialized resources (e.g., accelerators, GPUs, InfiniBand, parallel file systems), transfer and storage of large data sets, visualization, security and privacy, control, and a collaboration infrastructure. Currently, providers are starting to offer Private Clouds with such specialized resources, and this movement can eventually allow a hybrid execution, i.e., the use of on-premise HPC clusters together with Clouds to boost the execution platform for a certain period and to allow particular projects to meet their deadlines.

We propose a method to partition tasks and data of an application across both environments in a unified view. We use a seismic application as a use case. This particular application reflects the characteristics of a broad range of


scientific and industrial applications: it is computationally demanding, data and task parallel, communication bound, uses solvers for differential equations, produces large amounts of data to be visualized as post-processing, etc. Our approach is able to monitor the application at runtime and to predict its total execution time so that it can perform a self-adaptation towards Cloud bursting, i.e., automatically shift part of the data from a cluster to be processed in the Cloud, in a unified and synchronized way, with the goal of meeting a time deadline.

This article is organized as follows: Section 2 describes the proposed approach, emphasizing how the method performs the self-adaptation, auto-scaling, and automatic bursting, i.e., how the application adapts its execution on both platforms at runtime; Section 3 exemplifies the approach using a seismic application and shows preliminary outcomes with a discussion; Section 4 presents related work; and Section 5 concludes the proposal.

2. DYNAMIC AND SELF-ADAPTIVE CLOUD BURSTING

Let us consider an application that is data and task parallel, and which also evolves over time. Initially, the user starts the application with specific parameters: data (domain size), tasks, number of timesteps, and the partitions of the data to be executed on-premise and in the Cloud (the data can initially be placed fully on-premise). The system immediately starts to monitor the execution time of each timestep. The simulation continues to execute until the system detects that the threshold time, i.e., the deadline, will not be satisfied. This situation can happen for a number of reasons, such as concurrency in the local cluster, nodes going down, a dynamic change of the deadline, or insufficient on-premise capacity for a given problem. Once the "time monitor" detects a change in the estimated execution time, the system verifies this new estimate against the deadline provided by the user. It is important to note that the initial sequential timesteps are monitored in order to reason about whether they are predictable. In most cases, these industrial and scientific applications are simulations that evolve over time by solving a number of partial differential equations represented as matrix operations. Given these characteristics, the timesteps tend to have similar execution times while working on distinct data and are, thus, fairly predictable.
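The monitoring step can be sketched as follows (a minimal Python sketch with hypothetical names, not the actual implementation): it records the wall time of each completed timestep, extrapolates the total run time under the assumption that timesteps have similar cost, and flags the deadline as at risk when the extrapolation exceeds it.

class TimestepMonitor:
    """Tracks per-timestep wall time and extrapolates the total run time."""

    def __init__(self, total_timesteps, deadline_seconds):
        self.total = total_timesteps
        self.deadline = deadline_seconds
        self.durations = []

    def record(self, seconds):
        # Called once per completed timestep with its measured wall time.
        self.durations.append(seconds)

    def estimated_total(self):
        # Timesteps are assumed to have similar cost (see text), so the mean
        # observed duration is extrapolated to all timesteps.
        mean = sum(self.durations) / len(self.durations)
        return mean * self.total

    def deadline_at_risk(self):
        # No measurement yet means no reason to burst.
        return bool(self.durations) and self.estimated_total() > self.deadline

In the simulation loop, record() would be called after each timestep and deadline_at_risk() consulted before starting the next one.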

If the monitoring module detects that the system will not meet the time threshold, the estimated time is used to compute the number of cores needed to fit the simulation within the deadline. This is done using a pre-processing phase to estimate the behavior of the application, or it could also be an input parameter given by the user. Then, the system automatically starts the re-partitioning phase to migrate part of the domain to be computed by the elastic Cloud platform. Figure 1 and the following steps depict the proposed solution (a control-loop sketch is given after the list):

1. Monitoring the application and analyzing the estimated execution time to decide about a Cloud bursting;

2. If a bursting is needed, save current state;

3. Compute the number of cores to be allocated in the Cloud extended environment;


4. Compute the size of the domain (part of the full domain) to be placed in the Cloud;

5. Move current state data to new nodes (Cloud-burst);

6. Assimilate current state as initial conditions;

7. Restart the simulation at the stopped step with the new configuration (on-premise cluster and Cloud);

8. Synchronize the simulation at each timestep running in both environments and merge results.

Figure 1: Steps of the self-adaptive method
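Read as a control loop, the steps above could look like the following Python sketch; the monitor, simulation, model, and cloud objects and their methods are hypothetical placeholders for the corresponding mechanisms, not the paper's implementation.

def adapt_if_needed(monitor, simulation, model, cloud, deadline, cluster_cores):
    # Step 1: monitor and decide whether a Cloud bursting is needed.
    if not monitor.deadline_at_risk():
        return

    checkpoint = simulation.save_state()                  # Step 2
    c = model.cores_for_deadline(deadline)                # Step 3 (Equation 2)
    c_n = model.cloud_cores(c, cluster_cores)             # Step 3 (Equation 3)
    excess = monitor.estimated_total() - deadline
    gamma = model.columns_for_cloud(excess)               # Step 4 (Equation 5)

    nodes = cloud.provision(cores=c_n)                    # allocate Cloud nodes
    nodes.receive_state(checkpoint, columns=gamma)        # Step 5: Cloud-burst
    nodes.assimilate_initial_conditions(checkpoint)       # Step 6
    simulation.restart(split=gamma, extra_nodes=nodes)    # Step 7
    # Step 8 happens inside the simulation loop: every timestep is
    # synchronized across both environments and partial results are merged.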

In order to compute the estimated deadline and for validation purposes, we empirically characterized the computational behavior of the application. By executing it for several core and node configurations, a function that represents the overall behavior was estimated:


L_{cloud}(c) = -A \ln c + B    (1)

L_{cluster}(c) = -D \ln c + E    (2)

Equations 1 and 2 represent the application response as a function of the number of cores, for the Cloud environment and for the on-premise cluster platform, respectively. The coefficients A, B, D and E were empirically computed for the experimental application using a small test job (small data set and few timesteps) as pre-processing (see Section 3). The number of cores needed to meet the time threshold would satisfy only the cluster environment; since part of the domain may be placed in a Cloud environment, one must consider the network delays for migrating the data and restarting the application, as well as the reduced computing performance inherent to a Cloud environment when compared to the cluster. The correction factor for the difference between the performances (cluster vs. Cloud) can be defined as follows:

K = \frac{L_{cloud}(c)}{L_{cluster}(c)} = \frac{-A \ln c + B}{-D \ln c + E}

where K is the computational performance correction factor and A, B, D and E are the coefficients computed for a given application (Section 3).

Once the system has computed the new number of cores required to meet the deadline, the correction factor can be applied:

c_n = (c - c_{cluster}) \cdot K    (3)

where c_n is the number of cores in the extended environment, c is the estimated number of cores (Equation 2), and c_{cluster} is the number of cores available in the on-premise cluster.
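Equations 1 to 3 translate directly into code; a Python sketch is given below, where rounding the result of Equation 3 up to whole cores is our addition.

import math

def l_cloud(c, A, B):
    """Equation 1: modeled application response in the Cloud for c cores."""
    return -A * math.log(c) + B

def l_cluster(c, D, E):
    """Equation 2: modeled response in the on-premise cluster for c cores."""
    return -D * math.log(c) + E

def correction_factor(c, A, B, D, E):
    """K = L_cloud(c) / L_cluster(c), the cluster-vs-Cloud correction."""
    return l_cloud(c, A, B) / l_cluster(c, D, E)

def cloud_cores(c, c_cluster, K):
    """Equation 3: cores to request in the extended (Cloud) environment."""
    return math.ceil((c - c_cluster) * K)  # rounded up to whole cores (our choice)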

Figure 2: Domain split for Cloud-bursting (γ is the size to be placed in a Cloud)

The next step is to split the domain into partitions to be executed in the hybrid environment.

Table 1: Hardware configuration of the Cluster and Cloud environments

                        Cluster               Cloud
processor (GHz)         2.80                  2.60
cores per processor     10                    4
cache size (KB)         25600                 20480
total memory (KB)       132127868             65711672
network                 ethernet/infiniband   ethernet

In order to simplify the method, one of the two domain dimensions (width and height) was fixed. As depicted in Figure 2, we chose to fix the height dimension and compute the value of γ as a function of execution time. That approach allows us to define a linear relationship between the execution time t and the domain size through the value of γ:

f(\gamma) = t = a\gamma + b    (4)

where the linear coefficient a and the offset b must be computed for each application. γ must be an integer value, since it represents the number of column partitions to be placed in the extended environment.
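The coefficients a and b can be obtained, for instance, with an ordinary least-squares fit of measured execution times against γ; the following sketch uses NumPy and placeholder measurements (not the paper's data).

import numpy as np

# (gamma, execution time in seconds) pairs measured in a pre-processing run;
# placeholder values chosen to be roughly consistent with Equation 8.
gammas = np.array([100.0, 200.0, 300.0, 400.0, 500.0])
times = np.array([980.0, 1720.0, 2460.0, 3210.0, 3960.0])

a, b = np.polyfit(gammas, times, 1)  # Equation 4: t = a * gamma + b
print(f"a = {a:.2f}, b = {b:.2f}")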

Equation 4 can be inverted to express γ as a function of the time t:

g(t) = f^{-1}(t) = \gamma = \frac{t - b}{a}    (5)

The difference between the estimated total time and the deadline is applied in Equation 5 to compute the size of the partial domain (γ) to be executed by the Cloud environment.
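Equation 5 as code (a sketch; clamping at zero and rounding up are our additions, reflecting that γ is a non-negative integer number of columns):

import math

def columns_for_cloud(excess_time, a, b):
    """Equation 5: column partitions to offload, given the excess time
    (estimated total time minus the deadline) and the Equation 4 fit."""
    gamma = (excess_time - b) / a
    return max(0, math.ceil(gamma))  # non-negative integer (our choice)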

3. EMPIRICAL STUDY
The goal of this study is to analyze the previously described approach for the purpose of creating a proof-of-concept model based on the application characteristics. The analysis was carried out from the perspective of developers of cloud-bursting scientific applications, targeting a seismic application. The study involves two computing environments: an on-premise cluster environment and a Cloud computing environment. The environments have different hardware (e.g., memory, CPU) and software configurations (e.g., operating system). The Cloud environment is the IBM SoftLayer platform (http://www.softlayer.com/). Table 1 shows the hardware configuration of both environments. We picked the cloud node configuration offer that was most similar to our local cluster for a fairer comparison.

3.1 Target application: FWI
Full Waveform Inversion (FWI) is a numerically challenging technique based on full wavefield modeling of a geological domain that extracts relevant quantitative information from seismograms [7]. When dealing with a realistic physics-based elastic partial differential equation formulation and accurate discretization techniques, such as the high-order Spectral Element Method, the forward modeling becomes particularly challenging from a computational point of view. This type of modeling often requires grid representations of



the order of billions of elements, with interpolation polynomials of at least 4th degree in each spatial dimension, which entails hundreds of billions of variables computed by the forward solver. Furthermore, in the course of local optimization, the aforementioned process or its adjoint is repeated multiple times per non-linear inversion iteration. The computational complexity associated with FWI applications demands tremendous effort to set up and maintain a distributed scientific HPC infrastructure. These applications are also embedded in complex business models with various stakeholders. Therefore, having FWI applications offered as a service in the Cloud can bring several benefits to their users.

Summarizing, millions of shots at different positions (i.e., independent tasks) are calculated over the same domain (i.e., data), and each shot solution is composed of several traces, evolving along timesteps. At the end, the generated images are composed to produce the final outcome. This process is depicted in Figure 3. Above, multiple traces are produced per shot by the acquisition ship, which moves over an area of exploration. The ship carries the compressed air gun along with an array of receivers that record the wave reflections (seismograms) produced by the different subsurface geological layers. Below, a produced timestep is visualized from different points of view. It represents a partial result of the reconstructed wave using our implementation of FWI over a small region of interest of the area that contains a specific salt dome (we used the SEG/EAGE salt dome model).

Figure 3: Examples of FWI data collection and remote visualization of the forward solver

This particular implementation (C++) is parallelized by means of OpenMPI (over data and tasks) and uses the Eigen3 library (http://eigen.tuxfamily.org/) for matrix and vector manipulation, as well as ParaView (http://www.paraview.org/) to provide (remote) visualization. The application was executed several times in each environment, and Table 2 presents the details of the executions.

Table 2: Execution details

number of elements (x axis)    600
number of elements (y axis)    600
polynomial order (x axis)      4
polynomial order (y axis)      4
degrees of freedom             5764801
domain size (x axis)           9000
domain size (y axis)           6000
number of timesteps            3000

3.2 Empirical Formulations
To validate our assumptions, we performed experiments on the target application. The first equations were empirically determined from the results acquired in these simulations. The chart in Figure 4 shows that both curves have similar behavior, but as the number of processor cores decreases to 10, the elapsed time of the FWI simulation in the Cloud was about 150% longer (10^5.4 vs. 10^5.0) than in the on-premise cluster environment. On the other hand, as the number of processor cores increases up to 40, the elapsed time in the SoftLayer environment was only around 50% longer (10^4.3 vs. 10^4.1) than in the cluster environment, tending to become the same as the number of cores increases. Furthermore, it is worth mentioning that these preliminary tests were executed with only a small number of cores and nodes to verify the behavior of the Cloud while running a highly CPU- and I/O-intensive application.

From Figure 4, we instantiated Equations 1 and 2 as follows:

L_{cloud}(c) = -0.77 \ln c + 7.1    (6)

L_{cluster}(c) = -0.65 \ln c + 6.5    (7)

g(t) = \gamma = \frac{t - 231.18}{7.46}    (8)

By solving Equations 6 to 8 for c, we can dynamically compute the value of c_n in Equation 3 and of γ in Equation 8. Those values are, respectively, the number of cores and the size of the domain placed in the Cloud environment. After applying the movement, the system is capable of executing the simulation and satisfying the deadline. This process is repeated whenever the system detects that the deadline would not be met for any reason that delays the execution.
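For illustration, the fitted coefficients of Equations 6 to 8 can be plugged into the earlier sketches; the core counts and excess time below are hypothetical examples, not measured values.

import math

A, B = 0.77, 7.1      # Cloud fit, Equation 6
D, E = 0.65, 6.5      # cluster fit, Equation 7
a, b = 7.46, 231.18   # domain-split fit, Equation 8

c = 60           # hypothetical core count required to meet the deadline (Equation 2)
c_cluster = 40   # hypothetical cores actually available on-premise
K = (-A * math.log(c) + B) / (-D * math.log(c) + E)
c_n = math.ceil((c - c_cluster) * K)           # Equation 3

excess = 600.0   # hypothetical seconds beyond the deadline
gamma = max(0, math.ceil((excess - b) / a))    # Equation 8

print(f"K = {K:.2f}, cloud cores = {c_n}, columns to offload = {gamma}")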

3.3 Discussion


Figure 4: Execution time as a function of the number of processors

In this research, we propose a method that allows part of a domain to be dynamically moved to Cloud environments. In addition, running the numerical solver in a hybrid infrastructure (cluster and a private Cloud) showed interesting behavior when we forced the local HPC cluster to exhaust its limited resources. Besides application performance gains, the proposed self-adaptive method reduces the "time to production" of the infrastructure, since there is no need to wait for the acquisition, installation, and setup processes that buying new local resources would require; i.e., the balanced co-existence of both platforms can bring better price-performance to such applications.

We performed the simulations over a small number of configurations and load-balancing strategies. The graph in Figure 5 shows the application response for several grid sizes while holding one dimension constant. Such an experiment allowed us to understand how the domain-split strategy would work when migrating part of the domain to an external environment. This study, of execution time when varying processor cores, allowed us to understand how to provision the Cloud cores for a given application size and deadline. We then observed that the inverse process would also be an interesting way to understand how the execution time extrapolates to a specific number of cores. The self-adaptive strategy is based on Equations 1 and 2, which allow us to determine the few parameters needed to burst the application towards the Cloud. Those parameters, viz. γ and c_n, are sufficient to define the external infrastructure.

The overhead of the monitoring and partitioning components is negligible. Results show that the total message size is only 21 KBytes. The second-level partitioning strategy is based on stripes, which reduces the amount of communication between partitions and between environments. It is important to note that the chosen load-balancing method over cores inside an environment is a simple greedy algorithm, where the striped partitions are assigned to processor cores as long as cores are available in each node. This choice reduces the inter-node communication, since neighboring partitions tend to be assigned to the same cluster node.

Figure 5: Empirical results for execution time (seconds) as a function of γ

In our case study, shots are independent tasks and traces have dependencies inside a shot.
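The greedy stripe-to-core assignment described above can be sketched as follows (the node descriptions and the wrap-around behavior are our assumptions): contiguous stripes fill one node's cores before the next node is used, so neighboring stripes usually share a node.

def assign_stripes(num_stripes, nodes):
    """Greedy assignment: nodes is a list of (node_id, num_cores) tuples.
    Returns a dict mapping stripe index -> (node_id, core_index)."""
    slots = [(node_id, core) for node_id, cores in nodes for core in range(cores)]
    assignment = {}
    for stripe in range(num_stripes):
        # Wrap around if there are more stripes than cores (our simplification).
        assignment[stripe] = slots[stripe % len(slots)]
    return assignment

# Example: 12 stripes over one 10-core cluster node and one 4-core cloud node.
layout = assign_stripes(12, [("cluster-0", 10), ("cloud-0", 4)])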

On the other hand, the overhead caused by saving the current state (checkpoint), transferring the corresponding data to the Cloud, and provisioning the Cloud nodes needs to be accounted for. These costs are not negligible and need to be considerably shorter than letting the application finish its execution on the local cluster. Although we still need to analyze these measurements carefully, in a seismic application, which usually takes months to produce the final result, such an automatic checkpoint-restart process is cost-effective, as the total execution time of the application is inferred at the beginning of execution. We understand that this assumption can cause an inaccurate estimation, and we are extending this work to measure and include such overhead as an offset value in Equations 1 and 2.

Although the system has the potential to be elastic, in the sense that the infrastructure can expand and shrink as needed, the current approach only computes and applies expansions in the Cloud environment. Further studies and advances of this system could also consider reducing the number of processors when the deadline threshold can be achieved with less infrastructure.

Finally, the proposed method can be incorporated into a framework to be reused for different applications. For that, one needs to investigate the application's behavior before specifying the coefficient values described in Section 2. Considering that we simplified the mesh partitioning by fixing one of the dimensions, we are planning to extend this work by investigating full 2D mesh partitioning and also full 3D numerical solvers.

4. RELATED WORK
High Performance Computing applications are being tested on Cloud platforms. Studies such as [2], [3] and [4] evaluated the performance of a set of benchmarks and complex HPC applications on a range of platforms, from supercomputers to clusters, both in-house and in the cloud. These studies show that a Cloud can be effective for such applications, mainly when complementing supercomputers using models such as cloud bursting and application-aware mapping to achieve significant cost benefits.


Although these findings do not propose an automatic and adaptive approach for using both environments, their empirical studies opened the opportunity for tools that promote a hybrid approach based on these environments, like ours.

Analyzing the Cloud as a stand-alone execution platform for HPC applications, such as seismic ones, the authors of [5] evaluated the Linpack workload on the Amazon EC2 cloud. Their conclusions indicate that the tested cloud environment has potential, but is not yet mature enough to provide adequate price-performance for HPC applications. [6] also evaluated EC2 for a number of kernels used by HPC applications, likewise concluding that such cloud services need an order-of-magnitude performance improvement to better serve the scientific community. It is hard to compare one provider against another, but they are evolving towards offerings that are private and equipped with specialized infrastructure, such as GPUs and InfiniBand. In our study, we evaluated the SoftLayer Cloud (virtualized nodes), and preliminary results indicate that the environment is cost-effective in budget and performance, at least when combined with on-premise clusters in a dynamically changing scenario.

More recently, [1] evaluated a computational fluid dynamics application over a heterogeneous environment composed of a cluster and the EC2 cloud. The results indicated that there is a need to adjust the CPU power (configuration) and workload by means of load-balancing. We are in line with that study and went further with the present work: a dynamic self-adaptive method for application load-balancing over a hybrid platform composed of cluster and cloud.

5. CONCLUSION
We presented a first step towards a framework for the self-adaptation of industrial and scientific applications executed on a hybrid and heterogeneous environment composed of on-premise HPC clusters and the Cloud. It dynamically monitors and reasons about when to migrate part of the computation from one platform to the other at runtime, in order to maximize performance and meet deadline constraints. We demonstrated the core method applied to a seismic application, which is data and task parallel and reflects the behavior of a broad range of industrial and scientific applications. The method is based on a self-adaptation of the application at runtime that shifts part of the computation to the Cloud when the predicted total execution time is identified as exceeding the given execution deadline (which can also change dynamically). The approach prevents additional expensive capital expenditure and has an intrinsic overhead, but due to the on-demand elasticity of cloud node provisioning, the amount of nodes needed to achieve the deadline can be aggregated.

Ongoing work focuses on refining the tool, including overhead detection, and on shaping it into a framework to be incorporated into IBM's SoftLayer Private and Public Clouds and/or the Bluemix platform (https://www.bluemix.net), targeting FWI as a service. Moreover, the roadmap includes working with data from a real seismic exploration field acquisition and, therefore, a more robust cluster capable of processing the problem. Also, we intend to proceed with the


analysis of nodes composed of CPUs and GPUs, thus including another level of heterogeneity inside the execution nodes (i.e., dynamically scheduling over the processors, such as CPU and GPU, as reported in [8]).

Finally, the approach presented here might at first sight seem like a counter-intuitive idea: replacing a straightforward embarrassingly parallel implementation of a challenging real-world problem with a more sophisticated, self-adaptive, auto-scaled, synchronized, communication-intensive, domain-partitioned approach. However, we have shown that a seamless combination of on-premise clusters with the Cloud makes an important contribution to boosting the performance of compute-intensive HPC applications with better price-performance and time-to-production.

6. ACKNOWLEDGMENTS
This work has been partially supported by FINEP/MCTI under grant no. 03.14.0062.00.

7. REFERENCES
[1] M. B. Belgacem and B. Chopard. A hybrid HPC/cloud distributed infrastructure: Coupling EC2 cloud resources with HPC clusters to run large tightly coupled multiscale applications. Future Generation Computer Systems, 42(0):11-21, 2015.
[2] A. Gupta, L. V. Kale, F. Gioachin, V. March, C. H. Suen, B.-S. Lee, P. Faraboschi, R. Kaufmann, and D. Milojicic. The who, what, why and how of high performance computing applications in the cloud. In Proceedings of the IEEE International Conference on Cloud Computing Technology and Science (CloudCom), 2013.
[3] A. Gupta and D. Milojicic. Evaluation of HPC applications on cloud. In Sixth Open Cirrus Summit (OCS), 2011.
[4] G. Mateescu, W. Gentzsch, and C. J. Ribbens. Hybrid Computing - Where HPC meets grid and Cloud Computing. Future Generation Computer Systems, 27(5), 2011.
[5] J. Napper and P. Bientinesi. Can cloud computing reach the Top500? In Proceedings of the Combined Workshops on UnConventional High Performance Computing Workshop plus Memory Access Workshop, 2009.
[6] S. Ostermann, A. Iosup, N. Yigitbasi, R. Prodan, T. Fahringer, and D. Epema. A performance analysis of EC2 cloud computing services for scientific computing. In International Conference on Cloud Computing (CloudComp'09), 2010.
[7] J. Virieux and S. Operto. An overview of full-waveform inversion in exploration geophysics. Geophysics, 74(6):WCC1-WCC26, 2009.
[8] A. Binotto, M. Wehrmeister, A. Kuijper, and C. Pereira. Sm@rtConfig: A context-aware runtime and tuning system using an aspect-oriented approach for data intensive engineering applications. Control Engineering Practice, 21(2):204-217, 2013.